I was thinking the other day about the best way to find out how many times a word appears in a string, and it amazed me how many different ways this can be done. So I decided to test this: my intention was to implement a variety of ways to count the number of occurrences of all words in a string and, from that, to establish which method was the most efficient. Using C#, I implemented four different approaches and timed the results. Each method was run ten times and the results averaged.
Each method was tested over a variety of different input strings to determine which method was best for which circumstance. As a baseline, each input string had 1,036,512 words. This is the word count of the first text file used, from http://norvig.com/big.txt, and is a reasonable benchmark size that doesn't take too long to run.
Grouping
Grouping items with a corresponding count has many uses, and there are a variety of ways it can be implemented. It is similar in spirit to a SQL GROUP BY clause combined with a COUNT() function, as in the following query:
SELECT word, COUNT(word) FROM table GROUP BY word
This shows that two components are required: the word and the count. Each method was therefore required to return an array of objects, each holding a word and a count. This led to a common method signature for every grouping technique: it takes a string as input and returns an array of objects containing an int Count and a string Word:
public class WordCount
{
    public string Word;
    public int Count;
}
And the method signature:
WordCount[] RunGrouping(string input);
Each implementation splits the string into an array of words, starts the clock, groups the words, stops the clock, and returns the results to the calling class.
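The harness that runs and averages each method is not shown in the post; below is a minimal sketch of what it might look like, assuming the RunGrouping delegate shape described above. Stopwatch is used here for its finer resolution, although the listings in this post time themselves with DateTime.Now.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing harness (a sketch, not the post's exact code):
// runs a grouping delegate a number of times and averages the elapsed
// time, which matches the "run ten times and average" methodology above.
public static class TimingHarness
{
    public static TimeSpan AverageDuration<T>(Func<string, T> runGrouping,
                                              string input, int runs = 10)
    {
        var sw = new Stopwatch();
        long totalTicks = 0;
        for (int i = 0; i < runs; i++)
        {
            sw.Restart();
            runGrouping(input);          // result discarded; only timing matters
            sw.Stop();
            totalTicks += sw.ElapsedTicks;
        }
        return TimeSpan.FromTicks(totalTicks / runs);
    }
}
```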
The four procedures used consisted of:
- A linear count of items using two nested foreach loops to count all words
- A sort of the words then a count
- A count using a Dictionary
- A count using a LINQ query
The string passed to each method had had all non-alpha characters removed and was converted to uppercase, so that words were not incorrectly treated as distinct due to punctuation or case sensitivity.
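The cleaning step itself is not shown in the post; below is a minimal sketch of one plausible normalisation, assuming a Regex-based approach (the exact preprocessing the tests used may have differed):

```csharp
using System;
using System.Text.RegularExpressions;

// Sketch of the preprocessing described above: replace everything except
// letters and spaces with a space, collapse runs of spaces, and upper-case
// the result so that "Word," and "word" group together.
public static class Normalizer
{
    public static string Normalize(string raw)
    {
        string lettersOnly = Regex.Replace(raw, "[^A-Za-z ]", " ");
        string singleSpaced = Regex.Replace(lettersOnly, " +", " ").Trim();
        return singleSpaced.ToUpperInvariant();
    }
}
```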
Linear Count
The linear count used a List<WordCount> to hold the items and iterated through each word in the input string. Each word was then searched for in the list; if it was present, the count for the corresponding entry was incremented. If the inner loop completed without finding the word, the word was added to the list with a count of 1. The list was then converted to an array and the method completed. The name is a slight misnomer, as the behaviour is not really linear; it was named this because there is one pass over the list for each word in the input. The code used is shown below:
public TimeSpan Duration { get; set; }

public WordCount[] RunGrouping(string input)
{
    string[] words = input.Split(' ');
    List<WordCount> list = new List<WordCount>();
    WordCount[] counted;
    DateTime start = DateTime.Now;
    foreach (string word in words)
    {
        bool found = false; // reset for every word
        foreach (WordCount w in list)
        {
            if (w.Word == word)
            {
                w.Count++;
                found = true;
                break;
            }
        }
        if (!found)
            list.Add(new WordCount { Word = word, Count = 1 });
    }
    counted = list.ToArray();
    DateTime end = DateTime.Now;
    Duration = end - start;
    return counted;
}
The efficiency of this method is evidently quite bad, as there are two nested loops and the inner loop grows as more distinct words are added to the list. The Big O notation for this is O(N^2).
Sort of words and then a count
The second procedure sorted all the words in the array and then counted how many times each word appeared. The logic is slightly simpler, as there is only one foreach loop, but a temp variable is needed to detect when a new word is encountered. The code for this method was:
public TimeSpan Duration { get; set; }

public WordCount[] RunGrouping(string input)
{
    string[] words = input.Split(' ');
    List<WordCount> list = new List<WordCount>();
    WordCount[] counted;
    int count = 0;
    DateTime start = DateTime.Now;
    Array.Sort(words);
    string tempVar = words[0];
    foreach (string word in words)
    {
        if (word == tempVar)
        {
            count++;
        }
        else
        {
            list.Add(new WordCount { Word = tempVar, Count = count });
            tempVar = word;
            count = 1;
        }
    }
    list.Add(new WordCount { Word = tempVar, Count = count });
    counted = list.ToArray();
    DateTime end = DateTime.Now;
    Duration = end - start;
    return counted;
}
The efficiency of this method is not O(1), despite there being only one loop: the counting pass itself is linear, O(N), but the Array.Sort call at the start dominates at O(N log N), making the method O(N log N) overall. The sort is a built-in .NET routine based on Quicksort (an introspective sort in later framework versions).
Dictionary Grouping
The Dictionary method uses a Dictionary<string, int>. This works because there will only ever be one instance of each word in the returned array, so the dictionary Key can be the word and the Value the word count. The efficiency of this method is O(N), and it is simpler to understand than the previous two. The code used is shown below.
public TimeSpan Duration { get; set; }

public WordCount[] RunGrouping(string input)
{
    string[] words = input.Split(' ');
    Dictionary<string, int> dict = new Dictionary<string, int>();
    int i = 0;
    WordCount[] counted;
    DateTime start = DateTime.Now;
    foreach (string word in words)
    {
        if (!dict.ContainsKey(word))
            dict.Add(word, 1);
        else
            dict[word]++;
    }
    counted = new WordCount[dict.Count];
    foreach (var pair in dict)
        counted[i++] = new WordCount { Word = pair.Key, Count = pair.Value };
    DateTime end = DateTime.Now;
    Duration = end - start;
    return counted;
}
At first glance the efficiency might look worse than O(N), because dict.ContainsKey() calls a private system method, int FindEntry(TKey key), internally. However, FindEntry only walks the short chain of entries in the bucket matching the word's hash code, not the whole collection, so each lookup is amortised O(1) and the method remains O(N) overall.
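As a side note, the ContainsKey-then-indexer pattern performs two hash lookups per word. A common variant, sketched below (not the code that was measured), does the same grouping with a single TryGetValue lookup per word:

```csharp
using System;
using System.Collections.Generic;

// Variant of the dictionary grouping using one TryGetValue lookup per
// word instead of ContainsKey followed by the indexer. The behaviour is
// identical; this is a well-known micro-optimisation, shown here only
// for comparison with the listing above.
public static class DictionaryGrouping
{
    public static Dictionary<string, int> Count(string[] words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            counts.TryGetValue(word, out int current); // current is 0 if absent
            counts[word] = current + 1;
        }
        return counts;
    }
}
```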
LINQ Grouping
This method arguably has the easiest program flow to follow, as the syntax is similar to a SQL GROUP BY statement, but it is the most difficult in terms of understanding what it actually does under the hood (which probably warrants a separate post). The code used is shown below:
public TimeSpan Duration { get; set; }

public WordCount[] RunGrouping(string input)
{
    string[] words = input.Split(' ');
    WordCount[] counted;
    DateTime start = DateTime.Now;
    counted = words
        .GroupBy(w => w)
        .Select(o => new WordCount { Word = o.Key, Count = o.Count() })
        .ToArray();
    DateTime end = DateTime.Now;
    Duration = end - start;
    return counted;
}
The efficiency of this method is difficult to judge from the code alone. Internally, LINQ-to-Objects' GroupBy builds a hash-based lookup in a single pass over the words, so its behaviour is roughly O(N), much like the Dictionary approach, though with different constant factors.
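To make the one-liner more concrete, the query can be broken into steps and run on a tiny input. This sketch uses tuples instead of the WordCount class purely so that it is self-contained; the shape of the query is otherwise the same:

```csharp
using System;
using System.Linq;

// The GroupBy query from above, with the pipeline written out step by
// step: one group is produced per distinct word, and each group is
// projected to a (word, count) pair.
public static class LinqGroupingDemo
{
    public static (string Word, int Count)[] Count(string[] words)
    {
        return words
            .GroupBy(w => w)                       // one group per distinct word
            .Select(g => (Word: g.Key, Count: g.Count()))
            .ToArray();
    }
}
```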
Results from the Testing:
As mentioned, each grouping method was run ten times
over the original word list from
http://norvig.com/big.txt
to determine the most efficient grouping routine. The results are
graphed below:
This shows that the linear count is dramatically less efficient than the other three counts, and so it was removed from the remaining comparisons. Although the linear count performs poorly here, it did show better results when run against a list of one-letter words (as shown below), but it was still excluded from subsequent tests.
The results at this point appeared to be that the Linq
grouping was the most efficient, followed by the Dictionary Count, the Sort and
Count and lastly the Linear Count.
It was then decided to test the Linq and Dictionary groupings exclusively over different scenarios, as these had shown the best efficiency over the widest range so far, to establish which was better in which circumstances. Firstly, random words of varying length were tested, with the number of possible combinations ranging from 26^1 to 26^10; the combination counts are shown below. Each text file remained at 1,036,512 words so that comparisons could be made with the other results. It is important to note, however, that as the test input only had 1,036,512 words to begin with, only combinations up to 26^5 (11,881,376 possible words) can be considered relevant.
As the number of combinations increased, the number of
duplicates decreased and so it was estimated that the search time to determine
whether the item existed already in the grouping routine would also increase.
The results obtained are shown below:
From this it can be seen that for combinations up to about 3
letters, or 17576 different combinations of words, the Linq grouping and the
Dictionary grouping are similar in efficiency. When the number of combinations
increases beyond that, it would appear that the Dictionary grouping is more
efficient than the Linq grouping.
A final test was done to determine the range over which this occurred. Batches of four-letter words were created with batch sizes from 1 to 100,000 distinct words, keeping the total at 1,036,512 words, and the duration was plotted for various values of interest. For example, a batch size of 1 was one word repeated 1,036,512 times; a batch size of 100 was 100 words each repeated 10,365.12 times on average (100 * 10,365.12 = 1,036,512 words). The results are shown below:
This graph uses a logarithmic scale and shows that when there are between 1 and 1,000 repeated words the Linq query is more efficient. Between 1,000 and 20,000, either method gives similar results, although it must be said that the Dictionary grouping in this range gives more consistent results, with less variation in duration. Above 20,000 repeated words, the Dictionary count is evidently more efficient. This is further outlined in the graph below.
Conclusion:
From the preceding observations, the following conclusions
could be justified:
- On average, both the Linq grouping and the Dictionary
grouping are more efficient than the Linear Count and the Sort then Count approaches.
Comparing the Linq and Dictionary approaches:
- The Linq grouping is better when there are fewer than 1,000 distinct four-letter words
- The Dictionary grouping and Linq grouping give similar results when there are between 1,000 and 20,000 distinct four-letter words
- The Dictionary grouping is better when there are more than 20,000 distinct four-letter words
- The Dictionary grouping has less variation in execution time overall.
In practice, this suggests using the Dictionary approach when grouping larger sets (more than 20,000 distinct items) such as primary keys or GUIDs, and the Linq approach when doing many groupings of small sets (fewer than 1,000 distinct items).
A future post may investigate words of more or fewer than four letters, as well as profiling the Dictionary and Linq methods to determine where the performance hits are in each at different counts of repeating words.