Points of Note: What is the best way to count the number of occurrences of all words in a string, and which method is most efficient?

I was thinking the other day about what the best way to find out how many times a word appears in a string and what the most efficient way of doing this is and it amazed me that there are a variety of ways that this can be done. And so I thought I would test this to find out which way is best. In doing so, my intention was to determine a variety of ways to count the number of occurrences of all words in a string and from that to establish which method was the most efficient. Using C#, I have implemented four different ways of doing this and timed the results. Each method was done ten times and the average taken of the results.

Each method was tested over a variety of different input strings to determine which method was best for which circumstance. As a basis, each input string had 1036512 words. This was because the first text file used from http://norvig.com/big.txt was that number of words and this is a reasonable benchmark size that doesn't take too long to run.

Grouping

The grouping of items and a corresponding count has many uses and there are a variety of ways this can be implemented. This is similar in a sense to the GROUP BY statement in a SQL statement using a COUNT() function that would be implemented by the following SQL query:

Select word, count(word) from table group by word

This shows that there are two components required for a query. The word and the count. As a basis of each method a requirement was to return an array of objects that each had a word and a count. This lead to a method signature used for each grouping technique being defined such that it took a string as input and returned an array of objects that contained an int Count and a string Word.

Such that:

public class WordCount
{
    public string Word;
    public int Count;
}

And method signature:

WordCount [] RunGrouping(string input);

Which would run the method and time the duration of the grouping. Each method would split the string into a string array of words, the clock would start, the words grouped, the clock stopped and the results returned to the calling class.

The four procedures used consisted of:

A linear count of items using two nested foreach loops to count all words
A sort of the words then a count
A count using a Dictionary
A count using a LINQ query

The string passed to the method to do the work had had all non-alpha characters removed and was in uppercase so that words were not incorrectly unique due to extra characters or case sensitivity.

Linear Count

The linear count of items used a List<WordCount> to hold the items and iterated through each word in the original string passed to it. Each word was then iterated through the List and if it was present in the list, the count for the corresponding word was increased. If the inner loop completes without finding the word, the word was added to the list with a count of 1. The list was then converted to an array and the method completed. Although not very linear, it was named this as there is one pass for each word in the original list. The code used is shown below:

public WordCount[] RunGrouping(string input)
{
        string[] words = input.Split(' ');
        List<WordCount> list = new List<WordCount>();
        bool found = false;
        WordCount[] counted;
        DateTime start = DateTime.Now;

        foreach (string word in words)
        {
            foreach (WordCount w in list)
            {
                if (w.Word == word)
                {
                    w.Count++;
                    found = true;
                    break;
                }
                else
                    found = false;
            }

            if (!found)
            {
                list.Add(new WordCount { Word = word, Count = 1 });
                found = true;
            }
        }

        counted = list.ToArray<WordCount>();
        DateTime end = DateTime.Now;
        Duration = end - start;
        return counted;
}

The efficiency of this query is evidently quite bad as there are two loops and the inner loop is increasing as more words get added to the list. The Big O notation for this was O(N^2)

Sort of words and then a count

The second procedure tried was a sort of all the words in an array and then a count of how many of each word appeared. The logic behind this method is slightly easier as there is only one foreach loop but a temp variable is used when a new item in the list is encountered. The code for this query was:

public TimeSpan Duration {get; set;}

public WordCount[] RunGrouping(string input)
{
        string[] words = input.Split(' ');
        List<WordCount> list = new List<WordCount>();
        WordCount[] counted;        
        int count = 0;
        DateTime start = DateTime.Now;

        Array.Sort(words);  
        string tempVar = words[0];
      
        foreach(string word in words)
        {
            if (word == tempVar)
            {
                count++;
            }
            else
            {
                list.Add(new WordCount { Word = tempVar, Count = count });
                tempVar = word;
                count = 1;
            }
        }

        list.Add(new WordCount{Word = tempVar, Count = count});

        counted = list.ToArray<WordCount>();
        DateTime end = DateTime.Now;
        Duration = end - start;
        return counted;
}

The efficiency of this query in Big O notation can be seen as O(1) as there is only one for loop instead of two and the iteration is linear. The sorting of the array at the start is a .NET function that is based on the Quicksort routine.

Dictionary Grouping

The Dictionary method uses a Dictionary object of Dictionary<string, int> and can be used as there will only ever be one instance of a word in the returned array, and so the dictionary Key can be the word and the Value the word count. The efficiency of this query is O(N) and is simpler to understand than the previous two methods. The code used is shown below.

public TimeSpan Duration {get; set;}

public WordCount[] RunGrouping(string input)
{
        string[] words = input.Split(' ');
        Dictionary<string, int> dict = new Dictionary<string, int>();            
        int i = 0;
        WordCount[] counted;
        DateTime start = DateTime.Now;

        foreach (string word in words)
        {
            if (!dict.ContainsKey(word))
                dict.Add(word, 1);
            else
                dict[word]++;
        }                       

        counted = new WordCount[dict.Count];           

        foreach (var pair in dict)
            counted[i++] = new WordCount { Word = pair.Key, Count = pair.Value };

        DateTime end = DateTime.Now;
        Duration = end - start;          
        return counted;           
}

The efficiency of this query could be considered O(N^2) due to the foreach loop and the fact the dict.ContainsKey() method has a call to a private system method int FindEntry(TKey key) that iterates over the collection to find a match.

LINQ Grouping

This method is arguably the easiest program flow to understand as the syntax is similar to a SQL statement GROUP BY but is the most difficult to understand what it actually does (and probably requires a separate post to discuss this). The code used is shown below:

public TimeSpan Duration {get; set;}

public WordCount [] RunGrouping(string input)
{                       
        string[] words = input.Split(' ');
        WordCount[] counted;
        DateTime start = DateTime.Now;
           
        counted = words.GroupBy(w => w).Select(o => new WordCount { Word = o.Key, Count = o.Count() }).ToArray<WordCount>();

        DateTime end = DateTime.Now;
        Duration = end - start;
        return counted;
}

The efficiency of this query is difficult to understand by looking at the code and so will not be given an O notation.

Results from the Testing:

As mentioned, each grouping method was run ten times over the original word list from http://norvig.com/big.txt to determine the most efficient grouping routine. The results are graphed below:

This shows that the linear count is dramatically less efficient than the other three counts and so was removed from the remaining comparisons. Although the linear count does show poor performance here, it did show better results when run against a list of 1 letter words (as shown below) but was still not included for future tests.

The results at this point appeared to be that the Linq grouping was the most efficient, followed by the Dictionary Count, the Sort and Count and lastly the Linear Count.

It was then decided to test the efficiency of the Linq and the Dictionary groupings exclusively for different scenarios to establish which was the better for different circumstances as these had the best efficiency over the most ranges so far. Firstly, random words of varying length were tested with combinations between 26^1 to 26^10. The number of combinations of words are shown below. Each text file used remained at the 1036512 words so that a comparison could be done to other results. It is important to note, however, that as the test input only had 1026512 words to begin with, only combinations up to 26^5 (11881376 words) can be considered relevant.

26^1 = 26 combinations

26^2 = 676

26^3 = 17576

26^4 = 456976

26^5 = 11881376

26^6 = 308915776

26^7 = 8031810176

26^8 = 208827064576

26^9= 5429503678976

26^10 = 141167095653376

As the number of combinations increased, the number of duplicates decreased and so it was estimated that the search time to determine whether the item existed already in the grouping routine would also increase. The results obtained are shown below:

From this it can be seen that for combinations up to about 3 letters, or 17576 different combinations of words, the Linq grouping and the Dictionary grouping are similar in efficiency. When the number of combinations increases beyond that, it would appear that the Dictionary grouping is more efficient than the Linq grouping.

A final test was done to determine the range and extent that this occurred for. Groups of similar words for four letter words were created in batches of size repeating from 1 to 100000 words and this was plotted for various values of interest. For example, a batch size of 1 repeating was one word repeating 1036512 times. A batch size of 100 was 100 words repeating 10365.12 times (100 * 10365.12 = 1036512 words). The results are shown below:

This table is logarithmic and points out that when there are between 1 – 1000 repeated words the Linq query is more efficient. However, between 1000 – 20000, either method will give similar results; although it must be said that the Dictionary grouping in this range gives more consistent results with less variation in duration. After 20000 repeated words, the Dictionary count is evidently more efficient. This is further outlined in the graph below.

Conclusion:

From the preceding observations, the following conclusions could be justified:

On average, both the Linq grouping and the Dictionary grouping are more efficient than the Linear Count and the Sort then Count approaches.

In comparison of the Linq and Dictionary approaches,

The Linq grouping is better for words where there is less than 1000 different four letter words
The Dictionary grouping and Linq grouping give similar results where there is between 1000 and 20000 different four letter words
The Dictionary grouping is better when there are more than 20000 different four letter words
The Dictionary grouping has less variation in the execution time overall.

The results are useful to take into account if grouping larger sets (>20000) of items such as primary keys or GUIDs to use the Dictionary approach. As well, when doing a lot of groupings of small sets (< 1000) the Linq approach works best.

A future post may investigate words of more or less than four letters as well as the profiling of the Dictionary and Linq Methods to determine where the performance hits are in each at different counts of words repeating.

Points of Note

Sunday, November 4, 2012

What is the best way to count the number of occurrences of all words in a string, and which method is most efficient?