Sunday, June 7, 2009

Anagrams Part 2 – Binary Search

One of the problems with the previous algorithm (see Anagrams Part 1) is that it performed a linear search of the word list to find each permutation. That is, if there are 38000 words then the worst case will be 38000 comparisons. And for anagrams most cases will be worst cases. That’s fine if the word list is a random jumble, but most word lists are stored in alphabetical order. That means we can choose a much better search algorithm – the binary search.

Binary search works with sorted data. Imagine if you had a phone book and wanted to find if “Flubberblubber” is in the book. You wouldn’t start at the beginning and scan through to the end would you? No, you’d probably open the book roughly halfway through. You might see that all the names start with “M”. You’d then open the book again roughly halfway between the start and the “M” page. You might hit names that start with “H”. So again you’d open the book halfway to “H”. This time you hit “E”. Now you’d open the book halfway between “E” and “H”. See how you are homing in on the desired “F” pages after only 3 lookups?

In fact, if the phone book contained 1 million numbers you’d get to the desired name (or find it didn’t exist) within 20 lookups if you followed this method! So a worst case of 20 compared to 1,000,000 using linear search. Even better, binary search scales remarkably well. If the phone book contained 1 billion numbers, the number of lookups would only increase to about 30. Now that’s impressive. It should be obvious that the reason this works is that we can discard 50% of the remaining pages of the phone book every time we do a lookup. In fact binary search is an O(log n) algorithm, while linear search is O(n).

So does the C++ Standard Library contain this useful facility? Of course it does! There is a binary_search() function in the <algorithm> header that does exactly what we want.

Simply replace the following lines from the original program:

vector<string>::const_iterator iter = find(dict.begin(), dict.end(), str);
if (iter != dict.end())
    // we have found an anagram so print it
    cout << *iter << " ";

with:

if (binary_search(dict.begin(), dict.end(), str))
    // we have found an anagram so print it
    cout << str << " ";

Not only is it faster it’s also a bit clearer (no iterators). Running the program to find anagrams of integral results in a near instantaneous result, as opposed to 21 secs with linear search.

So is this the best we can do? Not really. Running the program on a 16 character string still takes minutes to return an answer.

We’ve solved the problem of searching but unfortunately we are still generating in the order of n! permutations of the input string. This is now the bottleneck, so in the next part we’ll look at reducing that.

No comments:

Post a Comment