Friday, June 19, 2009

Anagrams Part 3 – Using a Map

So how can we speed up our anagram algorithm? Well, how about we restructure the input word list so that it groups all the anagrams together for us? This would be a one-off operation at the start of the program. From then on we would find all the anagrams of a word with a single lookup.

Hmmm how can we do that?

Well, all anagrams of a particular word are made up of the same letters. If we sort the letters of each anagram into lexicographical order we get the same ordered sequence of characters every time. For example the following three words:

part trap rapt

all sort to the same sequence: aprt. Let’s call this the key, and the three original words the values.
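
In C++ we can derive that key by sorting the characters of a std::string with std::sort. A tiny sketch (word is just an assumed variable holding one of the words above):

string key = word;               // e.g. "trap"
sort(key.begin(), key.end());    // key is now "aprt"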

What we need is a data structure that can store the key and values like this:

aprt => part trap rapt

This can be represented nicely in C++ as follows:

string => vector<string>

We can use a map to store this:

map<string, vector<string> >

We build the map by reading each word from the word list, sorting its letters, then using the sorted string as the key and appending the original (unsorted) word to the vector associated with that key. We will end up with an in-memory representation like this:

aprt => part trap rapt
act => act cat
aeprs => spare pares

The code explains this much better than I can; here is the complete program:

#include <iostream>
#include <hash_map>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>
#include <iterator>		// for ostream_iterator
#include <cstdlib>		// for exit

using namespace std;
using stdext::hash_map;

typedef vector<string> value_type;
typedef hash_map<string, value_type> Dict;
typedef Dict::const_iterator dict_iter;

int main()
{	
	Dict dict;
	int word_count = 0;
	
	ifstream in_file("words.txt");
	if (!in_file) {
		cerr << "Input file not found" << endl;
		exit(1);
	}
	
	string s, sorted_s;
	// read a word, sort the chars then use it as a key to the hash table. Append the original
	// unsorted word to the vector that is the value.
	while (in_file >> s) {
		sorted_s = s;
		sort(sorted_s.begin(), sorted_s.end());
		dict[sorted_s].push_back(s);
		word_count++;
	}
	
	cout << word_count << " words read" << endl;
	
	string str;
	while (str != "q") {
		cout << "\nEnter letters (q to quit): ";
		cin >> str;
		sort(str.begin(), str.end());
		dict_iter iter = dict.find(str);		// find the sorted string in the hash table
		if (iter != dict.end()) {				// if it exists...
			const value_type& vec = iter->second;	// ...grab a reference to the vector
			cout << vec.size() << (vec.size() == 1 ? " anagram -> " : " anagrams -> ");
			// output the anagrams
			copy (vec.begin(), vec.end(), ostream_iterator<string>(cout, " "));
			cout << endl;
		}
		else 
			if (str != "q") cout << "word not found" << endl;
	}
	return 0;
}

You can see the map is built in only four lines of C++ – the body of the while loop that reads the words. To keep things interesting I have used a hash_map instead of a plain std::map. This is a drop-in replacement for std::map offering potentially better performance. Note that it lives in the stdext namespace if you use Visual Studio.
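
If your compiler doesn’t ship stdext::hash_map, a plain std::map should work as a drop-in replacement. A sketch of the only changes needed – drop the using stdext::hash_map; declaration and swap the include and the Dict typedef – with the rest of the program left untouched:

#include <map>	// instead of <hash_map>

typedef vector<string> value_type;
typedef map<string, value_type> Dict;	// ordered map: O(log n) lookups instead of expected O(1)
typedef Dict::const_iterator dict_iter;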

If you run this program you will find it returns answers instantaneously. Even if you enter a long nonsensical string it will determine that there are no anagrams instantly. In fact, a hash_map lookup is potentially O(1), i.e. constant time, which means the lookup cost does not grow with the number of words stored in the map.

What this teaches us is that it is sometimes beneficial to change the representation of the input data in order to speed up the program. A map (or hash_map) is an extremely efficient way of storing related data. If you think about it, we have actually created a database, where each <key, value> pair is a record in the database. The value is the data we require (the anagrams) and the key is the index (the sorted string).

Sunday, June 7, 2009

Anagrams Part 2 – Binary Search

One of the problems with the previous algorithm (see Anagrams Part 1) is that it performed a linear search of the word list to find each permutation. That is, if there are 38,000 words then the worst case is 38,000 comparisons – and for anagrams most lookups will be worst cases, because most permutations are not real words, so the search runs to the end of the list before it can give up. A linear search is all you can do if the word list is a random jumble, but most word lists are stored in alphabetical order. That means we can choose a much better search algorithm – the binary search.

Binary search works with sorted data. Imagine if you had a phone book and wanted to find if “Flubberblubber” is in the book. You wouldn’t start at the beginning and scan through to the end would you? No, you’d probably open the book roughly halfway through. You might see that all the names start with “M”. You’d then open the book again roughly halfway between the start and the “M” page. You might hit names that start with “H”. So again you’d open the book halfway to “H”. This time you hit “E”. Now you’d open the book halfway between “E” and “H”. See how you are homing in on the desired “F” pages after only 3 lookups?

In fact, if the phone book contained 1 million entries you’d get to the desired name (or find it didn’t exist) within 20 lookups if you followed this method, because 2^20 ≈ 1,000,000. So a worst case of 20 compared to 1,000,000 using linear search. Even better, binary search scales remarkably well: if the phone book contained 1 billion entries, the number of lookups would only increase to about 30. Now that’s impressive. The reason this works is that we can discard 50% of the remaining pages of the phone book every time we do a lookup. In fact binary search is an O(log n) algorithm, while linear search is O(n).
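
To make the halving idea concrete, here is a rough sketch of how you might hand-roll a binary search over a sorted vector<string>. The helper name contains is just for illustration, and it assumes the usual <vector> and <string> includes plus using namespace std; we’ll use the standard library version below rather than this.

bool contains(const vector<string>& dict, const string& target)
{
	int low = 0;
	int high = static_cast<int>(dict.size()) - 1;
	while (low <= high) {
		int mid = low + (high - low) / 2;	// middle of the remaining range
		if (dict[mid] == target)
			return true;					// found it
		else if (dict[mid] < target)
			low = mid + 1;					// discard the lower half
		else
			high = mid - 1;					// discard the upper half
	}
	return false;							// not in the list
}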

So does the C++ Standard Library contain this useful facility? Of course it does! There is a binary_search() function in the <algorithm> header that does exactly what we want.

Simply replace the following lines from the original program:

vector<string>::const_iterator iter = find(dict.begin(), dict.end(), str);
if (iter != dict.end())
    // we have found an anagram so print it
    cout << *iter << " ";

with:

if (binary_search(dict.begin(), dict.end(), str))
    // we have found an anagram so print it
    cout << str << " ";

Not only is it faster, it’s also a bit clearer (no iterators). Running the program to find anagrams of “integral” gives a near instantaneous result, as opposed to 21 seconds with linear search.

So is this the best we can do? Not really. Running the program on a 16-character string still takes minutes to return an answer.

We’ve solved the problem of searching, but unfortunately we are still generating on the order of n! permutations of the input string – for a 16-character string that could be as many as 16! ≈ 2 × 10^13 permutations. This is now the bottleneck, so in the next part we’ll look at reducing that.