Friday, June 19, 2009

Anagrams Part 3 – Using a Map

So how can we speed up our anagram algorithm? Well, how about we rearrange the input word list so that it stores all anagrams for us? This would be a one-off operation at the start of the program. From then on we would simply find all anagrams with a simple look up operation.

Hmmm how can we do that?

Well all anagrams of a particular word are made up of the same letters. If we sort all the anagrams into lexicographical order we get the same ordered sequence of characters. For example the following three words:

part trap rapt

are made up of the sorted sequence aprt. Lets call this the key, and the three original words the values.

What we need is a data structure that can store the key and values like this:

aprt => part trap rapt

This can be represented nicely in C++ as follows:

string => vector<string>

We can use a map to store this:

map<string, vector<string> >

We build the map by reading each word from the word list, sorting it, then adding it to the map and appending the unsorted word to the associated vector. We will end up with an in-memory representation like this:

aprt => part trap rapt
act => act cat
aeprs => spare pares

The code can explain this much better than I can, here is the complete program:

#include <iostream>
#include <hash_map>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>

using namespace std;
using stdext::hash_map;

typedef vector<string> value_type;
typedef hash_map<string, value_type> Dict;
typedef Dict::const_iterator dict_iter;

int main()
{	
	Dict dict;
	int word_count = 0;
	
	ifstream in_file("words.txt");
	if (!in_file) {
		cerr << "Input file not found" << endl;
		exit(1);
	}
	
	string s, sorted_s;
	// read a word, sort the chars then use it as a key to the hash table. Append the original
	// unsorted word to the vector that is the value.
	while (in_file >> s) {
		sorted_s = s;
		sort(sorted_s.begin(), sorted_s.end());
		dict[sorted_s].push_back(s);
		word_count++;
	}
	
	cout << word_count << " words read" << endl;
	
	string str;
	while (str != "q") {
		cout << "\nEnter letters (q to quit): ";
		cin >> str;
		sort(str.begin(), str.end());
		dict_iter iter = dict.find(str);		// find the sorted string in the hash table
		if (iter != dict.end()) {				// if it exists...
			value_type vec = (*iter).second;	// ...grab the vector 
			cout << vec.size() << (vec.size() == 1 ? " anagram -> " : " anagrams -> ");
			// output the anagrams
			copy (vec.begin(), vec.end(), ostream_iterator<string>(cout, " "));
			cout << endl;
		}
		else 
			if (str != "q") cout << "word not found" << endl;
	}
	return 0;
}

You can see the map is built in only 4 lines of C++ (lines 29 to 32). To keep things interesting I have used a hashmap instead. This is a drop-in replacement for std::map offering potentially better performance. Note that it lives in the stdext namespace if you use Visual Studio.

If you run this program you will find it returns answers instantaneously. Even if you enter a long nonsensical string it will determine that there are no anagrams instantly. In fact, using a hashmap can offer us potentially O(1) performance i.e. constant time. This means the performance does not depend on the length of the input string.

What this teaches us is that it is sometimes beneficial to change the representation of the input data in order to speed up the program. A map (or hashmap) is an extremely efficient way of storing related data. If you think about it we have actually created a database, where each <key, value> pair is a record in the database. The value is the data we require (the anagrams) and the key is the index (the sorted string).

No comments:

Post a Comment