Thursday, May 28, 2009

Anagrams Part 1 - Brute Force

Beginner programmers are often surprised when a supposedly simple program takes a lot longer to execute than they expect. After all if a modern PC can simulate an entire battle consisting of hundreds of soldiers in real-time 3D, surely it can display all anagrams of a phrase without pausing to think for hours? Consider the problem of finding all anagrams of a word that are the same length as the word. For example, the word "integral" has 8 characters and has the following 8 character anagrams:

alerting
altering
integral
relating
triangle

For completeness we include the original string in the list of anagrams. I used a word list of around 38,000 common English words to get the anagrams. So how can we program a computer to find these anagrams? A first guess algorithm might be:
for each permutation p of characters in string s
...if p exists in the wordlist then print anagram
Luckily the C++ Standard Library contains a handy function called next_permutation(first, last) that takes a pair of iterators as arguments. The function changes the order of the elements in the range [first, last) to the next lexicographic permutation and returns true. If there is no next_permutation, it arranges the sequence to be the first permutation and returns false. For all permutations to be generated the starting sequence must be initially sorted in ascending order.

So, using this function we can easily generate all permutations of a word and check them against a word list.

Such a program might look like this:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>

using namespace std;

int main() 
{ 
    vector<string> dict;
    int word_count = 0;
    
    // Make sure words.txt is in your current directory.
    // Search the web for a suitable word list.
    ifstream in_file("words.txt");
    if (!in_file) {
        cerr << "word list not found" << endl;
        exit(1);
    }
    // read word list and store words in a vector
    string s;
    while (in_file >> s) {
        dict.push_back(s);
        word_count++;
    }
    
    cout << word_count << " words read" << endl;
    string str;
    while (true) {
        int count = 0;
        cout << "\nEnter letters (q to quit): ";
        cin >> str;
        if (str == "q") break;
        bool perms_remaining = true;
        sort(str.begin(), str.end()); // necessary so that next_permutation finds all permutations
        while (perms_remaining) {
            count++;
            vector<string>::const_iterator iter = find(dict.begin(), dict.end(), str);
            if (iter != dict.end())
                // we have found an anagram so print it
                cout << *iter << " ";
            perms_remaining = next_permutation(str.begin(), str.end());
        }
        cout << endl << count << " permutations" << endl;
    }
}


When I run this on my Core 2 Duo 2.1GHz PC it takes 21 seconds to find all 5 anagrams. There are noticeable pauses between each found word. Adding another letter to the input to make, say, integrals and the time rockets to 3 min 12 sec. So going from 8 characters to 9 characters results in a 9-fold increase in time. Yes folks this is an O(n!) algorithm, one of the worst performing possible.

There are n! permutations if the string has n unique characters, so for n = 8 there are 40320 permutations. For each of these the program scans the word list of 38000 words to find a match. Considering there are only a handful of anagrams for any given word, that means in effect the full list is being scanned every time. That makes 40320*38000 = 1,532,160,000 string comparisons. Now that's brute-force! That's where the 21 seconds goes. And that's for a relatively short word. Surely there must be a better way? Of course there is! See Part 2 to find out.

No comments:

Post a Comment