Thursday, May 28, 2009

Anagrams Part 1 - Brute Force

Beginner programmers are often surprised when a supposedly simple program takes a lot longer to execute than they expect. After all if a modern PC can simulate an entire battle consisting of hundreds of soldiers in real-time 3D, surely it can display all anagrams of a phrase without pausing to think for hours? Consider the problem of finding all anagrams of a word that are the same length as the word. For example, the word "integral" has 8 characters and has the following 8 character anagrams:

alerting
altering
integral
relating
triangle

For completeness we include the original string in the list of anagrams. I used a word list of around 38,000 common English words to get the anagrams. So how can we program a computer to find these anagrams? A first guess algorithm might be:
for each permutation p of characters in string s
...if p exists in the wordlist then print anagram
Luckily the C++ Standard Library contains a handy function called next_permutation(first, last) that takes a pair of iterators as arguments. The function changes the order of the elements in the range [first, last) to the next lexicographic permutation and returns true. If there is no next_permutation, it arranges the sequence to be the first permutation and returns false. For all permutations to be generated the starting sequence must be initially sorted in ascending order.

So, using this function we can easily generate all permutations of a word and check them against a word list.

Such a program might look like this:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>

using namespace std;

int main() 
{ 
    vector<string> dict;
    int word_count = 0;
    
    // Make sure words.txt is in your current directory.
    // Search the web for a suitable word list.
    ifstream in_file("words.txt");
    if (!in_file) {
        cerr << "word list not found" << endl;
        exit(1);
    }
    // read word list and store words in a vector
    string s;
    while (in_file >> s) {
        dict.push_back(s);
        word_count++;
    }
    
    cout << word_count << " words read" << endl;
    string str;
    while (true) {
        int count = 0;
        cout << "\nEnter letters (q to quit): ";
        cin >> str;
        if (str == "q") break;
        bool perms_remaining = true;
        sort(str.begin(), str.end()); // necessary so that next_permutation finds all permutations
        while (perms_remaining) {
            count++;
            vector<string>::const_iterator iter = find(dict.begin(), dict.end(), str);
            if (iter != dict.end())
                // we have found an anagram so print it
                cout << *iter << " ";
            perms_remaining = next_permutation(str.begin(), str.end());
        }
        cout << endl << count << " permutations" << endl;
    }
}


When I run this on my Core 2 Duo 2.1GHz PC it takes 21 seconds to find all 5 anagrams. There are noticeable pauses between each found word. Adding another letter to the input to make, say, integrals and the time rockets to 3 min 12 sec. So going from 8 characters to 9 characters results in a 9-fold increase in time. Yes folks this is an O(n!) algorithm, one of the worst performing possible.

There are n! permutations if the string has n unique characters, so for n = 8 there are 40320 permutations. For each of these the program scans the word list of 38000 words to find a match. Considering there are only a handful of anagrams for any given word, that means in effect the full list is being scanned every time. That makes 40320*38000 = 1,532,160,000 string comparisons. Now that's brute-force! That's where the 21 seconds goes. And that's for a relatively short word. Surely there must be a better way? Of course there is! See Part 2 to find out.

Friday, May 15, 2009

TTMath - a bignum library for C++

Came across this handy BigNum library for arbitraily large numbers in C++. All you need to do is include a single header and then declare a bignum type such as:
typedef ttmath::UInt<100> BigInt;
which creates a type that can hold unsigned integers between 0 and 2 ^ (32*100)-1. Which is absolutely huge. Actually you can go even bigger if the stack size on your OS permits. Yes, this library uses the stack for all calculations, there are no dynamic allocations. So now you can define a factorial function like so
BigInt factorial(BigInt n)
{
    if (n == 1 || n == 0)
        return 1;
    n *= factorial(n-1);
    return n;
}
and display all those fun factorials that we love so much. Did you know 35! = 10333147966386144929666651337523200000000 Fascinating eh!