Recently I released an app called Ghost 2D. I’d like to share some of the stuff I did to deal with the challenges presented by games with a lexicon of a language in them.
First off, how do you check if a word is in the lexicon or not? The easiest but slowest way, as any computer programmer will tell you, is to look sequentially through the lexicon for your word. A normal human being will take advantage of the fact that words can be ordered and go straight to the section dedicated to the word’s first letter in a dictionary, and then zero in on the word by moving less and less through the book as more initial letters match. This is much faster, for humans and computers alike. You have to store the lexicon in a certain manner, but unless you’re expecting to read it yourself, there’s no loss in doing so. The exact way to store the lexicon is as tree graph where edges correspond to letters and nodes correspond to sequences of letters and contain a bit of information which says whether or not the sequence of letters is a word. So if you want to know if “cart” is a word, you’d start at the initial node, move along its ‘c’ edge to the “c” node, then move along the “c” node’s ‘a’ edge to the “ca” node, then move along the “ca” node’s ‘r’ edge to the “car” node, then move along the “car” node’s ‘t’ edge to the “cart” node, where you’d check whether or not “cart” was a word and retrieve that, yes, it is a word. Now how long it takes to check a word depends mostly on how long the word is, not how long the lexicon is. Words in a lexicon are typically much shorter than the lexicon itself, so this is a good thing. This is, I believe, a specific example of a more general concept called the (stupidly named) trie. And there are yet better ways to do lexicons, as you’ll see next.
Next, you want to find out what words have a certain substring inside them. I don’t think there’s a fast and simple way to do this, and I don’t think the way I did it was much simpler than what was done here and definitely wasn’t as fast. The article also shows how to do a bit better than what I described above.
Next, where do you get your lexicon from? I found a free word list here. However, if you want to make a computer opponent for a player to play against, it’s nice if the computer doesn’t act like it knows every word ever. So I also wanted a list of common words. My own personal chat logs were one idea, but I didn’t chat enough since my last accidental loss of them to have a good sampling of common words, plus a lot of common typos tend to be uncommon words, so there was no good way to filter those out. There’re precompiled lists of words in decreasing frequency on Wikipedia, but I found relatively common words were still missing. In the end I just created a lexicon from words used in a few books from Project Gutenberg. Project Gutenberg hosts books whose copyright has expired, so a charming (to me) result of this is that the computer seems old and proper.
Now, there were still fairly common words missing, and if the human started playing hard words, I wanted the computer to be able to keep up. The reason words still seemed to be missing is that humans understand the morphological rules of the language they speak, and computers don’t. So a human might play the word “damper” because they know the word “damp” and the morpheme “er”. The word “damper” rarely appears in text, but since “damp” and “er” are both common, “damper” is effectively a common word. I wasn’t able to find a good source that grouped words with the same root, and I still had to deal with the human playing hard words, so I opted to let the computer look at the big lexicon, but only when words on the board forced it to.
I wanted the human players to be able to decide on what the lexicon should be. The worst way of doing that would be to ask them to supply all the words up front, before they play the game. That’s not realistic since there’re a lot of words. Instead, I provide a lexicon which I think is roughly what they want, and allow them to edit it on the fly. So although the lexicon isn’t what the players think it should be, any conflicts are resolved before they affect the game.