Earlier this year I wrote about solving word ladders with MATLAB. There was a lot of interest in that post, so I thought I'd share my investigations regarding another word-based app. In this script, I'm trying to create a puzzle modeled on the NY Times Spelling Bee game.
Here is the premise: you are given seven letters, and your job is to find all the words you can make with those seven letters. You can use a letter more than once, but every word must be four or more letters long and must include the special center letter. Every Spelling Bee puzzle has at least one "pangram", which is a word that uses all seven letters.
As an example, suppose you were given a puzzle with the letters A, D, E, P, R, and T arranged around the center letter O.
From this puzzle you could create PORE, ADORE, ADOPT, as well as the pangram OPERATED. But you couldn't create POD (too short) or PAPER (doesn't include the center letter "O").
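The rules above can be sketched as a small validity check. This is only an illustration, not part of the original script: the helper name `isValid` and the variables `puzzleLetters` and `puzzleCenter` are hypothetical, and the letters are those of the example puzzle.

```matlab
% Hypothetical helper (an assumption, not from the original script):
% test candidate words against the example puzzle.
puzzleLetters = 'adeoprt';   % the seven letters of the example puzzle
puzzleCenter  = 'o';         % every valid word must contain this letter

% A word is valid if it has four or more letters, uses only puzzle
% letters, and includes the center letter.
isValid = @(w) strlength(w) >= 4 && ...
    all(ismember(char(lower(w)), puzzleLetters)) && ...
    contains(lower(w), puzzleCenter);

candidates = ["PORE" "ADORE" "ADOPT" "OPERATED" "POD" "PAPER"];
for w = candidates
    fprintf('%-9s %d\n', w, isValid(w))
end
```

The first four candidates should print 1; POD fails the length test and PAPER fails the center-letter test.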
Here I'm not so much interested in solving the puzzle. I want to write a script that can create a Spelling Bee puzzle.
Build the Dictionary
We need some ground truth for all the legal words. Here's Google's 10,000 word English dictionary. It's small, but good enough for our purposes here.
url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt';
% Download the word list as one long character vector
wordList = webread(url);
Split the word list into strings.
words = split(string(wordList));
The Spelling Bee puzzle avoids the letter S so that it never has to deal with plurals. Let's use logical indexing to keep only the words that don't contain an S.
keep = ~words.contains("s");
words = words(keep);
Of these, keep only the words that are four letters or longer.
keep = words.strlength >= 4;
words = words(keep);
I like how these string commands are easy to read and understand! Just like for us humans, code that communicates clearly keeps its job the longest. We don't want to employ fussy, high-maintenance code.
How many words do we end up with?
fprintf("The adjusted dictionary now contains %d words\n",length(words))
We've thrown out about half the words in our dictionary.
We'll be converting from ASCII representation to alphabet number a few times, so let's invest in a little anonymous helper function. It'll make the code below a little easier to read.
% An anonymous function to translate ASCII values to A=1, B=2, ... Z=26.
ascii2letter = @(chars) abs(lower(chars))-96;
Try it out
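A quick check along these lines shows the mapping. The helper is repeated so the snippet runs on its own, and the test strings are arbitrary choices for illustration.

```matlab
% Same helper as above, repeated so this snippet is self-contained
ascii2letter = @(chars) abs(lower(chars))-96;

lowerCase = ascii2letter('abc')   % a, b, c map to 1, 2, 3
upperCase = ascii2letter('XYZ')   % case is ignored: 24, 25, 26
```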
Build a 26-column sparse matrix to represent unique letter usage for each word. Each word in the dictionary gets one row. Every letter that appears in the word gets a 1.
dictionaryLength = numel(words);
a = sparse(dictionaryLength,26);
for i = 1:dictionaryLength
    % Each distinct letter in the word sets one column in this word's row
    uniqueLetters = unique(words(i).char);
    a(i,ascii2letter(uniqueLetters)) = 1;
end
At 22% full, this matrix isn't terribly sparse as sparse matrices go. But it's a fun excuse to play with SPARSE, so let's carry on. Here's a SPY plot of the matrix.
spy(a)
set(gca,PlotBoxAspectRatio=[1 1 1])
To make much out of this, we need to zoom in. Let's see how the word LABEL is encoded.
% In case you're wondering, I like to use "ix" as short for "index"
Note that even though the L appears twice in the word, it gets only a single 1 in the matrix.
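Here is one way to inspect the encoding of a single word. It is a sketch on a tiny stand-in word list (`demoWords` and `demoA` are hypothetical names), not the full dictionary matrix built above.

```matlab
% Stand-in data (an assumption): three words instead of the full dictionary
ascii2letter = @(chars) abs(lower(chars))-96;
demoWords = ["able"; "label"; "table"];

% Build the same kind of letter-usage matrix, one row per word
demoA = sparse(numel(demoWords),26);
for i = 1:numel(demoWords)
    demoA(i,ascii2letter(unique(char(demoWords(i))))) = 1;
end

ix = find(demoWords == "label");            % row index of LABEL
flagged = char(find(demoA(ix,:)) + 96)      % letters marked in that row
```

The row for LABEL marks only A, B, E, and L, so the repeated L contributes a single entry.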
Just for fun, let's do a histogram of the letters. This can be a useful sanity check that we're on the right track.
allLetters = ascii2letter(char(join(words,'')));
As expected, E is the most commonly appearing letter. Q is the least common, not counting S, which we removed entirely.
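The counts behind such a histogram can also be tabulated directly. This sketch uses `accumarray` on a three-word stand-in list (`demoWords` is a hypothetical name); with the real dictionary you would pass `words` instead.

```matlab
% Stand-in data (an assumption): three words instead of the full dictionary
ascii2letter = @(chars) abs(lower(chars))-96;
demoWords = ["label"; "queue"; "energy"];

% Concatenate all words and convert each character to 1..26
allLetters = ascii2letter(char(join(demoWords,'')));

% Count how often each letter number occurs
counts = accumarray(allLetters(:), 1, [26 1]);

[~,ixMost] = max(counts);
fprintf('Most common letter: %s (%d occurrences)\n', ...
    upper(char(ixMost+96)), counts(ixMost))
```

Calling `histogram(allLetters)` or `bar(counts)` would draw the same information as a plot.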
How many words are composed of exactly seven distinct letters? That is, how many are potential pangrams for a game?
% Summing letter-use across the rows
% Use FIND to build an index to all the words that use exactly seven letters
ix7LetterWords = find(sum(a,2)==7);
Let's look at a few of these words with seven distinct letters.
num7LetterWords = numel(ix7LetterWords);
ix = ix7LetterWords(1:10);
Pick a good one as the "seed" pangram for our game.
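How the seed is picked is a matter of taste. The sketch below shows two possibilities on stand-in data; the names `demoWords`, `demoIx7`, and `ixSeed` are hypothetical. Whichever way you choose, the index must be a single number, because the steps that follow read one row of the letter matrix with it.

```matlab
% Stand-in data (an assumption): a tiny word list and the indices of
% its words that have exactly seven distinct letters
demoWords = ["adore"; "operated"; "pore"; "thinker"];
demoIx7 = [2; 4];

% Option 1: pick the seed by name
ixSeed = find(demoWords == "operated");

% Option 2: pick one of the candidates at random
% ixSeed = demoIx7(randi(numel(demoIx7)));

fprintf('Seed word: %s\n', demoWords(ixSeed))
```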
How many words can be spelled from those same seven letters? Find all the words that use nothing but the letters in the seed word.
lettersInSeedWord = a(ix,:);
% Find all the words that use no letters outside the seed word.
% We're making sure that all contributions of letters other than
% the seed letters sum to zero.
ixWordsFromSeedWord = find(sum(a(:,~lettersInSeedWord),2) == 0);
fprintf('%d words use only the seven letters appearing in the seed word.\n',numel(ixWordsFromSeedWord))
Which letters appear most frequently?
We need to designate one letter as the "center letter" that must appear in every word. Let's pick the letter that appears least frequently.
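vals = full(sum(a(ixWordsFromSeedWord,:)));
vals(vals==0) = Inf;    % ignore letters that are not in the seed word
[~,ixlet] = min(vals);  % column of the least-used seed letter
centerLetter = char(ixlet+96);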
Remove all the words that don't have this letter
ixRequired = find(a(:,ixlet));
ix2a = intersect(ixWordsFromSeedWord,ixRequired);
This is the original seed word:
Here are the seven unique letters:
sevenLetters = unique(char(words(ix)));
The required center letter is:
Here are the pangrams
ix3 = find(sum(a(ix2a,:),2)==7);
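One detail is easy to trip over here: `ix3` indexes into `ix2a`, not into `words`, so the lookup has to be nested. A minimal illustration on stand-in data (the names with a `demo` prefix are hypothetical):

```matlab
% Stand-in data (an assumption): pretend rows 1, 3, and 4 survived the
% filters, and the second survivor is the pangram
demoWords = ["adore"; "label"; "operated"; "pore"];
demoIx2a = [1; 3; 4];
demoIx3 = 2;

% Index twice: first into the surviving rows, then into the word list
pangrams = demoWords(demoIx2a(demoIx3))
```

With the variables from the script, the equivalent lookup would be `words(ix2a(ix3))`.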
Here is the complete list of words
Some of those are a little odd, but we have our dictionary to thank for that.
sixLetters = setdiff(sevenLetters,centerLetter);
sixLetters = sixLetters(randperm(6));
Now for the graphics! Make the plot
% The perimeter hex cells are centered from pi/6 to 11*pi/6 radians
t = 2*pi*(0:5)/6 + 2*pi/12;
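The rest of the drawing could look something like the sketch below. It is an assumption about the layout, not the original plotting code: unit hexagons with flat tops, neighbors spaced sqrt(3) apart, and hypothetical names `demoCenter`, `demoSix`, `hexX`, `hexY`, `cx`, and `cy`. The letters are those of the example puzzle.

```matlab
% Example puzzle values (an assumption, for a self-contained demo)
demoCenter = 'o';
demoSix = 'adeprt';

% Directions from the center cell to the six perimeter cells
t = 2*pi*(0:5)/6 + 2*pi/12;

% Outline of a unit hexagon with vertices at 0, 60, ..., 360 degrees
hexX = cos(2*pi*(0:6)/6);
hexY = sin(2*pi*(0:6)/6);

% Touching unit hexagons have centers sqrt(3) apart
d = sqrt(3);
cx = [0, d*cos(t)];
cy = [0, d*sin(t)];
letters = upper([demoCenter demoSix]);

figure
hold on
axis equal
axis off
for k = 1:7
    % Draw one cell and put its letter in the middle
    patch(hexX + cx(k), hexY + cy(k), 'w');
    text(cx(k), cy(k), letters(k), ...
        'HorizontalAlignment','center','FontSize',24);
end
```

The center letter is drawn first at the origin, and the six shuffled letters are placed around it in the directions given by `t`.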
And there you have it! It's easy to see how valuable a good algorithmically generated game can be. Once you've worked out the logic, you just need to run it once every day and your puzzle is ready to go. Compare this with the labor that goes into a crossword puzzle. Although who knows? Maybe crossword puzzles will be yet another of those things that AI proves to be adept at.