Word List
Status: implemented
The word list provides an efficient way to store words for the editor.
Requirements
We need to support the following use cases:
Efficient lookup of clues when filling in a crossword. We want to avoid linear searches through the words when filtering.
Filters need to filter by letters. For example, "?OR??" will match WORDS and CORES, among others.
Multiple concurrent instances of the word list.
Internationalized word lists.
Find anagrams out of a set of letters.
We need this to be as fast as possible. We want to be able to efficiently build puzzles recursively in order to find possible solutions for sections of the puzzle.
We also never need to have a list of words of mixed word LENGTH. It's always sorted by word length; we never have any list with both CAT and MOUSE in it. Also, note that when we talk about length, we mean the number of code points in a given word, and not the number of bytes. So AÑO has a length of three, despite being four bytes long when encoded in UTF-8.
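The "?" filter semantics above can be sketched as a simple matcher. This is an illustrative helper, not part of the WordList API, and it compares bytes rather than code points for brevity (fine for the ASCII examples here):

```c
#include <string.h>

/* A pattern matches a word when both have the same length and every
 * non-'?' pattern character equals the corresponding word character. */
static int
filter_matches (const char *pattern, const char *word)
{
  if (strlen (pattern) != strlen (word))
    return 0;
  for (; *pattern != '\0'; pattern++, word++)
    if (*pattern != '?' && *pattern != *word)
      return 0;
  return 1;
}
```

The real list never scans linearly with a matcher like this; the point of the file format below is to answer such queries by lookup instead.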
Overall approach
We store the word data in an mmap()-able file. The file consists of three distinct blocks of data, each described in greater detail below:
Word List – A list of all the words in the data set. These words are bucketed by word length in different subsections. Within each bucket they are stored first by priority, and then alphabetically.
Filter Fragment List – A list of words, indexed by letter, sorted by word length. So, for example, the three-letter word subsection would have all the words that match "A??", followed by all the words matching "B??", all the way up through all the words matching "??Z".
Index Section – The table of contents for the file. It has byte offsets for each section and subsection of the file. It's located at the end of the file, and is stored as a JSON block.
Everything in the list will be indexed by a byte offset into the file. We will mmap it and seek through it to find what we need.
When looking for the list of all words of a given length, it's just the Word List bucket for that word length. When we are looking for a filter pattern with one character selected (such as "??M????"), the Letter Index stores this list directly. When we want two or more characters within a filter (such as "??MO??T"), we build up a temporary list out of the intersection of each filter fragment. We can do a quick walk through each fragment's list looking for matching words, and since each list is sorted by index we only have to do one pass.
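The one-pass walk above is an ordinary sorted-list intersection: advance whichever cursor points at the smaller word index, and emit an index only when both lists agree. A minimal sketch, using a plain `unsigned short` as a stand-in for GLib's gushort:

```c
#include <stddef.h>

typedef unsigned short gushort_t; /* stand-in for GLib's gushort */

/* Intersect two sorted fragment lists in a single pass.
 * Writes matching indexes to `out` and returns how many were written. */
static size_t
intersect_fragments (const gushort_t *a, size_t a_len,
                     const gushort_t *b, size_t b_len,
                     gushort_t *out)
{
  size_t i = 0, j = 0, n = 0;

  while (i < a_len && j < b_len)
    {
      if (a[i] < b[j])
        i++;                /* a is behind; advance it */
      else if (a[i] > b[j])
        j++;                /* b is behind; advance it */
      else
        {
          out[n++] = a[i];  /* present in both lists */
          i++;
          j++;
        }
    }
  return n;
}
```

For more than two fragments, the same walk extends to k cursors (or the result can be intersected with each additional fragment in turn); either way each list is still read only once.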
API
WordList is a GObject that implements a GListModel-like API. It has a stateful "filter" property that determines what it makes available. The filter defaults to "" – the empty word list.
WordList    *word_list_new          (void);
void         word_list_set_filter   (WordList    *word_list,
                                     const gchar *filter);
guint        word_list_get_n_items  (WordList    *word_list);
const gchar *word_list_get_word     (WordList    *word_list,
                                     guint        position);
gint         word_list_get_priority (WordList    *word_list,
                                     guint        position);
The priority of each word is a value between 0 and 255, and defaults to 50.
Like the GListModel interface, this lets the user get the number of words (per filter) and get the word/priority for each position. However, unlike GListModel, it doesn't emit signals and doesn't return GObjects, for performance reasons. Wrapping it is a relatively simple task, and WordListModel is a wrapper around WordList that provides this interface.
Note that WordList is completely stable. It will always return the same answers for a given filter. This means that you can set the filter to some value and iterate through the items, change the filter, and then set it back and continue your iteration.
Data Sections
As mentioned, the overall resource file is divided into three blocks of data:
The Word List Sections
The Filter Fragments
The Index Section
Each is described in detail below.
Word List Sections
This block stores all the words along with their priority. The block is divided into multiple Word List Sections – one for each word length. So, for example, there’s a section for all the words that are three characters long, followed by a section for all the words that are four characters long, etc.
Each word entry in a section consists of a 1-byte priority stored as an unsigned char, followed by a UTF-8 string terminated by the null character. It is padded out with '\0' characters to fit the longest (byte-wise) word within the section; that width is stored as STRIDE. See the charset section below for a little more clarification.
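The entry layout above can be sketched as follows. Note one assumption this document doesn't pin down: whether STRIDE includes the terminator. The sketch assumes one terminator byte after STRIDE, so each entry occupies 1 + STRIDE + 1 bytes; `write_entry` and friends are illustrative helpers, not the real implementation:

```c
#include <string.h>

/* One word entry: [priority byte][UTF-8 word][NUL padding up to STRIDE+1].
 * Zeroing the whole entry first means the padding doubles as the
 * terminator for words shorter than STRIDE. */
static void
write_entry (unsigned char *buf, size_t stride,
             unsigned char priority, const char *word)
{
  memset (buf, 0, 1 + stride + 1);
  buf[0] = priority;
  memcpy (buf + 1, word, strlen (word));
}

static unsigned char
entry_priority (const unsigned char *buf)
{
  return buf[0];
}

static const char *
entry_word (const unsigned char *buf)
{
  return (const char *) buf + 1;
}
```

Because every entry in a section has the same size, the word at position N is found by a single multiplication, which is what makes the mmapped lookups cheap.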
Filter Fragment section
The Filter Fragment section contains two blocks of memory. The first block contains the Letter Lists, which are basically one big block of gushorts: all the word lists for every filter fragment, concatenated together. No state or structure can be recovered from within the block itself. The lists are in a predetermined order, but there's nothing special about the order.
The Letter List is also large; we store a gushort per letter per word, which means it is twice as big as the Word List (for English). As an example, for the fragment "??T" we store the offset of every word in the three-letter Word List block that matches that pattern. That is immediately followed by a list of all the words that match "??U", and so forth.
The second part of the Filter Fragment section is the index for the Letter List block. This is a table of offsets used to determine which segment of the Letter List contains the list of words for a given fragment. For each list we store a guint byte offset and a gushort length.
The Filter Fragment section is super confusing to describe, so here's an example. Consider the query "??T":
Start at Length = MIN-LENGTH
+-+ +-+
|A|x25 more |A|x25 more
+-+ +-+
Length = 2 (Offset = 0)
/ "??T" IS IN HERE
+-+ +-+ +-+
|A|x25 more |A|x25 more |A|x25 more
+-+ +-+ +-+
Length = 3 (Offset = 52 * 6)
+-+ +-+ +-+ +-+
|A|x25 more |A|x25 more |A|x25 more |A|x25 more
+-+ +-+ +-+ +-+
Length = 4 (Offset = 130 * 6)
...
Continues up to Length = MAX-LENGTH
In the example above, the query is found at offset 744. The next 6 bytes contain the guint byte offset into the Letter List block for that fragment's word list, followed by the gushort length of that list.
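The section offsets in the diagram follow a simple formula: each word length contributes (length × 26) fragments for a 26-letter charset, and each index entry is 6 bytes (a guint offset plus a gushort length). A sketch that reproduces the "Offset = 52 * 6" and "Offset = 130 * 6" figures, assuming MIN-LENGTH = 2 (the within-section ordering of fragments is left out, since it depends on the predetermined order mentioned above):

```c
/* Byte offset of the index entries for a given word length, assuming a
 * 26-letter charset, 6-byte entries, and MIN-LENGTH = 2 as in the diagram. */
enum { CHARSET_SIZE = 26, ENTRY_BYTES = 6, MIN_LENGTH = 2 };

static unsigned int
section_start (unsigned int length)
{
  unsigned int fragments = 0;

  /* Every shorter length contributes one fragment per (position, letter). */
  for (unsigned int len = MIN_LENGTH; len < length; len++)
    fragments += len * CHARSET_SIZE;

  return fragments * ENTRY_BYTES;
}
```

So the length-3 section begins at 52 entries in (two positions times 26 letters for the length-2 words), and the length-4 section begins at 130 entries in (52 + 3 × 26).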
Assumptions
We're storing the length of each filter fragment section in the word list as a gushort, so we assume that no fragment list is longer than 65,535 words. So far that's true with the lists we're using.
Index Section
The Index Section is a JSON block located at the end of the file. It's found by seeking to the end of the file and scanning backwards until a '\0' character is reached. Despite being at the end of the file, it's read first, as it holds the keys needed to understand the rest of the data within the file.
Data we store in the header:
Valid character list found in the puzzle. This can be used for building a filter, and is stored as an array of unichars
Meta information about the data that's stored
Locations for each word list. For example, the location of all 2-letter words, all 3-letter words, etc.
Location of the filter fragment section
Example Index
\0{
"charset": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"filterchar": "?",
"min-length": 2,
"max-length": 21,
"threshold": 50,
"letter-list-offset": OFFSET,
"letter-index-offset": OFFSET,
"words": [[2, STRIDE, OFFSET, LENGTH],
[3, STRIDE, OFFSET, LENGTH],
[4, STRIDE, OFFSET, LENGTH],
...
[21, STRIDE, OFFSET, LENGTH]]
}
Note the existence of the leading \0
character to indicate the start of the section.
CHARSET is a unicode string, UTF-8 encoded.
STRIDE is the maximum width (in bytes) of each word in the table (see note below).
OFFSET is an unsigned int with the number of bytes into the file the table starts.
LENGTH is the number of entries in the table.
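The backwards scan for the leading \0 sentinel can be sketched directly. This is an illustrative stand-in for the real loader (which would operate on the mmapped file contents) rather than its actual code:

```c
#include <stddef.h>
#include <string.h>

/* Scan backwards from the end of the file for the '\0' sentinel and
 * return a pointer to the JSON that follows it, or NULL if there is
 * no sentinel.  `data`/`size` describe the whole mmapped file. */
static const char *
find_index_json (const char *data, size_t size)
{
  for (size_t i = size; i > 0; i--)
    if (data[i - 1] == '\0')
      return data + i;
  return NULL;
}
```

Since the binary sections earlier in the file are full of '\0' bytes, scanning backwards (rather than forwards) is what makes the first '\0' found be the sentinel immediately preceding the JSON.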
Anagrams
TBD.
I’d like to be able to find anagrams with this word list, but I’m not sure yet if the letter index will give us what we need, or if we need another structure.
Internal representation
It's really easy to get confused among all the internal tables. As a result, we use the following structs as our primary reference when looking things up.
typedef struct
{
  gint length;
  gint index;
} WordIndex;

typedef struct
{
  gint length;       /* Length of the word */
  gint position;     /* Position of the character within the word */
  gint char_offset;  /* Location within the charset of the word */
} FilterFragment;
The WordIndex gives us a reference to a word (and its priority). We can look up a word with just these two fields. It is public and can be passed around by value. So we can look up a word by WordIndex, as well as find the WordIndex of a given word (if it exists).
The FilterFragment gives us an individual part of a filter, with only one character selected. So “???G????” is a filter fragment, and “???G??N?” is made out of two fragments combined. We can look up a list of WordIndexes for a given Filter Fragment.
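Splitting a filter into FilterFragments can be sketched as follows. `filter_to_fragments` is a hypothetical helper name, and the ASCII `strchr` charset lookup stands in for the real Unicode-aware charset handling described below:

```c
#include <string.h>

typedef struct
{
  int length;       /* Length of the word */
  int position;     /* Position of the character within the word */
  int char_offset;  /* Location within the charset */
} FilterFragment;

/* Decompose a filter into one fragment per selected (non-'?') character.
 * Returns the number of fragments, or -1 if a character is not in the
 * charset.  Writes at most `max` fragments to `out`. */
static int
filter_to_fragments (const char *filter, const char *charset,
                     FilterFragment *out, int max)
{
  int length = (int) strlen (filter);
  int n = 0;

  for (int i = 0; i < length && n < max; i++)
    {
      const char *p;

      if (filter[i] == '?')
        continue;
      p = strchr (charset, filter[i]);
      if (p == NULL)
        return -1;
      out[n].length = length;
      out[n].position = i;
      out[n].char_offset = (int) (p - charset);
      n++;
    }
  return n;
}
```

So "???G??N?" decomposes into two fragments, one for the G at position 3 and one for the N at position 6, each of which names one Letter List to intersect.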
Charsets and Internationalization
We have made every effort to make this UTF-8 clean / Unicode-aware. We normalize every word using Normalization Form C. For English, this has the practical result that every character is uppercase.
The charset is important for calculating the filter table. We index characters based on their offset within the charset. For the default English charset, "A" == 0, "B" == 1, etc. For other languages, accented characters and other marks are considered independent characters, so the charset for Spanish would have both N and Ñ in it, as well as accented versions of its letters.
The difference between word length and stride is not always clear. An example is the French word PÂTÉ. It has a LENGTH of 4 and a STRIDE of 6. The Â with the circumflex is considered an independent character, and is not stored along with the unaccented A.
A note on the words being used:
I'm using Peter Broda's word list for English crosswords. It's long and has plenty of dubious words, but it's a good starting point. We will pre-generate the mmapped file and update it before each release. The word list seems to be actively maintained and is updated regularly.
This file is mostly clean. It's plain ASCII with \r\n line breaks. There are a few entries containing non-letter characters, and we just discard those (for example, MIA!;50 has an exclamation point in it). There's no explicit license in the file, but permission is granted on the website:
You are free to use this list in any way you’d like. This includes commercial uses, though I’d appreciate it if you didn’t just turn around and try to sell it (but I mean, I’ll still offer it for free to anyone so that wouldn’t be a smart business venture anyway).