Module WordUtils.dawg
Implementation of a DAWG and associated trie.
What is a DAWG?
What is a DAWG? The acronym DAWG stands for "Directed Acyclic
Word Graph." Some searching on USENET yields this explanation:
[A t]rie is a tree based data structure for compact storage of a
dictionary. Words that start with the same letters share a path in the
Trie. For example, a Trie containing SIN, SINE, SING, PIN, PINE and PING
looks like this:
*
/ \
P S
/ \
I I
/ \
N N
/ \ / \
E G E G
A leaf denotes the end of a word, but there are also word ends at inner
nodes; usually a word end is represented with an extra flag in every node.
A lot of space can be saved by merging identical subtrees, like this:
*
/ \
P S
\ /
I
|
N
/ \
E G
The resulting data structure is sometimes called a DAWG (directed
acyclic word graph). Note that this merging step doesn't affect the
algorithms on the tree at all, since there is no pointer to the
parent of a node.
Each node in a trie has a value, and a list of other nodes that come
after it. One can determine whether a word is in the trie by searching
down the tree structure. If the word has been built and the current
node is marked as an endpoint or is a leaf, then the word is in the
trie.
Implementation Notes
The method used below to compress a trie into a dawg is fairly much
home-grown. "Hints" on the web have discussed working upwards
from the bottom of the tree, and what we've come up with works off
that. Check the documentation for the Dawg.compress method for more details.
Many thanks to Dave Lofquist for help with the algorithm.
Most of the public methods in the Dawg object are implemented in
what I think of as hierarchical fashion. What this means is, a public
method will be implemented almost entirely in terms of other, private
methods. Since the Dawg object supports both on-disk and in-memory
trees, for instance, most of the public methods are implemented in
terms of two different private methods, where the public method doesn't
do much except choose between them.
I think the method naming convention should be obvious, but I'll
spell it out anyway. Single-word methods are lower-case only, i.e.
lower(). Multiple-word methods are
lowerUpper(). The in-memory variant of a method starts
with _mem, i.e. _mem_lowerUpper(), while the
on-disk variant starts with _disk, i.e.
_disk_lowerUpper(). A recursively called method ends with
_r, i.e. _mem_lowerUpper_r(). For the most
part, any sub-methods associated with a public method are grouped with
it, unless they stand on their own.
Note that ON_DISK dawgs are inherently read-only (i.e.,
you cannot insert new words into or remove words from
ON_DISK dawgs). They're also not going to be as fast as
IN_MEMORY dawgs. Use them when you want to conserve memory
and when performance isn't your top priority.
Database Format
The Dawg.save method provides the option to
store the a dawg on disk in either a wordlist or "database"
format. The "database" format is essentially an on-disk
representation of the in-memory dawg, using file pointers instead of
object references. Once a database has been saved, you have the option
of using it to load an in-memory dawg, or instead accessing the
database directly from disk, using very little memory in the
process.
If you spend the time to compress a dawg generated from a large
wordlist, you will definitely want to save it in database format, to
avoid having to recompress the dawg every time you create it.
The database file format is not intended to be human readable.
However, it should be portable across operating systems, since I use
the Python marshal module to write the data to disk. As of now, I have
not done extensive testing on platforms other than Linux.
Documentation Notes
In the docstrings below, I tend to use 'trie' and 'dawg'
interchangeably. Technically speaking, they're not really the same
thing, although I think it's fair to say that a dawg is a trie, but not
vice-versa. In this code, the only difference is that a dawg is
compressed, so a brand new Dawg object is really a trie.
I only provide docstrings for public interface methods. I figure
that the arguments to the private methods should be fairly obvious
given the documentation for the public methods.
Keep in mind that any backslash characters in Epydoc markup must be
escaped, so in the markup you'll always see two of them together. This
is sometimes a bit confusing.
Author: Kenneth J. Pronovici <pronovic@ieee.org>
| Classes |
Dawg |
Class representing a Directed Acyclic Word Graph, or DAWG. |
Node |
Class representing a node in an in-memory trie or dawg. |
| Exceptions |
DawgError |
General exception class for this module. |
IN_MEMORY
-
- Type:
-
int
- Value:
|