Module WordUtils.tree
Implementation of a ternary search tree and related functionality.
What is a ternary search tree?
What is a ternary search tree? The article Ternary
Search Trees by Jon Bentley and Bob Sedgewick in Doctor Dobb's
Journal, April 1998, discusses them. The article has this to say:
When you have to store a set of strings, what data structure should
you use? You could use hash tables, which sprinkle the strings
throughout an array. Access is fast, but information about relative
order is lost. Another option is the use of binary search trees,
which store strings in order, and are fairly fast. Or you could use
digital search tries, which are lightning fast, but use lots of
space.
...[T]ernary search trees...combine the time efficiency of digital
tries with the space efficiency of binary search trees. The
resulting structure is faster than hashing for many typical search
problems, and supports a broader range of useful problems and
operations. Ternary searches are faster than hashing and more
powerful, too.
Implementation Notes
The implementation below is conceptually based on the data structure
described in the Ternary Search Trees article, although I have made
some changes because the article's implementation is focused toward C.
The Python objects are undoubtedly larger than their C counterparts,
but there is some code simplificiation due to Python's more
sophisticated data structures. The Python implementation is likely also
somewhat slower than the C implementation, although I do not have any
benchmarks.
Most of the public methods in the TernarySearchTree object are
implemented in what I think of as hierarchical fashion. What this means
is, a public method will be implemented almost entirely in terms of
other, private methods. Since the TernarySearchTree object supports
both on-disk and in-memory trees, for instance, most of the public
methods are implemented in terms of two different private methods,
where the public method doesn't do much except choose between them.
On first glace through this code, it may appear obvious that I
should just create two different objects (an in-memory tree and an
on-disk tree) to clean things up. Actually, since there is very little
in-memory structure for an on-disk tree (just a filename, really) there
would be very little benefit to reorganizing the code. As it stands,
it's not too hard to follow.
I think the method naming convention should be obvious, but I'll
spell it out anyway. Single-word methods are lower-case only, i.e.
lower(). Multiple-word methods are
lowerUpper(). The in-memory variant of a method starts
with _mem, i.e. _mem_lowerUpper(), while the
on-disk variant starts with _disk, i.e.
_disk_lowerUpper(). A recursively called method ends with
_r, i.e. _mem_lowerUpper_r(). For the most
part, any sub-methods associated with a public method are grouped with
it, unless they stand on their own.
Note that ON_DISK trees are inherently read-only (i.e.,
you cannot insert new words into or remove words from
ON_DISK trees). They're also not going to be as fast as
IN_MEMORY trees. Use them when you want to conserve memory
and when performance isn't your top priority.
The TernarySearchTree.compress method exists
for the TernarySearchTree object. However, it is a no-op right now. I
have provided it (along with its regression tests) so that it will be
easy to support compression in the future, if I can implement a decent
compression algorithm.
Database Format
The TernarySearchTree.save method provides
the option to store a tree on disk in either a wordlist or
"database" format. The "database" format is
essentially an on-disk representation of the in-memory tree, using file
pointers instead of object references. Once a database has been saved,
you have the option of using it to load an in-memory tree, or instead
accessing the database directly from disk, using very little memory in
the process.
The database file format is not intended to be human readable.
However, it should be portable across operating systems, since I use
the Python marshal module to write the data to disk. As of now, I have
not done extensive testing on platforms other than Linux.
Documentation Notes
I only provide docstrings for public interface methods. I figure
that the arguments to the private methods should be fairly obvious
given the documentation for the public methods.
Keep in mind that any backslash characters in Epydoc markup must be
escaped, so in the markup you'll always see two of them together. This
is sometimes a bit confusing.
Author: Kenneth J. Pronovici <pronovic@ieee.org>
| Classes |
Node |
Class representing a node in an in-memory ternary search tree. |
TernarySearchTree |
Class representing a ternary search tree. |
| Exceptions |
TreeError |
General exception class for this module. |
IN_MEMORY
-
- Type:
-
int
- Value:
|