| Home | Trees | Index | Help |
|
|---|
| Package WordUtils :: Module tree |
|
When you have to store a set of strings, what data structure should you use? You could use hash tables, which sprinkle the strings throughout an array. Access is fast, but information about relative order is lost. Another option is the use of binary search trees, which store strings in order, and are fairly fast. Or you could use digital search tries, which are lightning fast, but use lots of space. ...[T]ernary search trees...combine the time efficiency of digital tries with the space efficiency of binary search trees. The resulting structure is faster than hashing for many typical search problems, and supports a broader range of useful problems and operations. Ternary searches are faster than hashing and more powerful, too.
The implementation below is conceptually based on the data structure described in the Ternary Search Trees article, although I have made some changes because the article's implementation is focused toward C. The Python objects are undoubtedly larger than their C counterparts, but there is some code simplificiation due to Python's more sophisticated data structures. The Python implementation is likely also somewhat slower than the C implementation, although I do not have any benchmarks.
Most of the public methods in the TernarySearchTree object are implemented in what I think of as hierarchical fashion. What this means is, a public method will be implemented almost entirely in terms of other, private methods. Since the TernarySearchTree object supports both on-disk and in-memory trees, for instance, most of the public methods are implemented in terms of two different private methods, where the public method doesn't do much except choose between them.
On first glace through this code, it may appear obvious that I should just create two different objects (an in-memory tree and an on-disk tree) to clean things up. Actually, since there is very little in-memory structure for an on-disk tree (just a filename, really) there would be very little benefit to reorganizing the code. As it stands, it's not too hard to follow.
I think the method naming convention should be obvious, but I'll
spell it out anyway. Single-word methods are lower-case only, i.e.
lower(). Multiple-word methods are
lowerUpper(). The in-memory variant of a method starts
with _mem, i.e. _mem_lowerUpper(), while the
on-disk variant starts with _disk, i.e.
_disk_lowerUpper(). A recursively called method ends with
_r, i.e. _mem_lowerUpper_r(). For the most
part, any sub-methods associated with a public method are grouped with
it, unless they stand on their own.
Note that ON_DISK trees are inherently read-only (i.e.,
you cannot insert new words into or remove words from
ON_DISK trees). They're also not going to be as fast as
IN_MEMORY trees. Use them when you want to conserve memory
and when performance isn't your top priority.
TernarySearchTree.compress method exists
for the TernarySearchTree object. However, it is a no-op right now. I
have provided it (along with its regression tests) so that it will be
easy to support compression in the future, if I can implement a decent
compression algorithm.
The TernarySearchTree.save method provides
the option to store a tree on disk in either a wordlist or
"database" format. The "database" format is
essentially an on-disk representation of the in-memory tree, using file
pointers instead of object references. Once a database has been saved,
you have the option of using it to load an in-memory tree, or instead
accessing the database directly from disk, using very little memory in
the process.
I only provide docstrings for public interface methods. I figure that the arguments to the private methods should be fairly obvious given the documentation for the public methods.
Keep in mind that any backslash characters in Epydoc markup must be escaped, so in the markup you'll always see two of them together. This is sometimes a bit confusing.Author: Kenneth J. Pronovici <pronovic@ieee.org>
| Classes | |
|---|---|
Node |
Class representing a node in an in-memory ternary search tree. |
TernarySearchTree |
Class representing a ternary search tree. |
| Exceptions | |
|---|---|
TreeError |
General exception class for this module. |
| Home | Trees | Index | Help |
|
|---|
| Generated by Epydoc 2.1 on Tue Sep 27 17:02:10 2005 | http://epydoc.sf.net |