| Home | Trees | Index | Help |
|
|---|
| Package WordUtils :: Module dawg :: Class Dawg |
|
object --+
|
Dawg
| Method Summary | |
|---|---|
Default constructor. | |
Adds a key (word) to an existing trie/dawg. | |
Creates indexing for a DAWG. | |
Clears indexing for a DAWG. | |
Creates a new trie object, which is a clone of the trie starting at the given node. | |
Compresses a trie into a dawg, optionally returning compression ratio. | |
Builds a list of leaf nodes (nodes with no children) in the trie. | |
Builds a list of nodes that are a certain level above the bottom of a trie. | |
Returns a sorted list representation of the trie/dawg. | |
Reads a wordlist or database file into a dawg. | |
Reads a Python list into a dawg. | |
Searches for a pattern in a trie/dawg. | |
Removes a key (word) from a dawg. | |
Writes a dawg to disk as a wordlist and/or a portable "database" format. | |
Searches for a complete string in a trie/dawg. | |
Returns the dawg's size in terms of nodes. | |
Traverses a dawg and executes func(word) for each
word. | |
| Inherited from object | |
x.__delattr__('name') <==> del x.name | |
x.__getattribute__('name') <==> x.name | |
x.__hash__() <==> hash(x) | |
T.__new__(S, ...) -> a new object with type S, a subtype of T | |
helper for pickle | |
helper for pickle | |
x.__repr__() <==> repr(x) | |
x.__setattr__('name', value) <==> x.name = value | |
x.__str__() <==> str(x) | |
| Property Summary | |
|---|---|
db: Location of database on disk. | |
indexed: Indicates whether the dawg has been indexed. | |
root: Root node of trie/dawg. | |
type: Dawg type (IN_MEMORY or ON_DISK). | |
wl: Location of wordlist on disk. | |
| Method Details |
|---|
__init__(self,
type=0,
db=None,
wl=None)
|
add(self, key)Adds a key (word) to an existing trie/dawg. This function inserts into a dawg by building a list of the words in the dawg, inserting the new key into the list, and then re-building the dawg based on the new list. Obviously, this is rather memory intensive and inefficient. The whole point of using a dawg in the first place is to avoid having a list of words around. However, I haven't found any other good way to do this. You may not really need this functionality. Normally, a dawg would
be filled using the
|
buildIndex(self)Creates indexing for a DAWG. This method generates excess indexing information for each node.
Included in the indexing information is a list of the node's parents,
and an indication of the node's depth from the top of the trie. This
information is mostly used by the IN_MEMORY dawgs.
|
clearIndex(self)Clears indexing for a DAWG. This method can only be used forIN_MEMORY dawgs.
|
clone(self, node)Creates a new trie object, which is a clone of the trie starting at the given node. Note that the cloned trie will not be the "same" trie from
the perspective of the IN_MEMORY dawgs.
|
compress(self, ratio=False)Compresses a trie into a dawg, optionally returning compression ratio. The compression method is fairly simple. It may not be ideal in terms of performance, but it should yield correct results. Before doing anything else, we index the trie using the When indexing is complete, the first pass at at compression can begin. The main purpose of the first pass is to merge identical leaves. We build a list of leaves, and merge any leaves that have the same value. All leaves are by definition endpoints, and can have no children, so there's no need to do any comparisons against anything other than value. After all of the leaves have been merged, we move onto the second pass at compression. This pass always looks one "level" up from the current level. For instance, the first time through, it looks at the parents of the leaves. We'll compare each of the nodes, and if any two are identical [1], we'll merge them. When this process is complete, we'll construct a new list, up one level (i.e. parents of the nodes we just operated on, which the first time through would be the parents of the parents of the leaves) and start all over again. The whole thing continues until the new list of parents contains just one node, the top node. There is no other valid exit condition, since the trie does not necessarily have a constant depth (because our input words may vary in length). [1] It is important to note that for the second pass, two nodes can only be considered identical if they have the same value, the same endpoint flag and exactly the same set of children - not children with the same values, but the exact same children. Since we work from the bottom to the top, any two nodes that meet these conditions must represent subtries that are recursively identical, all of the way down to the bottom of the trie. Such nodes can then safely be merged without losing any information. On large wordlists, a large portion of the nodes in the resulting trie will be leaves, so a large space savings can be achieved just by merging down to the ideal 26 leaves (one for each letter). The remainder of the processing trims even more space. The compressed dawg often contains only 25% as many unique nodes as the original uncompressed trie did. The algorithm appears to yield databases on disk that are 1-2 times as large as the original wordlist used to construct the dawg, which makes it feasible for a program using this dawg to distribute databases, not wordlists. The algorithm is reasonably fast for moderate-sized lists of words (a few minutes for 80k words on my Duron 850 with 800 MB of RAM). However, as the list grows larger, things slow down considerably. I have already eliminated most of the obvious bottlenecks. For instance, instead of using lists that must be appended to, I use dictionaries when I can, and then generate the lists on the fly as I need them. This got me an order of magnitude improvement in processing time (that 80k list used to take most of five hours to complete). More time spent with the Python profiler has helped me eliminate a few other silly sources of inefficiency. However, I'm now at the point where I can't find anything else obvious with the profiler. I think that for this to get any faster, it's going to take a major algorithmic change, rather than a tweak. I would be happy if someone were to prove me wrong. :-) This method may only be used onIN_MEMORY trees.
|
leaves(self)Builds a list of leaf nodes (nodes with no children) in the trie.
|
level(self, level=0)Builds a list of nodes that are a certain level above the bottom of a trie. We start at the leaves, and then move up a certain number of levels up from the leaves. For instance, level 0 is the leaves themselves, level 1 is the set of the parents of all of the leaves, level 2 is the parents of that set, etc. We can only do this if the DAWG has been indexed, so if it hasn't been yet, we do it here. Note: the results from this method are probably only useful when the method is called against an entirely uncompressed trie, since a given node might be at more than one level in a compressed dawg. This is mostly a debugging method. Also, bear in mind that the "level" in this method is not the same as the depth value generated by thebuildIndex method.
|
list(self)Returns a sorted list representation of the trie/dawg.
|
loadFile(self, db=None, wl=None)Reads a wordlist or database file into a dawg. Either a database or a wordlist may be used, but not both. If a wordlist is used, it is assumed to be properly sorted on disk (i.e. the words must be in alphabetical order). Neither file will ever be written to. Calling this function on a dawg that has been previously initialized will overwrite the contents of that dawg. You may not choose to load a wordlist into anON_DISK
dawg. First, create an IN_MEMORY dawg using the wordlist.
Then, save the IN_MEMORY dawg as a database. Then, you can
load use the database on disk as the source for your
ON_DISK dawg.
|
loadList(self, list)Reads a Python list into a dawg. Calling this function on a dawg that has been previously initialized will overwrite the contents of that dawg. You may only load a list into anIN_MEMORY dawg. If you
need the list in an on-disk dawg, create an IN_MEMORY dawg
using the list. Then, save the IN_MEMORY dawg as a
database. Then, you can load the database on disk as the source for
your ON_DISK dawg.
|
patternSearch(self, pattern)Searches for a pattern in a trie/dawg. A pattern uses the. (period) character to represent a
wildcard. So, for instance, the pattern '.ai.' would match
the words 'rail', 'hail',
'mail', etc. if those words were in the trie/dawg.
|
remove(self, key)Removes a key (word) from a dawg. This function removes a word from a dawg by building a list of the words in the dawg, removing the key from the list, and then re-building the dawg based on the new list. Obviously, this is rather memory intensive and inefficient. The whole point of using a dawg in the first place is to avoid having a list of words around. However, I haven't found any other good way to do this. You may not really need this functionality. If you only need to
remove a few words from an existing dawg, consider whether you might be
better off just keeping around a small Python list or dictionary to
hold those words. You could then use the list or dictionary to validate
the results returned from the
|
save(self, db=None, wl=None, overwrite=False)Writes a dawg to disk as a wordlist and/or a portable "database" format. Both the ON_DISK, then this
method cannot be used to overwrite the in-use database on disk.
|
search(self, key)Searches for a complete string in a trie/dawg.
|
size(self)Returns the dawg's size in terms of nodes. This method works by creating a dictionary containing an entry for each unique node. It's rather inefficient, and you may not want to use it in production code. This method can only be used forIN_MEMORY dawgs.
|
traverse(self, func=None)Traverses a dawg and executes
lambda x: sys.stdout.write("%s\n" % x)
which will print out each word on its own line (the words will be
unordered).
|
| Property Details |
|---|
dbLocation of database on disk. |
indexedIndicates whether the dawg has been indexed. |
rootRoot node of trie/dawg. |
typeDawg type (IN_MEMORY or ON_DISK). |
wlLocation of wordlist on disk. |
| Home | Trees | Index | Help |
|
|---|
| Generated by Epydoc 2.1 on Tue Apr 4 21:54:30 2006 | http://epydoc.sf.net |