Suggest Tree: A Data Structure for Efficient Autocomplete

Suggest Tree in Java:  Javadoc  Download
A compressed trie
Figure 1: A compressed trie with the terms ant, bat, bean, bear, bird, bug and cat
A ternary search tree
Figure 2: The trie from Figure 1, implemented as a ternary search tree

Autocomplete is a feature provided by search engines and many other applications. It helps the user with typing by suggesting terms that start with the prefix typed so far. If the underlying set of terms is large and the terms are weighted somehow (e.g. by popularity), then the data structure containing the terms must allow for a fast lookup of the top k terms with a given prefix. And in case the data is dynamic with frequent changes, the structure must also allow you to insert, reweight, or remove a term at low cost. The Suggest Tree described here meets these requirements.

The Suggest Tree for a set of weighted terms is a compressed trie of the terms in which each node holds a weight-ordered list of the k highest-weighted terms in its subtree. The precomputed suggestion lists may seem space-consuming at first, but the average length of a list in the tree should be small in practice, as typically most of the nodes are at the bottom of the tree. Note that the number of nodes in the tree cannot be greater than twice the number of terms, because for each term inserted into the tree, at most one new node is added and at most one existing node is split into two nodes. Note also that only the first character of a node is stored explicitly – subsequent characters are read from one of the matching terms.

When you insert, reweight, or remove a term, the suggestion lists along the path from the term's terminal node to the root of the tree may need to be updated. The worst case occurs when you remove a term that is listed in a list of length k, or when you reduce the weight of a term that is or becomes the last element in a list of length k. In both situations, the term may need to be replaced in that list with the top unlisted term. The algorithm for finding the top unlisted term of a list L searches the suggestion list of each child node for the first term that is not listed in L. This requires at most k simple address comparisons per child. If the node holding list L is the terminal node of a term, the algorithm also checks whether that term is the top unlisted term.

The trie structure of a Suggest Tree is implemented as a ternary search tree. This implementation, where the children of a trie node are arranged as a binary search tree, usually offers a good trade-off between time and space costs. The only drawback is that the performance of a ternary search tree is affected to some extent by the order in which you insert the terms into the tree. Suppose you have an alphabetically sorted set of n terms: if you insert the terms in alphabetical order, each of the small binary search trees within the tree degenerates into a linked list – if, by contrast, you insert first the middle term and then recursively the middle term of the left and right subset, you end up with a balanced tree where searching for a prefix of length m will require at most m + log(n) character comparisons. Fortunately, if you insert the terms in random order, a ternary search tree is usually close to balanced.

© 2008-2014 Nicolai Diethelm