Suggest Tree: A Data Structure for Efficient Autocomplete

A compressed trie Figure 1: A compressed trie of the terms ant, bat, bean, bear, bird, bug and cat
A balanced compressed ternary search tree Figure 2: A balanced compressed ternary search tree of the terms ant, bat, bean, bear, bird, bug and cat

Autocomplete is a feature provided by many computer programs and web applications. It helps the user with typing by suggesting terms that start with the prefix typed so far. If the underlying set of terms is large and the terms are weighted somehow (e.g. by popularity) so that only the k highest-weighted matching terms are suggested, then this requires a data structure that efficiently supports the top-k queries. The suggest tree described here is such a data structure. An implementation in Java can be downloaded here.

The suggest tree for a given set of weighted terms is a compressed trie of the terms where each node holds a weight-ordered list of the k highest-weighted terms that start with the prefix or prefixes corresponding to the node. The precomputed suggestion lists may seem space-consuming at first, but the average length of a list in the tree tends to be very small as typically most of the nodes are at the bottom of the tree. Note that the number of nodes in the tree cannot be greater than twice the number of terms, since for each term inserted into the tree, at most one new node is added and at most one existing node is split into two nodes.

Suggest trees allow you to insert a new term or to reweight or remove an existing term. The suggestion lists along the path from the term's terminal node to the root are updated if necessary. The worst case occurs when you remove a term that is listed in a list of length k, or when you reduce the weight of a term that is or becomes the lowest-weighted element in a list of length k. In both situations, the term may have to be replaced in the list with the highest-weighted unlisted term. To find this top unlisted term, the suggestion list of each child node is searched for the first term that is not in the list of the parent. This requires at most k simple address comparisons per child.

The trie structure of a suggest tree is implemented by a compressed ternary search tree. This trie implementation variant, where the children of a trie node are arranged as a binary search tree, usually offers a good trade-off between time and space costs. For example, searching for a prefix of length m in a balanced ternary search tree of n terms requires at most m + log2(n) character comparisons. One can build such a balanced ternary search tree from an alphabetically sorted set of terms by inserting first the middle term and then recursively the middle term of the left and right subset. Note that the frequently visited nodes near the root tend to be in the cache so that they can be accessed faster. Note also that only the first character of a node is stored explicitly; subsequent characters are read from one of the matching terms.

Last modified: 2014-09-01