Suggest Tree: A Data Structure for Efficient Autocomplete

A compressed trie Figure 1: A compressed trie of the terms ant, bat, bean, bear, bird, bug, cat.
A balanced compressed ternary search tree Figure 2: A balanced compressed ternary search tree of the terms ant, bat, bean, bear, bird, bug, cat.

Autocomplete is a feature provided by many computer programs and web applications. It helps the user with typing by suggesting terms that start with the prefix typed so far. If the underlying set of terms is large and the terms are weighted somehow so that only the top k highest-weighted matching terms are suggested, then this requires a data structure that efficiently supports the top-k queries. The suggest tree described here is such a data structure. An implementation in Java can be downloaded here.

A suggest tree is a compressed trie of the given weighted terms where each node holds a weight-ordered list of the k highest-weighted terms that start with the corresponding prefix. The precomputed suggestion lists may seem space-consuming at first, but the average length of a list in the tree tends to be very small as typically most of the nodes are at the bottom of the tree. Note that the number of nodes in the tree cannot be greater than twice the number of terms, since for each term inserted into the tree, at most one new node is added and at most one existing node is split into two nodes. Note also that only the first character of a node is stored explicitly; subsequent characters are read from one of the matching terms.

Suggest trees allow you to insert a new term or to reweight or remove an existing term. The suggestion lists along the path from the term's terminal node to the root may have to be updated in this case. The worst case occurs when you remove a term that is listed in a list of length k, or when you reduce the weight of a term that is or becomes the lowest-weighted element in a list of length k. Since the term may have to be replaced in the list with the highest-weighted unlisted term of the node, the suggestion list of each child node is searched for the first term that is not in the list of the parent. This requires at most k simple address comparisons per child.

The trie structure of a suggest tree is implemented by a compressed ternary search tree. This implementation variant, where the children of a trie node are arranged as a binary search tree, usually offers a good trade-off between time and space costs. For example, searching for a prefix of length m in a balanced ternary search tree of n terms requires at most m + log2(n) character comparisons. One can build such a balanced ternary search tree by inserting the terms in the same order as for a balanced binary search tree of terms.

Last modified: 2014-04-21