Suggest Tree: A Data Structure for Efficient Autocomplete


A compressed trie
Figure 1: A compressed trie with the terms ant, bat, bean, bear, bird, bug and cat
A ternary search tree
Figure 2: The trie from Figure 1, implemented as a ternary search tree

Autocomplete is a feature provided by many applications. It helps the user with typing by suggesting terms that start with the prefix typed so far. But if the underlying set of terms is large and only the top k matches are suggested, how can we efficiently implement such a feature? One approach is to use a data structure like the Suggest Tree described here. It enables you to quickly look up the highest-weighted terms with a given prefix in a set of weighted terms. And it allows you to insert, reweight, or remove a term at moderate cost. A Suggest Tree in Java can be downloaded here.

The Suggest Tree for a given set of weighted terms and a given value of k is a compressed trie of the terms in which each node holds a weight-ordered list of the k highest-weighted terms in its subtree. The precomputed suggestion lists may seem space-consuming at first, but the average length of a list in the tree tends to be very small, as typically most of the nodes are at the bottom of the tree. Note that a compressed trie with n terms has at most 2n nodes (for each term inserted into the trie, at most one new node is added and at most one existing node is split into two nodes). Note also that the character sequence of a node does not need to be stored explicitly – it can be read from any of the terms in the node's subtree.

When you insert, reweight, or remove a term, the suggestion lists along the path from the term's terminal node to the root of the tree may need to be updated. The worst case occurs when you remove a term that is listed in a list of length k, or when you reduce the weight of a term that is or becomes the last element in a list of length k. In both situations, the term may need to be replaced in that list with the highest-weighted unlisted term of the list. To find the top unlisted term of a list L, we can search in the suggestion list of each child node for the first term that is not listed in L. But caution: if the parent node that holds L is the terminal node of an unlisted term, then that term is a candidate for the top unlisted term of L too.

The trie structure of a Suggest Tree is implemented as a ternary search tree. This implementation, where the children of a trie node are arranged as a binary search tree, usually offers a good trade-off between time and space costs. The only drawback is that the performance of a ternary search tree is affected to some extent by the order in which you insert the terms into the tree. Suppose you have an alphabetically sorted set of n terms: if you insert the terms in alphabetical order, each of the small binary search trees within the tree degenerates into a linked list – if, by contrast, you insert first the middle term and then recursively the middle term of the left and right subsets, you end up with a balanced tree where searching for a string of length m will require at most m + log(n) character comparisons. Fortunately, if you insert the terms in random order, a ternary search tree is usually close to balanced.


© 2008-2014 Nicolai Diethelm