Suggest Tree: A Data Structure for Efficient Autocomplete

A compressed trie Figure 1: Compressed trie for the terms
ant, bat, bean, bear, bird, bug and cat
A ternary search tree Figure 2: The trie from Figure 1,
implemented as a ternary search tree

Autocomplete is a feature provided by many computer programs and web applications. It helps the user with typing by suggesting terms that start with the prefix typed so far. If the underlying set of terms is large and the terms are weighted somehow so that only the k highest-weighted matching terms are suggested, then the terms must be stored in a data structure that allows for a fast lookup of the top k matches. The "Suggest Tree" described here is such a data structure. An implementation in Java can be downloaded here.

The Suggest Tree for a given set of weighted terms is a compressed trie of the terms (see Figure 1) where each node holds a weight-ordered list of the k highest-weighted terms that start with the corresponding prefix. The precomputed suggestion lists may seem space-consuming at first, but the average length of a list in the tree tends to be very small as typically most of the nodes are at the bottom of the tree. Note that the number of nodes in the tree cannot be greater than twice the number of terms because for each term inserted into the tree, at most one new node is added and at most one existing node is split into two nodes. Note also that the character sequence of a node does not have to be stored explicitly – it can be read from any of the matching terms.

A Suggest Tree allows us to insert a new term or to reweight or remove an existing term. The suggestion lists along the path from the term's terminal node to the root of the tree can be updated quite easily. The only difficult situation arises when we remove a term that is listed in a list of length k, or when we reduce the weight of a term that is the last element in a list of length k. In both cases the term may have to be replaced in that list with the top unlisted term. To find the top unlisted term, we can search the suggestion list of each child node for the first term that is not in the list of the parent. This requires at most k simple address comparisons per child.

There are many ways to implement the trie structure of a Suggest Tree. A time- and space-efficient implementation is the ternary search tree, where the children of a trie node are arranged as a binary search tree (see Figure 2). Searching for a prefix of length m in a balanced ternary search tree with n terms requires at most m + log2 n character comparisons. We can build such a balanced tree by sorting the terms into alphabetical order and then inserting first the middle term and then recursively the middle term of the left and right subset. Fortunately, if we insert the terms in random order, a ternary search tree is usually close to balanced. Note that the frequently visited nodes near the root of a ternary search tree tend to be in the cache so that they can be accessed faster.

© 2008-2014 Nicolai Diethelm