Suggest Trees: A Data Structure for Efficient Autocomplete

A compressed trie
Figure 1: A compressed trie with the terms
ant, bat, bee, beef, bird, bug, and cat
A ternary search tree
Figure 2: The trie from figure 1,
implemented as a ternary search tree

Autocomplete is a feature provided by many popular applications. It helps the user with typing by suggesting terms that start with the prefix typed so far. But if the underlying set of terms is large and only the top k matching terms are suggested, how can we efficiently implement such a feature? One approach is to use a data structure like the Suggest Tree presented here. It enables you to quickly look up the k highest-weighted terms with a given prefix in a set of weighted terms. And it allows you to insert, reweight, or remove a term at low cost.

Basically, a Suggest Tree is just a compressed trie of the terms in which each node holds a weight-ordered list of the k highest-weighted terms in its subtree. The precomputed suggestion lists may seem space-consuming at first, but the average length of a list in the trie tends to be very small, as typically most of the nodes are at the bottom of the trie. Note that a compressed trie with n terms has at most 2n nodes, since for each term inserted into the trie, at most one new node is added and at most one existing node is split into two nodes. Note also that the character sequence of a node does not need to be stored explicitly; it can be read from any of the terms in the subtree.

There are many ways to implement the edges of a trie. One of the simplest is to organize the child nodes of a trie node as a linked list, with the parent keeping a pointer only to the first child. Another, more efficient way is to organize the children as a binary search tree, with the parent keeping a pointer only to the root child. This structure is called a ternary search tree, and we use it for the Suggest Tree.

Ternary search trees are sensitive to the order in which you insert the terms into the tree: If you have an alphabetically sorted set of terms and you insert first the middle term and then recursively the middle term of the left and right subsets, you end up with a balanced ternary search tree where the search space is cut more or less in half each time you go left or right in the tree. If, by contrast, you insert the terms in alphabetical order, each of the small binary search trees within the ternary search tree degenerates into a linked list, increasing the cost of lookups (especially when the alphabet is large). Fortunately, if you insert the terms in random order, a ternary search tree with many terms is virtually always close to balanced. The number of "bad" insertion orders is just too small compared to the number of "good" insertion orders.

To get a tree that always looks as if the terms have been inserted in random order, no matter what the actual insertion order was, the ternary search tree underlying a Suggest Tree is randomized. The idea, adopted from treaps, is to give each term a randomly chosen numeric priority and arrange the nodes as if the terms have been inserted in priority order. For this, each node is given a priority too, namely the priority of the highest-priority term in its subtree, and each of the binary search trees within the tree is organized as a heap with respect to the node priorities.

When you insert, reweight, or remove a term, the suggestion lists along the path from the term's terminal node to the root of the trie may need to be updated. The worst case occurs when you remove a term that is listed in a list of length k, or when you reduce the weight of a term that is or becomes the lowest-weighted element in a list of length k. In both situations, the term may need to be replaced in that list with the highest-weighted term in the subtree that has not yet made it into the list. To find this top unlisted term, it suffices to look at the node holding the list and to search in the suggestion list of each child node.   Javadoc   Download

© 2008-2014 Nicolai Diethelm