Many autocomplete features suggest the top k highest-weighted terms that start with the prefix typed by the user. If the underlying set of weighted terms is large, the terms must be held in a specialized data structure that efficiently supports the top-k queries. The "suggest tree" described here is such a data structure. An implementation in Java can be downloaded here.
The basic idea is to store the terms in a compressed trie where each node holds a precomputed list of autocomplete suggestions, or more precisely, a weight-ordered list of the k highest-weighted terms that start with the corresponding prefix. The suggestion lists may seem space-consuming at first, but the average length of a suggestion list in the trie tends to be very small as typically most of the nodes are at the bottom of the trie. Notice that for each term inserted into the trie, at most one new node is added and at most one existing node is split into two nodes.
When a term is inserted, re-weighted, or removed, the corresponding suggestion lists are updated in a bottom-up manner, starting with the bottommost list and working towards the root. A special situation occurs when a term to be removed is an element of a suggestion list of length k, or when a term to be weighted lower is or becomes the lowest-weighted element of a suggestion list of length k. In both cases, the term may have to be replaced in this list with the highest-weighted unlisted term. To find this top unlisted term, the suggestion list of each child node is searched for the first term that has not yet made it into the upper list.
The trie is implemented by a ternary search tree. In this simple but efficient trie implementation, the child nodes of a trie node are arranged as a binary search tree. If a ternary search tree is built from an alphabetically sorted list of n terms by inserting first the middle term and then recursively the middle term of the left and right sublist, the resulting tree is balanced so that searching for a prefix of length m requires at most m + log n character comparisons. Roughly speaking, each comparison either consumes one character from the prefix or cuts the search space in half. By contrast, if the terms are inserted in alphabetical order, each of the small binary search trees within the tree degenerates into a linked list, increasing the cost of lookups. Fortunately, if the terms are inserted in random order, a ternary search tree is usually close to balanced.
The terminal node of a term stores the weight of the term and a reference to the term itself. Each suggestion list is actually a list of references to terminal nodes. Also, only the first character of a node is stored explicitly, subsequent characters are read from the referenced term (non-terminal nodes reference the same term as their middle child).