William C. Wake
1-30-96
NOTE: This draft does not have figures.
Introduction
A user browsing a library’s online collection relies on several functions to locate useful items: searching, sorting, and restricting sets of items. For the SortTables interface to support these functions, a key problem is how to provide a subset of the data, sorted by a particular attribute. To do this, we take advantage of these characteristics of the interaction:
- Queries can be regarded as operating on positions rather than attributes,
- Sequences of queries tend to operate on ever-smaller sets of items, and
- We don’t need the full database, just a subset that is visible on the screen.
The Data
The data consists of N records, each with P fields. We can represent this graphically:
Figure omitted
To support the SortTables interface, we will formalize queries against this set, and develop the thread-file data structure.
First, we will treat the set of records as a sequence. The ordering of the sequence is immaterial; the key is that this lets us treat a record’s position as a record identifier:

    records: [1,N] → Record
Since the interface deals with sorting, we will consider permutations of the records, one permutation per field. (A record’s position in the permutation corresponds to its position in the list of records sorted by that field.)

    perm: [1,P] → ([1,N] → [1,N])

Here perm(i)(j) is the identifier of the record at position j when the records are sorted by field i.
Given these permutations, we will want to ask “for each record, what is its position in each permutation?”:

    inv: [1,N] × [1,P] → [1,N], where inv(r, i) = perm(i)^-1(r)
Putting these together, we will consider our database to be this:

    DB = (records, perm, inv)
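As a concrete illustration (not the paper’s actual implementation), this database might be laid out in C as follows. All names here are assumptions, with the 4-byte positions assumed later in the space analysis:

    #define P_FIELDS 5                /* P: number of fields (illustrative) */

    typedef int Position;             /* a position in [1,N]; 4 bytes */

    typedef struct {
        long n;                       /* N: number of records */
        Position *perm[P_FIELDS];     /* perm[i][j]: id of the record at sorted
                                         position j of field i (0-based in C) */
        Position (*inv)[P_FIELDS];    /* inv[r][i]: position of record r in
                                         permutation i */
    } Database;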
Queries
A SortTables query can be described as a set of ranges of attribute positions, and a chosen permutation. (At first glance, it might appear that a query is over a range of values rather than positions, but this is not so: one could delete only some entries with a given value.)
A range describes a set of positions:

    Range = [1,N] × [1,N], where the range (lo, hi) denotes { x : lo ≤ x ≤ hi }
A query consists of a permutation choice and a set of ranges:

    Query = [1,P] × Range^P
The result of a query Q = (p, <r_1, ..., r_P>) against a database DB is:

    result(DB, Q) = <x ∈ [1,N] : inv(x, i) ∈ r_i for all i ∈ [1,P]>, ordered by increasing inv(x, p)
This result consists of those records that satisfy the query ranges, ordered by the query-specified permutation. The style of interaction leads the user to compose a series of increasingly restrictive queries:

    Q_1, Q_2, ..., Q_k
In such a sequence, the result of a query is a subset of the results of all preceding queries.
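To make these definitions concrete, here is a brute-force C evaluation sketch, continuing the illustrative types above; Range and Query transcribe the definitions directly, and evaluate is an assumption, not the thread-file algorithm:

    typedef struct { Position lo, hi; } Range;

    typedef struct {
        int p;                        /* chosen permutation, 0-based */
        Range range[P_FIELDS];        /* one range of positions per field */
    } Query;

    /* Append to out, in permutation order, every record satisfying q;
       return the number of records written. */
    long evaluate(const Database *db, const Query *q, Position *out)
    {
        long count = 0;
        for (long j = 0; j < db->n; j++) {      /* walk in sorted order */
            Position r = db->perm[q->p][j];     /* record at position j */
            int ok = 1;
            for (int i = 0; i < P_FIELDS; i++) {
                Position pos = db->inv[r][i];
                if (pos < q->range[i].lo || pos > q->range[i].hi) {
                    ok = 0;
                    break;
                }
            }
            if (ok)
                out[count++] = r;               /* satisfies all ranges */
        }
        return count;
    }

Because the outer loop follows perm(p), the output is already in the query-specified order; the data structures below aim to avoid touching all N records.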
Extreme Data Structures
For maximum speed, we could keep track of all combinations of permutations and index positions. For each combination, we can pre-compute the result sequence:

    M: [1,P] × ([1,N] × [1,N])^P → seq [1,N]
At a (huge) cost in space, we get instant results. (For N=1M, P=5, we will have a table with about 5×10^60 entries.)
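To spell out that count: there is one precomputed result per element of M’s domain, so the table has

    P × (N × N)^P = 5 × (10^6 × 10^6)^5 = 5 × 10^60

entries.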
At the other extreme, we might consider a data structure with no space overhead: keep the records in an array, and sort the array on demand. To remove items, just adjust the array bounds. This latter approach is fine when the data set is small (fits in main memory), and in fact the current implementation uses it under certain circumstances. Two problems make it impractical for large data sets. First, if the items aren’t in memory, the cost of sorting increases greatly (as accessing each item potentially causes a page fault). Second, the O(n log n) cost of a typical sort is too slow when sets are large (even if they fit in memory); for N=1M items, that is roughly 20 million comparisons per sort, which will strain even fast CPUs.
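A minimal C sketch of this sort-on-demand approach, assuming an illustrative fixed-width Record layout (the draft does not specify the record format):

    #include <stdlib.h>

    typedef struct {
        int field[P_FIELDS];          /* attribute values, one per field */
    } Record;

    static int sort_field;            /* field to sort by (set before qsort) */

    static int compare_records(const void *a, const void *b)
    {
        const Record *ra = a, *rb = b;
        return (ra->field[sort_field] > rb->field[sort_field])
             - (ra->field[sort_field] < rb->field[sort_field]);
    }

    /* Sort the n records in place by the given field; O(n log n). */
    void sort_on_demand(Record *records, size_t n, int f)
    {
        sort_field = f;
        qsort(records, n, sizeof(Record), compare_records);
    }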
An Initial Data Structure
To stake out a middle ground, we’ll consider a data structure that maintains descriptive information about subsets of the data, in order to quickly eliminate records from consideration. This initial data structure uses complete binary trees of bounds information, one tree for each permutation.
A bounds object describes the ranges of a set of items:

    Bounds = Range^P; for a set S of records, bounds(S) = <(min { inv(r, i) : r ∈ S }, max { inv(r, i) : r ∈ S }) : i ∈ [1,P]>
For example, suppose we have three permutations (P=3) and a set of four records drawn from a larger collection. In line with the Permutation-Inverse mapping, represent each record as a P-tuple, with a value for its position in each permutation. If these values are <3,5,4>, <1,1,1>, <2,3,6>, and <17,2,3>, then the bounds will be <(1,17), (1,5), (1,6)>. (That is, positions in the first column range from 1 to 17, in the second from 1 to 5, and in the third from 1 to 6.)
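Continuing the C sketches, a bounds object is just the componentwise min/max of the records’ position tuples (compute_bounds is an assumed helper):

    typedef struct { Range range[P_FIELDS]; } Bounds;

    /* Compute the bounds of the n records whose ids are listed in ids[]. */
    Bounds compute_bounds(const Database *db, const Position *ids, long n)
    {
        Bounds b;
        for (int i = 0; i < P_FIELDS; i++)
            b.range[i].lo = b.range[i].hi = db->inv[ids[0]][i];
        for (long k = 1; k < n; k++) {
            for (int i = 0; i < P_FIELDS; i++) {
                Position pos = db->inv[ids[k]][i];
                if (pos < b.range[i].lo) b.range[i].lo = pos;
                if (pos > b.range[i].hi) b.range[i].hi = pos;
            }
        }
        return b;
    }

Applied to the four example records, this computes exactly <(1,17), (1,5), (1,6)>.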
We will use a tree of bounds, corresponding to binary ranges.
The leaves of the tree are the records, ordered according to one of the permutations. The levels of the tree correspond to increasingly finer partitions of the records based on their position in the permutation:
    [1,N]
        [1,N/2]
            [1,N/4] ...
            [N/4+1, N/2] ...
        [N/2+1, N]
            [N/2+1, 3N/4] ...
            [3N/4+1, N] ...
(This tree extends out until each range represents a single record, e.g., [5,5]).
For our data structure, we need one such tree for each permutation.
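One way to build each tree, sketched with the same illustrative types (build is an assumed helper): leaves take the records in permutation order, and each interior node’s bounds are the componentwise union of its children’s:

    #include <stdlib.h>

    typedef struct Tree {
        Bounds bounds;                /* ranges covered by this subtree */
        int bad;                      /* set once the subtree is eliminated */
        struct Tree *left, *right;    /* NULL at a leaf */
        Position record;              /* record id (meaningful only at leaves) */
    } Tree;

    /* Build the subtree over sorted positions [lo,hi] of permutation p. */
    Tree *build(const Database *db, int p, long lo, long hi)
    {
        Tree *t = malloc(sizeof *t);
        t->bad = 0;
        if (lo == hi) {                          /* leaf: a single record */
            t->left = t->right = NULL;
            t->record = db->perm[p][lo];
            for (int i = 0; i < P_FIELDS; i++)
                t->bounds.range[i].lo = t->bounds.range[i].hi
                                      = db->inv[t->record][i];
        } else {                                 /* interior: union of children */
            long mid = lo + (hi - lo) / 2;
            t->left  = build(db, p, lo, mid);
            t->right = build(db, p, mid + 1, hi);
            for (int i = 0; i < P_FIELDS; i++) {
                Range a = t->left->bounds.range[i];
                Range b = t->right->bounds.range[i];
                t->bounds.range[i].lo = a.lo < b.lo ? a.lo : b.lo;
                t->bounds.range[i].hi = a.hi > b.hi ? a.hi : b.hi;
            }
        }
        return t;
    }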
To evaluate a query against this data structure, use the tree corresponding to the chosen permutation. Walk the tree using Algorithm 1, looking for items within the query ranges. When a node is potentially valid (its bounds information doesn’t contradict the query), look at its children. Should both children be determined not to match the query, mark the node bad as well and backtrack.
ALGORITHM 1. Bounds tree walk.
procedure walk (t: Tree, q: Query, s: sequence of Record)
    if t.bad then return fi        // Skip trees known to be bad
    // Check for any dimension out of range
    if ~(FORALL i in [1,P]:
            q.range[i].lower <= t.bounds.range[i].upper
            AND q.range[i].upper >= t.bounds.range[i].lower)
    then
        t.bad := true              // Yes, so tree is bad
    else
        if t is a leaf then        // Found a good record
            append the corresponding record to s
        else                       // Look in subtrees
            walk (t.left, q, s)
            walk (t.right, q, s)
            // Check for both children eliminated
            if t.left.bad AND t.right.bad then
                t.bad := true
            fi
        fi
    fi
end
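A direct C transcription of Algorithm 1, under the same illustrative types; the array out and counter count together play the role of the sequence s:

    /* Walk the bounds tree, appending matching record ids to out. */
    void walk(Tree *t, const Query *q, Position *out, long *count)
    {
        if (t->bad)
            return;                              /* skip trees known to be bad */
        for (int i = 0; i < P_FIELDS; i++) {     /* any dimension out of range? */
            if (q->range[i].lo > t->bounds.range[i].hi ||
                q->range[i].hi < t->bounds.range[i].lo) {
                t->bad = 1;                      /* yes, so tree is bad */
                return;
            }
        }
        if (t->left == NULL) {                   /* leaf: found a good record */
            out[(*count)++] = t->record;
        } else {                                 /* look in both subtrees */
            walk(t->left, q, out, count);
            walk(t->right, q, out, count);
            if (t->left->bad && t->right->bad)   /* both children eliminated */
                t->bad = 1;
        }
    }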
Assume that a position takes 4 bytes, and that N is a power of 2. Then each bounds object will take 2×4×P bytes. Each tree has about 2N nodes, so each tree requires 8P×2N = 16PN bytes. There are P trees, so total tree storage requires 16P²N bytes.
A typical data set might have N=1M, P=5, so we would need 16 × 25 × 10^6 = 400M bytes (just for the trees, not for the records themselves). This is clearly too large for a typical main-memory configuration (though see [Bentley 1995]).
A Slimmer Data Structure
The above data structure is moderately expensive in space, but we can trim it. The P trees are not independent: while the interior nodes are unrelated to each other, the leaf nodes of the trees (the records) all correspond to the same Permutation-Inverse mapping, and so need to be stored only once. This leaves us with PN interior nodes, each needing 2×4×P bytes, or 8P²N bytes of bounds information (200M bytes for the typical data set above).
We will spend some of this savings to take advantage of the characteristics of the query sequence, by adding “bad” bits for each node and record. Since the results in a sequence are subsets, once a node or record is found not to qualify, its bit is set, and it never needs to be evaluated again for that query sequence. This requires PN/8 bytes for the interior nodes, and N/8 bytes for the records.
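For the typical data set above (N=1M, P=5), this overhead is modest:

    PN/8 = 5 × 10^6 / 8 = 625K bytes (interior-node bits)
    N/8  =     10^6 / 8 = 125K bytes (record bits)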
The current implementation uses only three levels of the tree: the disk-block level (holding 64 records per block), a lower level which stores information about sets of 16 records, and the records at the leaves of the tree. The 16-record “bucket” size was chosen empirically; its average range size was N/2. The current implementation uses “bad” bits only for the disk-block level. (Higher levels of the tree provide relatively little useful bounds information, as they tend to act as large samples of the input space, so we can ignore them and start looking lower in the tree.)
For Envision (N=100K, P=5), this works out to about 400K bytes for the disk-block information, and 1.6M bytes for the 16-record layer, or about 2M bytes total of main-memory data (plus a cache for record information paged in).
Preliminary Analysis
There are two common situations where this data structure is optimal. First, when there are no restrictions in effect (e.g., when a user is searching and sorting, but not restricting the results), the data structure is acting as a simple sorted index of attribute values. (This is similar to the Scan facility of the Z39.50 information retrieval standard [ref].) The second ‘easy’ case is where the data is restricted on one attribute, and the permutation is on the same attribute. This is a straightforward variation of the preceding case.
When there are restrictions but they aren’t very stringent, the bounds tree will be most helpful. (Most items will be valid, and the bounds tree and bad bits can help eliminate those items that aren’t.) In this case, the tree just delivers most of the items in sorted order.
When multiple restrictions are involved and the ranges become tight, the data structure becomes more expensive (even prohibitively so). For example, suppose there are 10 items resulting from the query “Year > 1960 and Author=Sedgewick”. When this set is sorted by Author, the range is tight. If we switch to sorting by call number, the items become scattered through the full call-number permutation, and most buckets’ bounds will still overlap the query ranges. Scanning all the buckets just to find those ten items will be very expensive.
The current implementation switches to a simple array representation when the number of items is relatively small (on the order of 1000 items). This way, we use the bounds tree when the restrictions are moderate and sorting would be impractical, and we use the sorted array when the bounds tree is less helpful.
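A sketch of that switch; the threshold constant and the small-set path are assumptions, since the draft does not give the actual dispatch logic:

    #include <stdlib.h>

    #define SMALL_SET_THRESHOLD 1000  /* order of magnitude from the text */

    static const Database *g_db;      /* context for the comparator */
    static int g_perm;

    static int by_position(const void *a, const void *b)
    {
        Position ra = *(const Position *)a, rb = *(const Position *)b;
        return g_db->inv[ra][g_perm] - g_db->inv[rb][g_perm];
    }

    /* Evaluate q over the ids surviving the previous query in the sequence;
       ids doubles as the output buffer. */
    long query_adaptive(const Database *db, Tree *trees[], const Query *q,
                        Position *ids, long n_ids)
    {
        if (n_ids < SMALL_SET_THRESHOLD) {       /* small set: filter and sort */
            long m = 0;
            for (long k = 0; k < n_ids; k++) {
                int ok = 1;
                for (int i = 0; i < P_FIELDS; i++) {
                    Position pos = db->inv[ids[k]][i];
                    if (pos < q->range[i].lo || pos > q->range[i].hi) {
                        ok = 0;
                        break;
                    }
                }
                if (ok)
                    ids[m++] = ids[k];
            }
            g_db = db;
            g_perm = q->p;
            qsort(ids, (size_t)m, sizeof(Position), by_position);
            return m;
        } else {                                 /* large set: bounds-tree walk */
            long count = 0;
            walk(trees[q->p], q, ids, &count);
            return count;
        }
    }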
[NOTE: I know I need to either work up a probabilistic analysis of this, or at least simulate the effects under various loads.]
Brief Comparison to Related Methods
There have been many data structures proposed to deal with queries against multi-dimensional data. This section surveys a few of them.
Bentley [Bentley 1975] proposed k-d trees, an extension of binary search trees. A parent’s children are ordered with respect to one of the keys (usually a different key than the parent itself). Bentley’s “region query” is a general matching function based on walking this tree and comparing bounds information against the nodes to determine which ones match. The bounds information is not explicitly stored, but rather derived piecemeal during the walk. The nodes are not produced in any particular order (they are sorted only locally with respect to the parent).
Willard [Willard 1985] modifies this approach by introducing an auxiliary data structure that is searched only for critical nodes (those for which the children satisfy a constraint, but the parent does not). This reduces the search time, but does not address sorting the retrieved set.
Chazelle [Chazelle 1986] develops a structure in which the leaves are sorted with respect to one of the axes, and each internal node has an auxiliary subtree associated with it, to answer orthogonal range queries. He also develops the idea that, since a range-reporting query must take time at least proportional to the number of items returned, a naive search with post-filtering of invalid records can be used when the output is large.
Finally, Nievergelt et al. [Nievergelt, Hinterberger et al. 1984] developed the grid file. It creates a two-layer structure: a k-dimensional grid maintained in main memory, and buckets on disk containing the items in a particular grid cell. Items in a bucket are guaranteed to fit a particular k-dimensional rectangle. Unfortunately, the data-file size and number of dimensions we need require that the grid file itself be stored on disk as well, and demand-paged. Further, this data structure does not address sorting: a sorted result could potentially require hundreds of disk reads.
In summary, the sorting requirement is the prime thing differentiating the thread file from these other data structures. The notion of a query sequence lets us amortize the cost of finding “bad” nodes across multiple queries. Finally, with the exception of the grid file, these other algorithms don’t maintain a distinction between main and secondary memory for storage costs or access time.
Bibliography
Bentley, Jon (1995). “Software Explorations: A Megabyte Here, A Megabyte There.” Unix Review, (April): 85-91.
Bentley, Jon Louis (1975). “Multidimensional Binary Search Trees Used for Associative Searching.” CACM, 18(9): 509-517.
Chazelle, Bernard (1986). “Filtering Search: A New Approach to Query-Answering.” SIAM Journal on Computing, 15(3): 703-724.
Manber, Udi, and Sun Wu (1993). “GLIMPSE: A Tool to Search Through Entire File Systems,” TR 93-34, University of Arizona Department of Computer Science.
Nievergelt, J., H. Hinterberger, and K. C. Sevcik (1984). “The Grid File: An Adaptive, Symmetric, Multi-key File Structure.” ACM TODS, 9(1): 38-71.
Willard, Dan E. (1985). “New Data Structures for Orthogonal Range Queries.” SIAM Journal on Computing, 14(1): 232-253.