Categories
dissertation SortTables

SortTables – CIKM ’95

SortTables – A Browser for a Digital Library

William C. Wake* and Edward A. Fox*#

Department of Computer Science* and Computing Center#
Virginia Polytechnic Institute and State University
Blacksburg, VA 24061-0106
{wakew, fox}@cs.vt.edu

Abstract: 

Much research in information retrieval has focused more on
matching results to queries than on browsing those results. After
briefly exploring browsing in physical and electronic libraries, we
introduce SortTables, a new system that focuses on support for
browsing. We explore the evolution of the system in light of early
implementation experience and formative evaluation of the
interface. Finally, we briefly review related work, and discuss
future directions.

To appear in CIKM ’95, The 4th International Conference on
Information and Knowledge Management, Nov. 28-Dec. 2, 1995,
Baltimore, Maryland. Draft: August 14, 1995.

Browsing in the Physical Library

In a physical library, we think of browsing as moving among the
shelves, looking at items of possible interest. Several
characteristics of this type of browsing stand out: access to the
actual items, potential access to all the items, a sense of
neighborhood, variable focus, and the opportunity for
serendipity.

Physical browsing provides access to the actual items. For
example, books provide powerful cues about their meaning in a
situated way, through physical cues as well as content. A worn-out
book might be regarded as more (or less) interesting because of its
apparent popularity. A thick book might look like too much work;
one with an interesting cover might be chosen instead.

Browsing potentially provides access to all the items.
One has the sense that one could start at one end and go to the
other, looking at each book. This is different from trying to
retrieve books by generating queries: we don’t have to know all the
“query terms” in advance, and we don’t have to deal with the same
item twice (because it was retrieved by different queries).
The sense of neighborhood is an important reason that browsing
is effective. Since books are organized by call number, and call
numbers are assigned in a way that keeps related books together, we
often find books on related topics close together.

Browsing allows a variable focus: in a promising area, one can
systematically examine all items; when items in that area aren’t
useful, one can make a sweeping movement through large areas,
skipping the uninteresting ones.

Finally, browsing allows for a natural serendipity: one can
choose a book at random from a particular area, or from a random
area of the library. As with variable focus, one has control over
how random one’s selection can be.

Browsing in the Electronic Library

Browsing in the electronic library can retain many of these
characteristics, except for access to the physical items.

Electronic browsing can provide something not feasible in the
physical library: multiple orderings of the items. We’re not
restricted to ordering strictly by call number: we can think of
browsing by author or date. (One could imagine a library where the
three-dimensional arrangement has significance; for example, the
floor of an item could correspond to year of publication or
publication type. However, the particulars of the three-dimensional
arrangement are usually only accidental, as shelves are usually
linearly organized by call number.)

Before computers, card catalogs offered some of this facility:
there were often card catalogs organized by author, subject, and
title. It would be unusual, however, to find a card catalog devoted
to organizing items by publication date or date of acquisition.
Electronic browsing can overcome some of the limitations of the
paper card catalogs. Electronic browsers can be much faster to use
(as a patron can sit at a terminal, perhaps in his own office,
rather than moving around the physical catalog). Browsers can
provide greater flexibility in manipulating the items. For example,
it is possible to search on both authors and years, and to present
the items in a variety of ways (e.g., sorted by publication
date).

Figure 1. The SortTables System

Figure 2. Restricted to ‘IBM J’

SortTables: A System for Browsing

SortTables is an interface metaphor developed to support
browsing. Information retrieval systems have often lacked a
browser, thus forgoing the benefits of this type of interaction, or
they have provided a simple electronic realization of the card
catalog, making little use of the manipulative capabilities of the
computer.

SortTables is not in any sense a complete solution to the
problems of digital libraries. It only addresses the problem of
presenting and manipulating a set of items; as presented here, its
search facilities are limited. It doesn’t address issues of how
documents are stored or retrieved, protocols, security, or the host
of other issues that must be addressed by any full system.
In SortTables, all items are presented in a table, where each
row corresponds to a record, and each column to an attribute of
that record. At any point in time, one of the columns is marked as
the sort key, and all rows are sorted according to it.

There are four essential functions that can be performed on a
SortTables table: movement through the table, sorting the items
according to one of the attributes, searching for a particular
attribute value, and restricting the items according to a range of
values of some attribute. The view is updated after each keystroke.
The up- and down-arrows and page keys will move by line or page,
the left- and right-arrows change the sort key column, alphanumeric
characters search incrementally (as each character is typed), and a
single keystroke can delete items not in a desired range.

Figure 1 shows the system running on
Envision’s data. (Envision is a database of computer science
literature under construction at Virginia Tech [FOX93b] [HEAT95].) Note that
the current line and the sort column are highlighted; items are
sorted according to values in that highlighted column. (The
highlighted column appears bold or dark gray on the screen; the ‘=’
divider under the column title marks it as well.) If there were an
active search string, it would appear after the text ‘Find:’. There
is a progress indicator showing approximately where the current
line is relative to the list of items (here, 98% through a list of
approximately 100K items).

With a few keystrokes, we could restrict our browsing to sets of
items such as: “Journal articles only, authored by Knuth, sorted by
year.”

Example

Suppose we have the columns Title, Journal, Year, Type, Author,
and Terms. To browse items for which “Journal is ‘IBM Journal of
Research and Development’ and Year >= 1980, sorted by Author,”
we could proceed as follows. Initially, title is the first column.
Press the right arrow key to sort by journal name, then type ‘ibm
j’ to move to the entry for that journal. Typing ‘=’ will restrict
the table to those items that match the search string, with results
shown in Figure 2.

Next we can move to the year column, and type ‘198<‘ to delete
items before 1980, with results shown in Figure 3.

Figure 3. Journal = ‘ibm j’ and Year >= 1980

Pressing the right-arrow key twice will sort the remaining items
by author, with results shown in Figure 4.

Notice that we are seeing the same item (the one authored by Adams)
in its neighborhood according to the new sort column.

Figure 4. Journal = ‘ibm j’ and Year >= 1980; sorted by Author

So, with about a dozen keystrokes, we have made a moderately
complex restriction, and can browse a useful subset of our
data.

Design Goals and Early Implementation

The SortTables system was designed with a number of goals in
mind:

  • Simple metaphor with a high degree of interactivity
  • Use of ranges and multiple orderings
  • Ability to view the whole database
  • Integrated browsing and searching
  • Progressive utility
  • Engender a sense of progress

The central metaphor is an automatically sorting table that can
have parts deleted based on attribute values. The idea of reacting
to each keystroke has been present from early on: rather than a
command being typed, or a form being filled in while the system
passively records keystrokes, this system is active while each key
is typed.

An electronic replica of a card catalog might allow for range
restriction on a single attribute. This metaphor extends that
capability with something real card catalogs can’t do: easily sort
themselves based on several attributes, and allow
restrictions based on them.

The ability to systematically view the whole database is a
characteristic of many browsers. SortTables supports that by
allowing one to page through all items based on any of the
attributes.

Many systems have distinct modes for searching and for browsing:
in search mode, a query retrieves a set of documents; in browse
mode, entities (or attribute values) are listed. Some systems, such
as the Computing Archive [ACM91], integrate both modes, allowing a user to browse an author list
while forming a query. The SortTables system is similar: the user
is always browsing, but can form queries while doing so.

The system allows for progressive levels of utility: movement
through the list, searching, sorting, and range restriction provide
increasing levels of sophistication.

Finally, the system tries to engender a sense of progress in its
use. This metaphor lets the user systematically restrict a
multi-dimensional space by dealing with one dimension at a time.
Actions either “look around” or they close in on a particular area
of the data.

The system has gone through several versions, with two different
concerns: exploring the interface, and developing the underlying
implementation. Here is how the successive versions have addressed
these issues:

  • Spreadsheet-based prototype. The first version was built on a
    spreadsheet, by adding buttons that would sort according to the
    various columns, and using native spreadsheet facilities for
    searching and scrolling.
  • Graphical user interface on a NeXT. This version was a
    prototype designed using the NeXT Interface Builder, and was used
    to explore implementation techniques in a graphical
    environment.
  • VT100 interface – initial version. The first VT100-based
    version was developed to provide a stable interface portable across
    a variety of platforms. This version has been used as a testbed for
    formative evaluation of the interface.
  • VT100 interface – current version. The current version has
    incorporated a number of improvements suggested from formative
    evaluation, and has been used as a testbed for data structures
    capable of supporting large data sets.

Formative Evaluation

A formative evaluation of the first VT100-based version was
performed, to improve the system before over-committing to a
particular interface design. The evaluation took two forms: an
interface designer (a Ph.D. student in the area of human-computer
interaction) critiqued it, and two students served as users in a
usability test. At the time, the system performed reasonably well
with up to about five thousand items. (The worst case for sorting
was less than ten seconds; other operations were faster). The
results of this evaluation were used in developing the current
version.

The interface critic suggested adding redundant visual coding to
indicate the current sort column and a progress indicator showing
about how many items were left. He helped tune some of the
assignments of functions to keys. Finally, he encouraged future
development of a graphical version.

The students in the usability test worked with three different
data sets: directory information resulting from the Unix ls
-l
listing command (25 items in 7 columns), data extracted
from the CIA world fact database (250 items in 6 columns), and a
collection of library catalog records (about 5000 items in 5
columns). These were selected to test the system on a variety of
types of data sets.

The core of the interface was retained, but several changes were
made to ease interaction. Page-up and -down keys were added, to
improve navigation. Two functions were removed: undo, and the
deletion of items not equal to the pattern. The training material
was modified to put increased emphasis on the notion of the current
sort column. Finally, sorting performance was identified as a
potential bottleneck.

Currently, the system supports on the order of a million
records, so the interface evaluation should be revisited with this
larger data set.

Data Structure

To support the interface, a new data structure is being
developed: the thread file. Its key idea is to use bounds
information for subsets of the data to avoid searching many
records.

  Figure 5. Bounds information tree for the first attribute

The Basic Idea

The bounds of a set of records is a mapping from each
attribute to its minimum and maximum values in that set. Suppose we
have three records A, B, and C, with the values <3,4,5>,
<0,3,4>, and <2,2,2> respectively. The bounds are:
<(0,3), (2,4), (2,5)>. This says that the first attribute
ranges from 0 to 3, the second from 2 to 4, and the third from 2 to
5.

There is a complete binary tree of bounds information for each
attribute. Figure 5 shows an example of one
such tree (for the first attribute). For simplicity, the tree shows
a data set with only two attributes. Nodes are labeled with the
bounds information for each attribute, in the form:
<(low_1,high_1), (low_2,high_2)>. Figure
5
graphically shows how each parent’s box is the bounding box
of its two children. Notice also how successive levels of the tree
split the box based on the first attribute.

There is a “known bad” bit associated with each node of the
tree, set whenever the (sub-)tree is known not to contain any
possibly useful children.

To evaluate a query, the tree for the desired ordering is used.
A tree walk compares the nodes to the query:

  • If the node is marked “known bad,” return failure.
  • If the node is a record (i.e., a leaf) and its bounds are
    valid, return that record.
  • If the bounds don’t intersect the query, mark the node “known
    bad” and return failure.
  • Otherwise, walk both children. If upon returning, both children
    are marked “known bad”, then mark this node “known bad” before
    returning failure.

The “known bad” bit is set along the way, so a node will not be
evaluated again if it has already failed. The nature of the
interface ensures that succeeding queries will be tighter than
their predecessors, so bad nodes won’t suddenly become useful
again.

For example, suppose we use the tree to locate items according
to the range restriction <(4,8), (7,8)>. The top node
intersects that range (of course), so we continue down the tree.
Moving down, the node labeled <(1,4),(1,6)> does not
intersect the query range, so we mark it “known bad” and ignore
its children. On the other side of the tree, <(5,6),(2,8)>
intersects the range, as do both its children. We will examine the
leaf nodes: <(5,5),(8,8)> is good, <(6,6),(2,2)> is
not, <(7,7),(7,7)> is good, and <(8,8),(3,3)> is not.
The good nodes are returned as valid records, and the bad ones are
marked “known bad.”

Bytes  Location  Type of data
8K2N/B memory Slice of bounds tree
KN/(8B) memory Bound bits (“known bad”)
4KN disk Leaf layer of bounds tree
KN/8 memory  Record bits (“known bad”)
4KN disk  Buckets 
Table 1. Space required by the thread file. K is the number of columns, N the number of records,  B is the number of records per bucket. Each index is assumed to take 4 bytes.

Data Set Envision Library
Records (N) 100,000 1,000,000
Columns (K) 6 5
Records/Bucket (B) 16  64
Record space (disk) 22M  167M 
Thread file space (disk) 7M  47M 
% Thread file space (disk) 7.2M 47.1M
Memory required 2M  4M 
Table 2. Actual space overhead for thread file (rounded to the nearest megabyte).  “Memory required” does not include cache space for information on disk.

Refined Data Structure

As described, this structure makes great demands on main memory.
To reduce these demands, the “known bad” bits are kept in memory,
while the rest of the tree is demand-paged from disk. To reduce
space, we want to avoid storing layers of the tree which provide us
little information. For example, upper layers of the bounds tree
aren’t useful: they tend to enclose almost all of the data.

For each attribute, we keep a list of the records, in the order
they appear when sorted by that attribute. For the records A, B,
and C above, the lists are: <B,C,A> (when sorted by the first
attribute), <C,B,A> (the second), and <A,B,C> (the
third). When there are no restrictions outside the current sort
order, this enables straightforward display of the records in
order.

These lists are divided into buckets, whose size is
chosen to be convenient relative to the disk block size.
The current implementation maintains only three layers of the
tree: the level for which bounds information corresponds to a disk
block (64 records on our system), the level for buckets of B
records, and the bounds information for individual records. The
lower layers of the tree provide very fine-grained information, but
they are so numerous that they are expensive to track. (The
Envision data set has about 100K items; this means 100K position
records, plus 10K bounds records per attribute. If the full tree
were stored, it would require an additional 90K bounds records for
each attribute.)

Table 1 shows the space requirements of this
leaner data structure. Items whose location is “memory” reside
permanently in main memory; “disk” items are loaded on demand and
cached. Table 2 shows these requirements for
the two large data sets we have loaded.

Limitations

This data structure is undergoing refinement and tuning. We plan
to add a small auxiliary index in the style of Glimpse [MANB93] to support regular expression queries.
The thread file data structure does not have uniform performance
for all queries. When there are very few records left, there can
come a point where it spends more time looking at all the records
that don’t match the query than at finding those that do. For the
Envision data set, this was not a problem, but for the library data
set the delay became noticeable when there were fewer than about
5000 records. This problem is solved by switching representations
when result sets become small.

Current Status

The system has been loaded with the two large data sets
described in Table 2 above. The Envision data
set consists of bibliographic data extracted from the Envision data
base. The library data set consists of information extracted from
the university’s online card catalog.

We have not formally assessed performance, but the following
figures should give an idea of the interaction, for the system
running on a DEC Alpha workstation with either database. Moving by
line or page feels instantaneous (the system appears to spend all
its time updating the screen). Changing the sort column happens
almost as quickly. Searching by typing characters takes up to about
two or three seconds. Finally, restricting items by range takes
about five to fifteen seconds. These times should improve with
further algorithm development and performance tuning, but they
suffice to give the system a highly interactive feel.

Related Work

Chang and Rice [CHAN93] survey browsing
from a variety of perspectives. They distinguish browsing from
other types of information seeking, and develop a taxonomy of
dimensions along which to analyze browsing: contextual, behavioral,
motivational, cognitive, and resource-based. The work reported here
fits best with what they call the library, information science, and
information retrieval traditions: browsing for unplanned discovery
and as a problem-solving technique.

Thompson and Croft [THOM89] address the
benefits and requirements of browsing, and point out that browsers
need a rich set of links, a flexible user interface, and search
capabilities. They have a graphical system in which document and
concept neighborhoods can be browsed. Crouch et al. [CROU89] discuss a similar system focused on browsing clusters of documents.

Development of the SortTables interface took place in the
context of the MARIAN [FOX93a] and Envision
[FOX93b] systems at Virginia Tech. MARIAN
uses a vector model, and provides simple lists of ranked results,
but lacks a browsing capability. Envision uses MARIAN’s search
engine, but adds an unusual tool for viewing results: a “bubble
view” of icons, laid out in a flexible two-dimensional
arrangement. Envision is limited in browsing abilities: although
the display is flexible, it only supports viewing at most a few
hundred items from a previously made query. The SortTables
interface simplifies Envision’s results display to a textual list,
but provides flexible ordering, range restriction, and direct
access to all the data.

The data structure to support the interface has two aspects:
orthogonal range query, and sorting. (“Orthogonal range query”
refers to queries of the form “For all i: min_i <= val_i <=
max_i” as required by the interface.) Many researchers have
proposed data structures to support such queries, e.g., k-d trees
[BENT75] [WILL85],
filtering search [CHAZ86], and grid files
[NIEV84]. Unfortunately, none of these
address the requirement for sorting. The grid file uses a large
multi-dimensional array to store buckets of data, indexed by a set
of one-dimensional scales. The others are tree-based. Our data
structure was directly inspired by the grid file, although it
resembles the others in using trees.

Future Directions

Several aspects of this system are targeted for work over the
next months:

  • Improved algorithms and data structures. The system has
    reasonable performance on a DEC Alpha workstation, but the
    performance could be improved. It would be nice to have an
    algorithm that supported incremental additions to the database, and
    that allowed multiple simultaneous users in an efficient way.
    Several users could share trees, but they would need separate
    “bad” bits and other information.
  • Other application areas. SortTables is well-suited for data
    sets that have a moderate number of interesting attributes. In
    addition to the bibliographic, geographic, and directory data sets
    mentioned above, we have explored using it with data from e-mail
    and calendar applications. These data sets share the characteristic
    that rapid sorting and range restriction based on attribute values
    provide useful manipulations.
  • Networked interface. The current system is monolithic. It would
    be useful to split it into a networked client-server
    implementation, to allow broader use and to facilitate exploration
    of the thread file data structure with multiple concurrent
    users.
  • Graphical interface. While the early prototypes used a
    graphical interface, later versions have evolved away from that.
    The graphical interface would have radio buttons as attribute
    labels (to choose the sort column) and a scroll bar for navigation.
    It would retain the single-window design, keeping the search field
    on the window with the browsing list. A Tcl/Tk version [OUST94] is currently being designed.
  • Evaluation. To assess the utility of this approach, it should
    be evaluated in real use. With the library card catalog data
    loaded, we hope this will make the system potentially useful to a
    broad class of users, so it can be effectively evaluated.

Acknowledgements

This work was supported in part by the National Science
Foundation through CISE Institutional Infrastructure (Education)
Grant CDA-9312611. It has benefited from the criticism and
discussion of several colleagues, especially Dr. Kevin Mayo of
SAIC.

Bibliography

[ACM91] ACM. Computing Archive:
Bibliography and Reviews from ACM
. New York, NY: ACM Press,
1991.

[BENT75] J. L. Bentley. “Multidimensional
Binary Search Trees used for Associative Searching,”
Communications of the ACM, 18(9), 1975, pp. 509-517.
[CHAN93] Shan-Ju Chang and Ronald E. Rice.
“Browsing: A Multidimensional Framework,” in Martha E. Williams,
ed.: Annual Review of Information Science and Technology
(ARIST)
, Volume 28, 1993.
[CHAZ86] B. Chazelle. “Filtering Search: A
New Approach to Query Answering,” SIAM Journal on
Computing
, 15(3), 1986, pp. 703-724.
[CROU89] Donald B. Crouch, Carolyn J.
Crouch, and Glenn Andreas. “The Use of Cluster Hierarchies in
Hypertext Information Retrieval,” in Hypertext ’89
Proceedings
, Pittsburgh, PA, ACM, 1989, pp. 225-237.

[FOX93a] Edward A. Fox, Robert K. France,
Eskinder Sahle, Amjad Daoud, and Ben E. Cline. “Development of a
Modern OPAC: From REVTOLC to MARIAN,” SIGIR’93: Proceedings of
the Sixteenth Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval
, Pittsburgh, PA, ACM,
1993, pp. 248-259.
[FOX93b] Edward A. Fox, Deborah Hix, Lucy
T. Nowell, Dennis J. Brueni, William C. Wake, Lenwood S. Heath, and
Durgesh Rao. Users, User Interfaces, and Objects: Envision, a
Digital Library,” Journal of the American Society for
Information Science
, 44(8), 1993, pp. 480-491.
[HEAT95] L. Heath, D. Hix, L. Nowell, W.
Wake, G. Averboch, and E. Fox.  “Envision: A User-Centered
Database from the Computer Science Literature,” Communications
of the ACM
,  38(4), Apr. 1995, pp. 52-53.

[MANB93] U. Manber and S. Wu. “GLIMPSE: A
Tool to Search Through Entire File Systems,” Technical Report No.
TR 93-34, University of Arizona Department of Computer Science,
1993.

[NIEV84] J. Nievergelt, H. Hinterberger, and
K. C. Sevcik. “The Grid File: An Adaptive, Symmetric, Multi-Key
File Structure,” ACM Transactions on Database Systems,
9(1), 1984, pp. 38-71.
[OUST94] John Ousterhout. Tcl and the Tk
Toolkit
, Reading, MA: Addison-Wesley, 1994.
[THOM89] R. H. Thompson and W. B. Croft.
“Support for browsing in an intelligent text retrieval system,”
International Journal of Man-Machine Studies, Volume 30,
1989, pp. 639-668.

[WILL85] D. E. Willard “New Data Structures
for Orthogonal Range Queries,” SIAM Journal on Computing,
14(1), 1985, pp. 232-253.