Categories
category theory dissertation

Dissertation – Ch 3, A View of Documents

Chapter 3 – A View of Documents

Draft part of dissertation of William C. Wake

Last updated: March 26, 1996

Documents as Categories

What does it mean to move this document to that
format? This is the key question we’ll address through the
framework of category theory. Category theory is a branch of
abstract mathematics dealing with the algebra of functions. It is
concerned with objects and the mappings between them, which
dovetails nicely with the problem of translating documents.

Category Theory: Background

Category theory addresses fundamental structures. It generalizes
several familiar mathematical constructs: sets, graphs, groups, and
others.
Pierce [] defines a category:

1.1.1 Definition. A category C comprises:

1. a collection of objects;

2. a collection of arrows (often called
morphisms);

3. operations assigning to each arrow f an object dom
f
, its domain, and an object cod f, its

codomain (we write f:A->B or A-f->B to show
that dom f = A and
cod f = B; the collection of all arrows with
domain A and codomain B is written C(A,B));

4. a composition operator assigning to each pair of arrows f
and g, with cod f = dom g, a composite arrow
gof: cod f -> cod g, satisfying
the following associative law:

for any arrows f:A->B, g:B->C, and

h:C->D (with A, B, C, and D not necessarily
distinct),

h o (g o f) = (h o g) o f

5. for each object A, an identity arrow
idA:A->A satisfying the following identity
law:

for any arrow f:A->B, idB o f =
f
and
f o idA = f.

Let’s look at a few examples. The simplest category is 0.
It has no objects and no arrows, so it satisfies conditions 3
through 5 vacuously. The category 1 has only one object. By
condition 5, we must have at least an identity arrow for it. Having
this one arrow trivially satisfies the other conditions. This
category looks like this:

           __
         v  
         o   |
          _/

The category 2 has two objects and three arrows, and looks
like this:

             __             __
            /  v           v  
            |  o --------->o   |
             _/            _/

We’ll see this one in a later example.
Two other important categories are Set and Graph.
“The category Set has sets as objects and total functions
between sets as arrows.” [Pierce, p. 2] “The category Graph
has directed multi-graphs as its objects. An arrow f:G->H

in Graph is a structure-preserving map between graphs.”
[Pierce, p. 51]
The next important notion is that of a functor. Again,
from Pierce [p. 36]:

2.1.1 Definition. Let C and D be
categories. A functor F:C->D is a map taking each

C-object A to a D-object F(A) and each
C-arrow f:A->B to a D-arrow

F(f):F(A)->F(B), such that for all C-objects A and
composable C-arrows f and g:

1. F(idA) = idF(A)

2. F(g o f) = F(g) o F(f)

Just as an arrow in a category maps between objects, functors
map between categories. The important thing is that functors
preserve structure (in an abstract sense): not only are objects
mapped, but arrows as well. As we will see, analyzing how arrows
are mapped is a powerful tool.
[[Further discussion TBD. Introduce commuting diagrams. Talk
briefly about several concepts: isomorphism, natural
transformations, limits, adjoints [canonical constructions],
colimits, and comma categories.]] [[Not clear these are really
needed.]]
Approach:

  • Embed category
  • Develop/define/identify functors
  • Transformations: One-way, invertible

Category Theory

A category is a collection of several things: objects, arrows, and
composition operators. Many familiar mathematical objects have a
category associated with them.
[Examples TBD]
The fundamental approach is to define a pair of categories, and
then define a mapping between them, known as a functor. A functor
maps both objects and arrows in a way that preserves identity and
associativity.
For example, consider a stack as one category, and integers as
another. Glossing over the details, each can be modeled as a
category. We will define a functor that maps each stack to its
size.
In a non-categorical approach, this might be a sufficient
description. But note that it only maps the objects, not the
arrows. For our stack category, we will consider the arrow push as
operating on stacks. The definition of functor requires that we
identify a mapping for arrows as well. So, if we push onto a stack,
its size increases by 1, so the corresponding arrow for integers is
incr.
What does this give us that a simple depth function does not?
One nice thing is the ability to do diagram chasing. Consider a
mapping:

        S    push    S'
        o ---------> o
        |            |
down    |            |  down
        V            V
        o ---------> o
        I    incr    I'

This diagram asserts that down(push) = incr(down). It tells us that
it doesn’t matter whether we push and then translate down, or
translate first and then use the increment arrow.
The S, S, I, and I’ are arbitrary values. We can use this fact
to build up complex diagrams by pasting two diagrams together on a
common edge:

        S    push    S'  push      S''
        o ---------> o ----------> o
        |            |             |
down    |            |  down       | down
        V            V             V
        o ---------> o ----------> o
        I    incr    I'   incr     I''

This shows us that

down o push o push = inc o inc o down

The proof is embedded in the diagram, so we are spared the tedium
of an algebraic proof.
Side Note: Categories are object oriented. Note how OO
defines objects as both data and functions, and CT as mapping of
data plus mapping of functions.

Fables

To show the application of the category theory notions, we’ll use a
few stories as springboards for discussion.

Fable 1. The Land Before Printers

In a faraway land, where computers were invented
before printers, a printer merchant came to town. The merchant drew
a picture like this:

     Online  o       o  Paper
              ______^
               print

and said, “This device converts your files into an image on
a piece of paper. Your work can be moved from the world of
electronics to the world of paper.” Everybody thought this was
great, and ordered one. It’s too bad those early printers had a
tendency to make mistakes.

The next time the merchant was in town, he heard about
it: “Sometimes it prints `The quick brown fox’ and others `The
quick jrown fox’. And what about that picture you
drew?”

The merchant replied, “That diagram just shows a document
moving from online to paper, and the printer does that. It’s not
100% accurate, but nothing’s perfect.” So, everybody put down money
in anticipation of the next software upgrade.

Moral: Domains matter.

We want to consider the category implied by the diagram. To the
merchant, there are only two objects, Online and Paper. To the
townspeople, there is an object corresponding to each document. A
simple attempt to identify the category reveals this
misunderstanding. Category theory is “typed”; it requires explicit
identification of the domains.
A similar analysis applies whenever crossing domains. Does a
translator from C++ to C promise to produce merely some C
program for a C++ program, or does it promise to produce a
particular C program for a given C++ program? The diagram
looks the same, but the analysis level can matter greatly.

Fable 2. Down Below the Ocean

A wizard came to the capital and bragged, “I’m
the last survivor of Atlantis; nothing is left except me and my
pen. As I’m the only person left who can read or write Atlantan,
and it is everywhere admitted to be the best language for
expressing laws, let me translate your laws into that glorious
tongue.”

The ruler thought this was worth a few million silver
coins, and authorized the translation. After several months, and at
great expense, the wizard returned with several volumes in a
strange script. The court was thrilled, paid its money, and sent
the wizard to the next kingdom.

A squirrel asked, “How do we know it’s really our laws
written in real Atlantan?”

Moral: Verification matters.

The ruler should have learned to read Atlantan (unless the
volumes were only meant as a coffee table decoration). This
situation structurally matches the previous one:

Laws1   o       o  Laws2
         ______^
         toAtlantan

The fable suggests the importance of having another arrow
toEnglish:

             toEnglish
              ______
             v      
     Laws1   o       o  Laws2
              ______^
             toAtlantan

Suppose the category objects are {Laws1, Laws2, Other}. We want
to require that

toEnglish o toAtlantan = idEnglish

that our translation is accurate. If there is no function to “close
the circle” then we have no means to verify the translation.
This is known as a commuting diagram. It is an assertion that
all paths for getting somewhere are equivalent. Diagrams can be
composed, and very complex paths established. In this simple case,
we realize that we have no way to verify the translation unless we
have the second arrow.
The same analysis applies to printers: if we add a “retype”
arrow, which means to re-enter a paper document back online, we
have the key notion that lets us see how the townspeople in the
first story can verify the problem: they see that the (obvious)
retype function is incapable of making the diagram commute. (If we
don’t trust the printer, we’d better have an online copy somewhere
to guide us in making corrections.)

Fable 3: Rarer Than a Pearl

After people had been printing for a while, they
started noticing a problem: sometimes they would have a paper
document for which they’d lost the online version, but they wanted
it back online. They had no choice but to re-enter the document
manually.

A new merchant came to town, an unprinter merchant. He
said, “Let me show you a picture,” and the townspeople groaned, but
he showed them anyway:

              unprint
              ______
             v      
      Online o       o Paper
             ______^
               print

He said, “I know the tales; I’d like to tell you my
unprinter works like this on all documents, but it doesn’t.” (He
was a very honest merchant.) “But, it is only 85% accurate.” (I
told you he was honest.) “Let me show you what it really
does”:

                      __________
                     v          
        D1  D2  D3  D4  D5  D6  |
           ^ ^   ^     ^   ^   /
          /  |   |    /   /   /
        P1  P2  P3  P4  P5   P6

“If you print D1 to P1, and then unprint it, you
unfortunately get D2 back. But, D2 and D3 work right, although I
admit D4, D5, and D6 get a bit scrambled.” He went on, “Everybody
needs one. And it’s already in Version 3.0, even though it just
came out.” The early adopters bought one, but most people decided
to wait for Version 3.1 (which would have all the bug
fixes).

Moral: Sometimes the hidden domains are what’s
interesting.

We have a discrepancy: the diagram

           unprint
            ______
           v      
    Online o      o Paper
           ______^
            print

applies at the highly abstract level (“is it online or not?”), but
does not apply at the level of individual documents (printing D1,
and unprinting the result, yields D2 rather than D1). But we want
to say more; after all, the unprinter is certainly doing something
useful.
When there is a leaky or lossy transformation, there is an
opportunity to find a “hidden” domain. This hidden domain may be
trivial, or it may not even be obvious, but it provides a point of
leverage. Here, the hidden domain might be an “abstract” version of
a document.

Unprinting a printed document may yield a different
document:

 D1 o           o D2
     \         ^
      \       /
 print \     /  unprint
        \   /
         v /
          o
         P1

but they are in the relation “D2 is what results when you
unprint what you get from printing D1.” We can consider a new
abstract document AD:

           AD
           o
          ^  ^
   up    /    \  up
        /      \
       /        \  
   D1 o          o D2
       \         ^
        \       /
  print  \     /  unprint
          \   /
           v /
            o
            P1

The up function takes a document to its abstract form. The diagram
asserts that

up(d) = up (unprint (print (d))).

But what are the abstract documents? We’ll treat the abstract
documents as a partition of the electronic documents: each
represents a set. We can construct them by taking the closure of
the print and unprint functions. Since unprint(print(D1)) = D2,
then D1 and D2 are in the same partition. In our example, the
abstract online documents are AD1 = {D1, D2}, AD2 = {D3}, AD3 =
{D4, D5, D6}. (We could build a similar structure for abstract
paper documents.)

[[Side note: This glosses over where arrow starts]]

So, up takes a document to the abstract document which
contains it. Since we have a partition, there is exactly one such
abstract document.

We can compare unprinters by the way they partition the
documents. One might be so bad it is essentially a “two-state”
version. Another might tend to arbitrarily substitute letters; it
has a partition for each possible length of document. A good one
might only confuse certain words (like “Smith”, “SmiTh”, and
“5mith”); it will have many partitions.
This example reveals a guideline: many apparent inverses are not
true inverses. When this happens, there may be an abstract domain
that can reconcile the differences.

Real Problems

SortTables

Development of the SortTables system provides an example of how
categorical notions can drive design decisions.
SortTables has a set of records, a current column (the sort
order), and a current row (indicated by a cursor bar). One of the
transformations available (arrows in the categorical framework) is
to change the sort order.

It is clear that the order of the records are changed by this
operation; but what about the current row? This was not addressed
by the original design, but came up during implementation. The
three possibilities were to leave it at the same position (rejected
as confusing), to put it at the first row in the new order, or to
locate the same record as the previous order and put it there.
The argument of trying to maintain as much information even
across the “functor” won the day for the latter choice. It’s
interesting that people learning to use the system did not at first
realize why the cursor remained on the record it did. They
understood that the order was different, but didn’t have a model to
tell them why the current line was where it was. After a short use,
they became accustomed to it, and agreed that it was the proper
behavior, as they often wanted to explore items in the neighborhood
of a given item when sorted by a different attribute.

Character Sets

Given a document in one character-set encoding, how do we display
it in another? The SortTables interface has this problem, as do web
gateways. (A source document might be in “ANSEL”, the target in
ISO-Latin-1.)

For example, the ANSEL encoding of US-MARC records encodes
diacritics as the diacritic mark (a value above 128) followed by
the letter. The ISO-Latin-1 encoding uses a single character for
the letter with the diacritic. Thus ANSEL “‘e” must map to
ISO-Latin-1 “é”. Unfortunately, not all ANSEL characters
have representations in ISO-Latin-1. (For example, the Polish
slashed-L is missing.) The question then becomes, “what to do about
missing characters?”

Two approaches are very common: either ignore the diacritic
(which is not acceptable in all languages), or introduce “special
translations” for the missing character (e.g., “*L” for the
slashed-L).

Looking at this problem in the category theory framework reveals
why both approaches must be regarded as hacks: it is not possible
to map completely back from ISO-Latin-1 to ANSEL, as we have lost
critical information. If the diacritic is ignored, we don’t know
whether to map “L”ISO-Latin-1 to “L”ANSEL or
“slashed-L”ANSEL. If the special translation is
introduced, we don’t know whether to map “*L”ISO-Latin-1

to “*L”ANSEL or “slashed-L”ANSEL. We might
decide to live with the second “solution”: perhaps “*L” never
appears in the source text. Or we can introduce special mapping
rules, perhaps involving control characters. (But this reduces
readability.)

What is the solution? In a sense, there is none. Our domains are
incompatible, and something has to give. A partial approach is to
map to a union character set. For example, the 16-bit
Unicode character set contains all the characters in both ANSEL and
ISO-Latin-1. By working in the common domain where possible, we
defer the problem.

Eventually, the piper must be paid, if we ever must move “back”
to ANSEL. If such back-translation is required, we will have to
define unambiguous representations for each Unicode
character.

But the example of the web gateway allows another way out. Some
web browsers “know” Unicode, in the sense that they support
character entities for Unicode characters. Thus, “Lukasiewicz”
might be encoded “ɕukasiewicz”. If we must deal with a
browser that doesn’t understand this much, we might use an even
more verbose solution: “<img src=”letter/Lslash.gif”
alt=”L/”>ukasiewicz”, wherein we display an image of the
required character.

We gain a lot from mapping to the Unicode domain, because it
solves other problems as well. Real systems don’t deal with just
two character sets: they have M input alphabets and N output
alphabets. By using a union character set, we only need M+N
mappings (M up, N down), rather than the M*N separate mappings that
would otherwise be required. (This is similar to the problem of
compiling M languages for N machines.)

Other Examples

(Needs expanding)

  • Structure of documents – bring on hierarchical documents
  • Variations and the Bible problem
  • Two printers, which is right?
  • Lessons: CT suggests importance of identifying domains.
  • New domains induce engineering decisions, i.e., economic
    tradeoffs.
  • Introducing correction:
              AD
               o
              ^  ^
       up    /    \  up
            /      \
           /  fix   \  
       D1 o<-------- o D2
           \         ^
            \       /
     print   \     /  unprint
              \   /
               v /
                o
    P1
  • Note that correction depends on original; objects become pairs
    (original, unprinted).
  • RightPages: image + text
               AD
                o
              ^  ^
       up    /    \  up
            /      \
           /        \  
    ascii o          o screen
           \         ^
            \       /
    unprint  \     /  display
              \   /
               v /
                o
                ^
                | scan
                o 
              paper
  • Views: some are orthogonal, some are dependent.
  • Translations, views, multiple views, leaky transformations.
    “Leaky transformations induce hidden views.” Lossy
    transformations.
  • Models: when can we go up and down?
  • How are multiple views related to each other?
  • The Myth of Conversion, the Myth of Perfection
  • Higher-level concepts in category theory: functors, adjoints,
    etc.

RightPages

The RightPages system uses optical character recognition and page
images to search for and present articles in an online library.
Page images are scanned in for display. The cover of the journal
is scanned in also. They also attempt to integrate the text of the
article. If the text is online, they will use that. If not, they do
optical character recognition to try to recover the text.

Clearly, OCR is not a consistent process. We already know that
it’s not clear what document is converged to if it’s cycled through
printing and OCRing (recall the unprinter example). However, they
take steps to avoid that problem: they try to ensure that each
document is scanned in only once.
Another problem is the inaccuracy of the OCR: they’re aware of
this too, as their interface encourages visual searching: by
displaying the covers of various journals, it encourages people to
look in places they already “know”.

[[Do a categorical treatment first, then discuss problems
identified in the system, their (RP) solution, and a CT
solution.]]

What are the categories? We have a scanned version of a
document, and a possible ASCII or OCR version. The ASCII version is
clearly a category:

objects = strings
        arrows = transforms on strings

The scanned image is too:

object = scanned image
        arrow = image edit

One approach:

OCR
        Image --------> ASCII

Another:

Image     ASCII
   OCR        / copy
        ASCII2

ie. convert image to ASCII.
Yet another:

       IMAGE X ASCII
             |
           IMAGE

(“forgetful functor”)
A document is thus two things: an image and some ASCII. They
certainly can’t be fully derived from each other. What does search
produce? It must produce both to enable future manipulation.

             OCR
            ______
           v      
     ASCII o       o IMAGE
            ______^
             print

(especially for OCR – what is recovered doesn’t include e.g., font
information).

Pen-Based Systems

Some ink …/__ meant to be a word. What can we say?
In the case of the Newton, it supposedly doesn’t keep ink form
around for later analysis. (That is, handwriting analysis is a
mode.) This may be a flaw.

Hierarchical Editing in a Linear World

Include other paper here

Multiple Hierarchy Document Model

TBD

Lessons

The stories and examples above suggest design rules that can be
useful to people working with multiple document formats.

Conclusions

TBD

Bibliography

  • Mayfield et al. Apply CT to IR.
  • Pierce. Good intro to CT.
  • Barr and Wells. More thorough treatment of CT.
  • O’Gorman et al. RightPages.
  • SortTables.
  • MARC ANSEL document.
  • Web browser document.

Superseded – OLD PARTS NOT RE-INTEGRATED

What is a View?

A view is a way of looking at things. To formalize the notion, we
will frame views in the language of category theory, a branch of
abstract mathematics dealing with the algrebra of functions. We
hope this will provide a more solid basis for the use of views, and
hope to use this basis to generate insight into the problem of
maintaining them.

Background and Direction

The programming language Smalltalk has developed views as part of a general framework for developing user interfaces. This framework is known as MVC, for Model-View-Controller. The Model is a single
entity, on which several Views and Controllers may depend. Views
are presentations of the model; Controllers allow interaction with
(and modification of) the model.

Smalltalk factors user interfaces this way to provide intellectual control of the process of software development. As Models tend to be more stable than Views, Views are made to depend on Models rather than the other way around.

In our categorical approach, we will be less concerned with
which is the Model and which the View, and more concerned with
translations between them.

Category Theory as Inspiration

A key idea in this research is that multiple views are important.
Category theory is a meta-model, providing a framework for moving
between views. We do not want to reject previous document models,
but rather to address them through a common framework.

This research originally set out to develop
yet-another-document-model: to more fully define what it means to
be a document with multiple hierarchies. Instead, it has evolved
toward a category-theory approach to dealing with documents.
Category theory is a branch of mathematics that attempts to provide insight into the question “What is the fundamental structure?” This approach requires identifying and manipulating the categories involved. Most document models have been ones for which an obvious category exists (e.g., sets and graphs).

The difficulty in using documents is often not working with them
in their intended domain, but rather moving them between domains. A category theory approach can provide insight into these
difficulties and perhaps suggest resolutions. Like [Mayfield etc.]
we adopt the notion and notation of categories, but don’t develop
new theorems in category theory. Rather, we use category theory
ideas to illuminate some dark corners in document models.
We will use some parables to demonstrate the utility of this
approach in small examples. Then we will apply it to several
extended examples, including ….