Getting a handle on exhibition catalogues: the Project CHIO DTD
      ===============================================================

Richard Light, CIMI Consultant

Abstract
========

This paper describes work being carried out by CIMI (the Consortium for
Interchange of Museum Information) on the analysis of exhibition
catalogues.  This is being undertaken as part of Project CHIO (Cultural
Heritage Information Online).  The project plans to use the SGML (Standard
Generalized Markup Language) standard to express the structure and content
of source materials, including exhibition catalogues.  The analysis that
was undertaken led to a particular view on how exhibition catalogues (and
by extension, any text-based museum information sources) could be marked
up to support retrieval of extracts relevant to a wide range of queries.
The process of analysis is described, and the resulting design decisions
outlined.  The paper concludes with an assessment of the possibilities
for information retrieval offered by this approach.


Background and history
======================

CIMI and Project CHIO
---------------------

CIMI (the Consortium for Interchange of Museum Information) is a consortium
of museums and museum bodies which have come together to further the cause
of pooling and exchanging museum information.  This is an area which has
been discussed both locally and internationally for many years, but with few
practical results.  CIMI aims to make progress at both a theoretical and a
practical level.

The theoretical (perhaps 'design' would be a better word) level involves
identifying existing standards which can help museums exchange data, and
in developing generalized information models that can be used for practical
interchange.  CIDOC (the Documentation Committee of the International Council
of Museums) has produced a Data Model which aims to encapsulate museum
information in a very generalized way, but this is of more value for
comparing different information standards than for actually interchanging
information.  CIMI's work will build on existing work such as the CIDOC Data
Model, and aims to provide a set of standards and procedures which is
actually implementable.

The practical level involves prototyping projects which can be used to apply
the standards and models developed so far.  This has two major benefits.
It provides a real-life test of the design framework, offering a major
learning experience for consortium members and potentially showing up any
problems at an early stage.  It also provides 'proof of concept' systems
which demonstrate the feasibility of the design approach.  These can
be used to educate and inform a wider audience within the museum community.

CIMI's current major project is Project CHIO (Cultural Heritage Information
Online).  This project aims to deliver at least 10,000 records of objects and
information about Folk Art as a searchable online resource.  This will
include the full text of exhibition catalogues, wall texts, etc. as well as
images and the more traditional museum database records.

CIMI and SGML
-------------

Some time ago CIMI decided to adopt SGML (Standard Generalized Markup
Language: ISO 8879) as an approved medium for the interchange of museum
information.  This standard offers a framework for encoding any document:
you can use any tags you like, but you must provide a description of the
tags you have used at the start of the document (as a 'DTD' - Document Type
Definition).

One input into CIMI's education on SGML was a pioneering DTD based on the
Art Information Task Force's Categories for the Description of Works of Art.
This DTD showed that SGML could be used to model even complex museum
concepts, but it also showed that a 'Data Dictionary' approach to design
meant that the DTD would not be appropriate for modelling free text sources
such as exhibition catalogues.

During 1994, CIMI set up an SGML Working Group, which brought together
consortium members (those with a particular interest in encoding issues)
and SGML consultants.  This group has pursued the issues of general SGML
dseign principles and worked towards the production of a 'CIMI DTD'.

A second attempt at a 'museum DTD' was made by the current author for CIMI
during the latter part of 1994.  This took as joint starting points the AITF
DTD (for the 'museum' concepts) and the Text Encoding Initiative (TEI)
Guidelines for Electronic Text Encoding and Interchange.  The TEI guidelines
and DTD were developed with Humanities texts in mind: thus they have some,
but not total, relevance to museum texts.  This second DTD was also
unsuccessful, perhaps because the attempt to combine two already complex
schemes led to an unwieldy and over-generalized DTD which lacked a clear
sense of purpose.

About this time (December 1994), the SGML Working Group met, and decided to
identify 'genres' of information that might have different information
requirements.  The idea was that formats such as exhibition catalogues,
brochures, and sale catalogues would contain distinctive information forms
that could be modelled in a DTD that would work for that 'genre'.  CIMI
could tackle each genre in turn, and so produce a comprehensive set of
'museum' DTDs.

An analysis meeting held last May in Washington pursued this idea, and looked
in detail at the content of exhibition catalogues and wall texts.  They were
found to have many of the standard features of published texts, with some
more museum-specific features (such as concise and formalized object
descriptions - 'tombstones') embedded within them.

The results of this meeting were considered in more detail at a design
principles meeting held in Berkeley in July.  The overall conclusion was
that - for exhibition catalogues and wall texts at least - the generic TEI
framework provided a good starting point for CIMI.  In particular, a cut-down
version of TEI called 'TEI Lite' was felt to offer a reasonable set of tags
without excessive overheads.  Rather than try for an 'all-singing all-dancing'
set of museum-specific tags, the Working Group agreed to overlay the TEI
framework with only those 'access point' tags required to support the access
needs of Project CHIO.  The CIDOC Data Model was consulted when designing
these access point tags.

Later this month [August 1995] CIMI members will attend a workshop in Halifax
to practise marking up real documents with the DTD that has been developed so
far.  This exercise will doubtless lead to further revisions of the DTD,
and will help to clarify how SGML should be applied to museum texts.
By Autumn 1995 CIMI intends to have a Project CHIO demonstrator system that
shows these texts in retrievable form alongside collections database records.
This will be a further test of the design principles behind the CIMI DTD.


Analysis of exhibition catalogues
=================================

During the analysis meeting in Washington, the SGML Working Group was
presented with a wide variety of document types: collections database
records, press cuttings, exhibition catalogues, etc.  It decided to
concentrate on exhibition catalogues, since (together with collections
database records) they represented the bulk of the material available for
Project CHIO, and were a good example of the sort of textual material
we hoped to encode.  Exhibition wall texts were felt to be sufficiently
similar in terms of their actual text, so they were included as well.

It was agreed that database records had different requirements, and would
need a separate DTD.  (The challenge there is not so much designing a DTD
that reflects Project CHIO's information needs, as in mapping a wide
variety of database formats - on the fly - into a single SGML structure.
But that's a challenge for another day!)

Although the Group did spend time looking at the actual content found in the
sample documents, much of the meeting was actually used to discuss broader
needs and design issues.

DTD design issues
-----------------

One 'big issue' was the approach to adopt to DTD design: should we adopt
a simple DTD that represents 'chunks' of text, and allows us to mark up
access points within them, or should we go for something more detailed
that allows things like lists, quotations, footnotes etc. to be marked
up?  After looking at examples of the types of textual feature found in
the sample exhibition catalogues, the group came to the firm conclusion
that it was important to be able to mark up all the significant features
of the source document.  A simplistic 'HTML-like' DTD would not be
acceptable.  This means that it will be possible to display the catalogue
(or selected parts of it) so that it looks reasonably like the original
source.  (The group did not feel that it needed to match the printed source
exactly, e.g. by replicating the actual fonts used.)

Should we invent a DTD from scratch, or adopt (and adapt) an existing one?
There was no enthusiasm for the idea of inventing a brand-new DTD: many
of the features found in the sample documents were common to any printed
book.  There are already a number of DTDs around which deal with books,
and so could be used as a starting point.  The Text Encoding Initiative
(TEI) framework for prose was found to support many of the features we
noticed, and so was felt to be a strong candidate.

information resource or document archive?
-----------------------------------------

Another issue was the role of the SGML-encoded catalogue.  Was it to be
an information resource or a document archive?  If it is primarily a
resource then it is acceptable to provide relevant fragments; if an archive
then the whole document must be encoded and made available.  The group
went for the 'information resource' interpretation.  The project goal, in
principle, is to include the whole catalogue, but if logistics prevent this
we won't be too worried.

This means that a mechanism is needed to ensure that any fragments are
provided with sufficient context to identify their source (i.e. the
document they come from, the section/chapter within that document, and
any credit line required by the document's copyright holder).

access points
-------------

We looked at the issue of 'access points', i.e. items of information that
are of particular interest to people who might search the Project CHIO
database.  Project CHIO aims to provide information to the average paying
museum visitor, so an over-academic approach has to avoided.  Existing
museum schemes such as the Art Information Task Force (AITF) Categories
for the Description of Works of Art and the CIDOC Data Model give us an
embarrassment of possible access points.  The problem is deciding which
to concentrate on for Project CHIO.  Analysis of visitor questions by
Kody Janney and Jane Sledge provided us with some candidates, compatible
with the CIDOC Data Model, such as Awards, Events, Marks, Materials,
Objects, People and Time-Span.  The sample questions they listed showed
that these simple ingredients could (all too!) easily be combined into
complex queries, e.g.:

     "What inscriptions are there on coins from the reign of Hadrian?"

Having identified that your enquirers want to know about (say) Materials,
there are two rather different cases to consider.  The first is where a
material is mentioned in the running text, and you want to identify it
as an interesting item of information that could be an access point:

     "... Possibly they [hooked mats] were made here [Nova Scotia] before
     1840, but the craft did not burgeon until after mid-century.  Before
     this, <material>linen</material> was used as a ground fabric, but in
     the 1850s, <material>burlap</material> made of <material>jute</material>
     fibres from India was introduced. ..." (Spirit of Nova Scotia, p.33)

This fragment contains mentions of three materials, one of which (jute) is
used to make another (burlap).  All three have been marked up with an SGML
<material> tag.  If you search for 'material = linen', you would expect to
be shown the point in the text where this term occurs.

The second use of access points is in describing the context, or topic, of
a 'chunk' of text.  In the above example, the preceding text has established
that we are talking about hooked mats (this extract is taken from a
'Mat-Hooking' section within the introduction to the Textiles part of the
catalogue) and Nova Scotia (the topic of the whole catalogue, but also
mentioned explicitly in the previous sentence).  Thus we want, in some way,
to be able to say "this section/paragraph/random chunk is 'about' hooked
mats".  This then allows us to answer questions of the type:

     "Tell me about hooked mats from Nova Scotia"

In this case, the enquirer would expect to be directed to the complete
'Mat-Hooking' section, not to a particular point within it.

links
-----

Our sample documents contain the usual types of intra-document links: from
a point in the text to a footnote (at the bottom of the page or the end of
the section), an illustration or a bibliographic citation; from one section
to another, etc.  TEI provides a generalized linking mechanism which deals
comfortably with all these cases.  We assume that if we use this mechanism,
it will be possible to have these links implemented as 'hot spots' which can
be clicked on while browsing the text.

In addition we noted the need to point outside the document, either to a
specific point within another catalogue, or to 'further information' about
a person etc. mentioned in the text.  (It may be, for example, that another
catalogue has a biographical entry about the artist of the work being
described.)  The first case is also dealt with by TEI, but the second can be
rather trickier.  If you are pointing to a specific place where the biography
is to be found, there is no problem, but in principle we want to have our
markup say "link to anything you might have on this artist - no I don't know
whether there is anything, or where it is stored".  We have discussed (in
earlier meetings) the idea of having a unique 'handle' for each access point
in order to support this approach.

what's different about exhibition catalogues?
---------------------------------------------

We started the meeting with the hypothesis that different types of museum
information resource, such as exhibition catalogues, might exhibit unique
features which distinguished them from non-museum documents and from each
other.  If this is the case, they would require, or at least benefit from,
separate DTDs reflecting their different organisation.

At the 'macro' level we found that, while the organisation of individual
catalogues varied widely, they all fell within the bounds of what the TEI
defines as a standard book structure.  In other words, they all have front
matter, body text and back matter, and each of these parts is organised
into 'divisions' (sections, chapters, etc.).  Within divisions you get
standard textual features (headings, lists, footnotes, citations,
illustrations, etc.).  So, in general, we found no immediately obvious
characteristics which would distinguish exhibition catalogues from other
textual sources.  It's true that they have a higher than usual number of
'floating' elements (illustrations and text blocks with no obvious point
of attachment within the main text), but this is just a matter of degree:
TEI has a mechanism for dealing with them.  Similarly, it is very important
to Project CHIO that the language of each piece of text is identified
clearly, but TEI also provides support to do this.

The group agreed that TEI offered a potential basis for our DTD.  There
is a cut-down version of the TEI prose DTD, called 'TEI Lite', and this
was recommended as a starting point for customization.

'tombstones'
------------

One feature which we did find is that most exhibition catalogues contain
pithy object descriptions (called 'tombstones' within the group) which
consist of a formalized set of key terms, optionally followed by a less
structured free text description, e.g.:

     2 Yarn Sewn and Hooked Mat with House
     Merigomish, Pictou County
     Wool on linen base
     Artist unknown (probably by member of MacKay family)
     Circa 1855-1865
     Dimensions: 163.0 x 82.0 cm (64 3/16 x 32 5/16 inches)
     Collection of the Nova Scotia Museum, Halifax, Nova Scotia

     This is one of the earliest mats known from Nova Scotia which is both
     hooked and yarn-sewn with home dyed and homespun wool.  The
     house which is depicted was built in 1852, and may be the home of the
     donor in which the mat was found, but which has been greatly altered
     through the years.

     Structure: Hooked rug all home-dyed, homespun wool yarn - some
     parts hooked, others "yarn-sewn" with a needle.  Clipped pile, very thick
     in the center, thinning towards the edges.  Handwoven linen backing
     with top and bottom selvedges.  Hole lower right corner - patched hole
     upper center edge - pile on lower edge worn away.

     Refer to colour plate V, page 14

     (Spirit of Nova Scotia, p.35)

The information given in the formalized first part of the 'tombstone' tended
to be consistent within individual exhibition catalogues (or at least within
sections of catalogues), but there was no consistency between different
catalogues as to what information is provided, or what order it occurs in.
The topic of the free text part of the 'tombstone' tends to vary even within
a single source.

We found examples of 'tombstones' for people, organizations and (museum)
events as well as for objects.


The CHIO DTD: design principles
===============================

Following the analysis meeting, we looked in more detail at issues (like
language representation) which matter to CIMI in general, and to Project
CHIO in particular.  Our overall conclusion was that TEI Lite was a good
starting point, offering generalized ways of dealing with most of the
issues identified.

TEI compatibility
-----------------

We agreed to start, not from the TEI Lite DTD itself (which is fixed), but
from the modifications file used to generate TEI Lite.  We then modified
the modifications (!) to suit our own needs, removing tags that are not
required and adding back in 'standard' TEI tags that we wanted.  Finally,
we added a set of additional tags to express the access points required for
Project CHIO.

All of this has been done in the manner prescribed by the TEI Guidelines,
so hopefully the result will be a TEI-conformant application.  This has
two positive effects: it means that we should be able to use TEI-aware
software directly on these texts, and it means that our exhibition
catalogues are marked up in a manner which makes them useful to the TEI
(i.e. academic) community worldwide.

access points
-------------

As noted above, we have developed a set of tags corresponding to the
Project CHIO access points that are not already supported by TEI.  (Some,
such as person, organization and place names, were already included.)  These
are to be used for mentions of the topic within running text, as described
above:

     "... Possibly they [hooked mats] were made here [Nova Scotia] before
     1840, but the craft did not burgeon until after mid-century.  Before
     this, <material>linen</material> was used as a ground fabric, but in
     the 1850s, <material>burlap</material> made of <material>jute</material>
     fibres from India was introduced. ..." (Spirit of Nova Scotia, p.33)

Some items of information originally defined as access points could also, or
instead, be considered as 'attributes' of other access points.  For example,
a Role is usually played by a person or organisation.  To deal with these
cases we have added suitable attributes, for example a ROLE attribute for
all 'name'-type elements.

normalization
-------------

When these access points are found within running text, the way they are
expressed will often not be useful for searching.  An obvious example is
a person's name.  At the very least, names are hardly ever expressed in
inverted form (i.e. last name first), yet this is the form which is
routinely used for searching e.g. telephone directories.  Beyond that,
the actual form of name given may vary (forenames given in full, as
initials only, or omitted), even within a single source.  One possible
approach is to analyse the components of names, but this adds extra
marking-up work, and is no help when the information is not there to
mark up!

The approach we have adopted (already supported by TEI) is to have the
option of including a normalized form of the entry as a REG ('regularized')
attribute, e.g.:

     <persName REG="Light, Richard B.">Richard Light</persName>

This regularized form should be compatible with the TOPIC attribute described
below.

broader contexts
----------------

For defining the context of larger 'chunks' of text, we adopted a different
approach.  Past experience has shown that problems can arise if you invent
tags with a specific museum meaning, and then try to get them to cohabit with
the more neutral tags used to mark up the basic structure of the document.
For example, if a paragraph (<p>) is also an object description
(<objectDesc>), what do you do?  If you mark it as a <p> only, you lose the
piece of information which is most important to you.  If you mark it as an
<objectDesc> only, how do you indicate that it is also a paragraph (for
example, when formatting it for a browser)?  If you mark it as both, does
the paragraph contain the object description, or vice versa? ...

So we have decided to abandon this approach, and find some other means
of indicating the context of a 'chunk' of text.  We have provided a TOPIC
attribute which can qualify any element within the catalogue, e.g.:

     <p TOPIC="object.NMAA.1980.54"> ...  </p>

states that this paragraph is 'about' the NMAA object with accession number
1980.54.  The TOPIC value has two parts: a prefix ('object') which says what
type of access point is involved, and a normalized term ('NMAA.1980.54')
which is unique to this object.  (This part of the TOPIC should be the same
as the REG attribute used to normalize an object's name: see above.)

The standard TEI tags allow us to mark whole sections, single paragraphs
and random parts of the text; we can then assign them a TOPIC in this way.

This approach is not the most elegant one possible - for example if a
paragraph has two topics they must both be declared in a single TOPIC
attribute, because SGML attributes are not repeatable.  It also lacks the
formality of a scheme where every TOPIC is taken from a centrally-controlled,
formally-approved authority list.  However, it is felt to represent a
sensible compromise that CIMI members might actually be willing to implement!

HyTime compatibility
--------------------

There is another standard lurking in the wings: HyTime is a standard (based
on SGML and fully SGML-compatible) for 'Interactive Open Hypermedia'.  The
SGML Working Group has agreed in principle that the CHIO DTD should be
HyTime-compatible.  It should be possible to achieve this without changing
the design principles of the CHIO DTD.

One potential benefit of adopting HyTime is that all the retrievable
resources can be 'pulled together' by what is called a HyTime Hub Document.
This is an SGML document which has links to all the SGML and non-SGML
resources in the system, acting as the 'glue' which holds together
SGML-encoded documents, database records and images.


Retrieval possibilities
=======================

Once CIMI members have marked up their exhibition catalogues with SGML
tags, the next challenge will be to use these resources to answer the
qusetions posed by Project CHIO's public.  SGML documents are a new kind
of beast in the information retrieval world: they are not relational data,
and are most definitely not 'free text'.  What sort of possibilities will
they offer for retrieval?

'tell me about ...'
-------------------

Well, the easiest sort of question to answer will be an unqualified 'tell
me about' one, as mentioned above, e.g.:

     "Tell me about mat-hooking"

If one or more 'chunks' of text exactly match the TOPIC requested, they can
be delivered as a response.  The enquirer gets exactly what they asked for:
some text 'telling them about' the person, object, or whatever they expressed
an interest in.  The fact that it is a piece of prose rather than a database
record will probably be seen as an advantage by the enquirer.  The only issue
we have discussed in this context is: do you give the enquirer just the
'chunk' that answers their query, or do you turn them loose on the whole
catalogue of which it forms a part?

Things rapidly get more complicated.  If the question above is extended
slightly:

     "Tell me about mat-hooking in Nova Scotia"

the same subsection from Spirit of Nova Scotia is still relevant, but it
is much harder to work out that this is true.  You are still looking for
'chunks' with a given TOPIC: the complication is that one of the TOPICs
(Nova Scotia) will have been applied to the whole catalogue, while the other
(mat-hooking) only applies to a single subsection.  It is clearly the
subsection, not the whole catalogue, which should be found: the question is
how you define a search strategy that will deliver this result.

One approach (which is intuitively obvious to people) is to say that every
chunk 'inherits' the TOPICs of all its ancestor chunks.  With this model,
the subsection would have (at least) three TOPICs, two of them inherited:

   method: mat-hooking
   object: textiles    (from the section)
   place:  Nova Scotia (from the whole document)

Then the search becomes simple again: you are just looking for a single chunk
with all the required TOPICs.

Another approach (which might be easier for a computer to implement) is to
search for the TOPICs 'in place', and then see if any chunk with a relevant
TOPIC is nested within another chunk with a relevant topic.  Again, the
smaller of the chunks is the one retrieved.

access points in running text
-----------------------------

So far we have just talked about 'retrieval by TOPIC'.  What about all of
those mentions of materials, people, dates etc. in the text which have been
lovingly marked up?

For the purposes of Project CHIO I would suggest that these are viewed as
properties of the smallest enclosing chunk with a TOPIC attribute.  For
example, the section describing materials that we used earlier occurs
within the mat-hooking subsection (to which I have added a second TOPIC):

     "<div3 TOPIC="method.mat-hooking//object.hooked mats">
     ... Possibly they [hooked mats] were made here [Nova Scotia] before
     1840, but the craft did not burgeon until after mid-century.  Before
     this, <material>linen</material> was used as a ground fabric, but in
     the 1850s, <material>burlap</material> made of <material>jute</material>
     fibres from India was introduced. ..." (Spirit of Nova Scotia, p.33)

We can, in effect, add all the access point terms marked up within the
subsection to the access points implied by its TOPICs:

   method:   mat-hooking
   object:   textiles    (from the section)
   place:    Nova Scotia (from the whole document)
   material: linen
   material: burlap
   material: jute

(Note that, if the REG attribute has been used to provide a normalized
version of an access point term, this is what should be used here.)
This allows us, in principle, to use this subsection to answer questions
such as:

     "What types of object were made from burlap?"
     "Tell me about materials used in hooked mats."

However, the nature of the association between individual access points and
the overall subsection cannot always be clearly deduced.  For example, if
'India' were marked up as a <placeName>, this could give a (false) answer
to the question:

     "Where were hooked mats made?"

Two ways round this problem are: be very selective as to what is marked
up (only mark 'obviously relevant' access points); or go to the other
extreme and do a complete semantic analysis of the text.  I am not sure
that the former approach will completely avoid 'false drops', and it
might lead to missed matches.  The latter approach is out of the question:
marking up the access points and basic text structure is quite enough
overhead for this project, and it is not clear that a search engine would
be sophisticated enough to make good use of the additional markup anyway.

there's always free text
------------------------

As well as these two types of structured information (TOPICs and access
pointsin the text), we have the text of the document itself.  While we are
obviously hoping that our careful marking up and normalization work will lead
to improved precision of retrieval via controlled terms, it would be silly to
ignore the rest of the text as a potential resource for answering questions.

If an enquiry uses a specific word or phrase which has not been separately
marked up, it should be possible to search for it throughout the text.  Each
word has a context, because it will fall within an SGML element of some sort.
This is better than searching through a totally unstructured text.  If the
required word falls within an element with a defined TOPIC, then it can be
thought of as an additional 'property' of that chunk of text.  This allows
non-marked-up words to be used as part of a complex query with more than one
parameter, e.g.:

     "Tell me about objects associated with sacking."

However, you have the usual problem with free text that you cannot specify
that you mean the material 'sacking', as against the gerund.

authority files
---------------

The approach of naming TOPICs and normalizing access point entries will be
most effective if it is applied consistently by all participants in Project
CHIO.  This suggests the need for some sort of authority file giving the
'approved' form of each term.  We are going to investigate how far this can
be achieved in practice as the project proceeds: there are clearly logistical
difficulties in coordinating markup which is taking place simultaneously at
several sites.  Existing authorities (such as the Art and Architecture
Thesaurus, and the Unified List of Artists' Names) will be used wherever
feasible, but they won't cover all the ground.  Certainly we have fought shy
of designing a system which requires out-and-out consistency from the start
(e.g. via what SGML calls Formal Public Identifiers).  There will probably
be a process of review and harmonization once some real exhibition catalogues
have been marked up.

Another way in which authority files are potentially useful to this project
is in the expansion of queries.  If an enquirer asks for "material = wood",
they are probably going to be interested in all mentions of different types
of wood ('pine', 'oak', etc.).  A materials thesaurus can take 'wood' and
return all the more precise terms: these can then be added to the query.
We decided that this was the only feasible way to support query expansion:
the alternative of adding broader terms within the text was rejected as
inefficient and too labour-intensive.

query language
--------------

The retrieval side of Project CHIO will have to solve the intriguing
question of what query language(s) to use.  In principle, every question
should be put, equally, to the collections database and to the SGML
documents in the system.  This requires as a minimum a common front-end, and
facilities for mapping queries to at least two different query languages.
In its later stages, Project CHIO will be using a Z39.50 application profile
to support querying: it will be interesting to see what that can offer.

Providing a common front-end for queries still leaves the question of what
query language to use for the SGML side of the project.  At the present time
(Summer 1995) the SGML world has not standardized on a single query language,
although I understand it is moving towards adopting one.  The SQL language
used by relational databases is probably not relevant: the query language
adopted needs to express SGML document properties, and complex combinations
of  such properties.  However, the fact that we are not planning to use the
more arcane possibilities offered by SGML may make the task simpler.