Metadata Issues

Nicholas Ostler

General Points about OLAC

·        There is no requirement for one-to-one correspondence between OLAC descriptions and data resources. In particular, it is quite acceptable for a number of OLAC descriptions to be set up reflecting different aspects of a resource.

·        OLAC is not being designed for purposes beyond the effective accessing of resources (e.g. to build up User Profiles in the light of past searches).

·        OLAC content will be kept minimally adequate for the above accessing purpose ("broad and shallow").  It might be metaphorically viewed as "design principles for the catalogue cards of the Library of Congress".

·        OLAC cannot change the Dublin Core elements, though it can add to them or add refinements.

·        E-MELD and the LINGUIST List have no intent to be exclusive metadata harvesters. It is a merit of an open standard like OLAC that any organization that wishes can compile this data. However, other organizations, notably North-Western University, have declared that they will adopt a different strategy, with active searches for materials. It is the judgement of LINGUIST that this is likely to prove less effective than their own strategy, namely the soliciting of standard (OLAC) reports on materials held.

Relations of OLAC with IMDI

There is a need to harmonize the controlled vocabulary of these two schemata for metadata, especially in the field of content descriptions.

But no decision can be made in this forum, since each will have to consult on any changes with its own community. 

Some immediate differences of background:

·        IMDI must also take account of Language Engineering requirements;

·        IMDI has a focus on storage of primary data

·        IMDI evidently offers much finer analysis of the resources it describes

To achieve Inter-operability of OLAC and IMDI

·        IMDI might define wrappers for sets of resources, with pointers to OLAC entries

·        IMDI might offer different content blocks for field linguists, and for language engineers

·        OLAC allowed for the inclusion in the TYPE element of an indefinite number of Attribute-Value pairs (as indeed of any arbitrary text material). These could express IMDI content that went beyond OLAC, though this would evidently not be accessible to an OLAC search.

·        OLAC records could bear an IMDI icon, showing that they were open to more sophisticated (IMDI) search

·        Overall, it would be necessary to have further, and specific, technical discussions on how the two formats would interact, and this might be a suitable task for funding within the E-MELD grant.
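As a rough illustration of the Attribute-Value suggestion above, the following Python sketch packs IMDI-style details into the free text of a Type value and recovers them again. The "key=value; key=value" syntax, the function names and the sample keys are all assumptions made for this illustration; nothing in the discussion prescribes them.

```python
# Hypothetical encoding of attribute-value pairs inside a free-text Type value.
# The "key=value; key=value" syntax is an assumption, not part of OLAC or IMDI.

def pack_pairs(pairs):
    """Serialise a dict of attribute-value pairs into one text string."""
    return "; ".join(f"{key}={value}" for key, value in pairs.items())


def unpack_pairs(text):
    """Recover the attribute-value pairs from a packed string."""
    pairs = {}
    for chunk in text.split(";"):
        if "=" in chunk:
            key, value = chunk.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


# Sample keys and values are invented for the example.
imdi_extras = {"Genre": "narrative", "Interactivity": "non-interactive"}
type_value = pack_pairs(imdi_extras)
```

A plain OLAC search would see such a value only as opaque text, which matches the limitation noted above; only a tool aware of the convention could read the pairs back.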

Peter Wittenberg declared that, where possible, IMDI would carry over the same terminology used in OLAC, subject to the proviso that this remained an open standard. (Language codes derived from SIL might not be open in the required sense.)

Modes and Tools for Metadata Entry

Three were suggested:

1.      a browser window on a web page

2.      a downloadable metadata editor (particularly useful in the Australian ASEDA experience)

3.      a means of harvesting and transforming relevant entries in a pre-existing catalogue, effectively to add a new view on it (considered as a database)

Any of these might diminish the overhead (often in finding available personnel) of producing OLAC-compatible metadata.
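The third mode, adding a new view onto an existing catalogue, might look something like the Python sketch below. The catalogue column names, the mapping table and the sample rows are invented for illustration; a real catalogue would need its own mapping onto the Dublin Core element names.

```python
# Hypothetical mapping from the column names of an existing catalogue
# to Dublin Core element names; both sides are invented for illustration.
FIELD_MAP = {
    "author": "creator",
    "name": "title",
    "lang": "language",
    "year": "date",
}


def to_olac_view(row):
    """Return a new record holding only the mapped, non-empty fields."""
    return {
        dc_name: row[column]
        for column, dc_name in FIELD_MAP.items()
        if row.get(column)
    }


# Invented sample rows; "shelf" stands for a local field with no mapping.
catalogue = [
    {"author": "A. Researcher", "name": "Field recordings",
     "lang": "mi", "year": "1998", "shelf": "B12"},
    {"author": "", "name": "Word list",
     "lang": "yua", "year": "", "shelf": "C03"},
]
olac_view = [to_olac_view(row) for row in catalogue]
```

The original catalogue is left untouched; the transformed records are simply a second view of it, which is why this mode needs little extra effort from archive staff.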

Comments on Specific Elements

Creator / Contributor

In various ways this contrast (inherited from DC) was ambiguous in the context of pan-linguistic discourse. Notably:

·      which was which in the standard situation with an Interviewee/Informant/Consultant and an Interviewer/Researcher?

·      should the audience be rated a Contributor (or even a Creator) in cultures where a listener was required to attend any proper performance (e.g. Mayan, Maori discourse)?

·      where a data set included many speakers, should the Collector (in IMDI terminology) be termed the Creator?

The meeting did not arrive at any recommended answers to these questions. The actual element names could not be changed to something less ambiguous since they were set by DC. At best it would be possible for some guidance on best practice to be offered.

At any rate, different guidelines should govern a recorded Q&A session and a recorded monologue, such as a narrative.

A major source of ambiguity was the double purpose to which the Creator element was likely to be put: viz primary key for search, and holder of primary rights to distribution. It was agreed that this second interpretation was strictly illegitimate, since DC offered a Rights element in which such information could be specified. Also, it was of course possible for search to range over Contributor as well as Creator, so neither criterion was strictly necessary for assigning these element titles.

Interpreter was a sub-type of this role which was itself ambiguous, as between an actual interpreter present at a discourse session and someone who subsequently assisted a transcriber of the recording.

Gary Simon noted that:

1.    The distinction was more one of rank, than of function.

2.    There could be a controlled vocabulary for subtypes of Creator or Contributor, e.g. {Transcriber, Elicitor...}, probably drawn from a single set of terms.  This has not (yet) been provided.  The closest thing in IMDI was the {Participant.Type} terms, which did not constitute a complete set.

3.    A further ambiguous type of Creator/Contributor was the person responsible for having deposited a mirror copy of an archive from somewhere else.
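The two ideas above, a single set of role terms shared by Creator and Contributor and a search that ranges over both elements, can be sketched in Python as follows. Only Transcriber and Elicitor are taken from the discussion; the other role terms, the record layout and the names are assumptions made for the example.

```python
# Hypothetical single set of role terms shared by Creator and Contributor.
# Only Transcriber and Elicitor come from the discussion; the rest are invented.
ROLES = {"Transcriber", "Elicitor", "Speaker", "Interpreter", "Depositor"}


def make_person(element, name, role):
    """Build one Creator or Contributor entry with a checked role term."""
    if element not in ("creator", "contributor"):
        raise ValueError(f"unknown element: {element}")
    if role not in ROLES:
        raise ValueError(f"role not in controlled vocabulary: {role}")
    return {"element": element, "name": name, "role": role}


def search_people(records, name):
    """Find records naming a person, whether as Creator or Contributor."""
    return [
        record["id"]
        for record in records
        if any(person["name"] == name for person in record["people"])
    ]


# Invented sample records.
records = [
    {"id": "rec-1", "people": [
        make_person("creator", "Speaker A", "Speaker"),
        make_person("contributor", "Researcher B", "Elicitor"),
    ]},
    {"id": "rec-2", "people": [
        make_person("creator", "Researcher B", "Transcriber"),
    ]},
]
hits = search_people(records, "Researcher B")
```

Because the search ignores which element a name appears under, the choice between Creator and Contributor does not affect whether a record can be found, which is the point made in the discussion above.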

Coverage

This is defined as "spatial and temporal aspects of aboutness", the effective range of the topic of discourse. It does not refer to the Date (of some event in the life history of the record).

Date

This could be refined e.g. as Created, Modified, Issued, Transcribed...

Description

This contained free text. There was no need to introduce Table of Contents, or Abstract as refinements.

Format

Primarily based on MIME formats: text/html, text/plain, audio/wav, image/bmp ... These were defined in RFC documents submitted to the IETF. It was possible to expand this list (e.g. for CHAT) by listing the new format as "text/plain", and then specifying its technical constraints in Format.markup.
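A minimal Python sketch of how a Format assignment tool might apply this, assuming a small hand-made lookup table: the MIME types are those named above, while the file extensions, the "CHAT" markup label and the function name are assumptions for illustration only.

```python
# Hypothetical extension-to-format table. The MIME types are those named in
# the text; the extensions and the "CHAT" markup label are assumptions.
FORMATS = {
    ".html": ("text/html", None),
    ".txt": ("text/plain", None),
    ".wav": ("audio/wav", None),
    ".bmp": ("image/bmp", None),
    ".cha": ("text/plain", "CHAT"),  # CHAT listed as plain text plus markup
}


def describe_format(filename):
    """Return a dict with Format and, where relevant, Format.markup."""
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    if extension not in FORMATS:
        raise ValueError(f"no format known for: {filename}")
    mime_type, markup = FORMATS[extension]
    description = {"Format": mime_type}
    if markup is not None:
        description["Format.markup"] = markup
    return description
```

Keeping the table short is a deliberate choice in this sketch: a tool restricted to a handful of formats stays manageable, whereas one offering every registered MIME type would not.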

It could be a drawback if the full range of MIME formats continued to be allowed, since this might make a Format assignment tool hopelessly cumbersome.

The general constraint on DC elements was that they must be equally applicable in principle to all data types.  Hence one should not try to include more medium-specific attributes (such as sampling rate for a sound file).  In this, OLAC contrasted with IMDI (with its MediaFile, and Recording Conditions).

It seemed unnecessary to specify the coding within a text file (as RTF, Word95) since this could be automatically picked up by a reading device.

Format.cpu, Format.os, Format.sourcecode

These elements made sense for Tools only (not data media) and so were exceptions to the general principle that elements must be equally applicable in principle to all data types. They would also be relevant to certain forms of data (if a specific tool was needed to unpack them).

There was an unresolved problem: who would enter this type of data?

Type.data

It was recommended that this be replaced by
Type.linguistic

Initially, four values were recommended for type.linguistic:

·      transcription (of spoken data)

·      annotation (of spoken data)

·      description (of spoken data)

·      lexicon

It was suggested that where the text was original (i.e. not derived by transcribing an oral performance), a new value might be added

·      text

The above had been recommended as a replacement for a prior open-ended list:

·      grammar

·      lexicon

·      word list

·      ...
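The difference between a closed set of values and the earlier open-ended list can be sketched in Python as a simple check. The tuple holds the four recommended values plus the suggested "text"; the function name and the normalisation (trimming and lower-casing) are assumptions for illustration.

```python
# The four recommended values, plus the suggested additional value "text".
TYPE_LINGUISTIC = ("transcription", "annotation", "description", "lexicon", "text")


def check_type_linguistic(value):
    """Return the normalised value, or raise if it is outside the closed set."""
    normalised = value.strip().lower()
    if normalised not in TYPE_LINGUISTIC:
        raise ValueError(f"not a Type.linguistic value: {value!r}")
    return normalised
```

Under such a check, a term from the earlier list such as "grammar" would be rejected and would have to be expressed some other way, for instance through a subtype or a Genre list as discussed below.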

Types of audio data were supposedly classifiable under Format.

However, the dubious case of mixed data (e.g. illustrated text) was raised, inconclusively, and a competing suggestion was made to reinstate a list of Genres, perhaps as a subtype of one of the Type.linguistic values.

In general it was accepted that the subclassification of this element needed a considerably extended discussion in its own right.

See http://www.language-archives.org/private/type.html for the current state of the art on this discussion.

Other questions raised and not answered (perhaps misunderstood by this scribe):

It was not clear how OLAC could refer to an external data set. (Answered by the remark on possible non-uniqueness of OLAC descriptions to a resource.)

Other questions raised in discussion

It was pointed out that formatted databases (e.g. relational databases of the type presented by Sergei Starostin) have no clear status in the OLAC system. But Gary Simon noted that Dataset was one value of DC Type.