Final Report, Language Digitization Workshop

D. H. Whalen

 

The workshop was quite successful and led to some very useful yet achievable goals, which are listed at the end of this report. First, though, it became apparent that there were several paradoxes in what linguists wanted from the web:

 

1. We wanted to use extant tools but we wanted to be able to satisfy specific needs.

2. We wanted to make the data available, but feared that inadequate attention would be paid to the limitations of the data.

3. We want to be able to describe the data in just the way we want, but we don't want to program it.

4. We know that all languages are the same, yet we know that all languages are different.

 

Resolving these paradoxes will not be easy (especially the last one, since it is a genuine fact of the field).

 

The three working groups focused on aspects of language that will have to be included in the markup on the web. The language codes are complicated not because linguists insist, but because that is the way the world is--there are lots of languages. The markup and metadata fields are more under our control, and will be complicated because linguists have not agreed on which descriptive tools are necessary.

 

The Trail of Tiers: When linguistic markup first became an issue, it seemed that three tiers would be needed: a phonological transcription, a literal, morpheme-level translation, and a free translation. This is not sufficient for many kinds of work, and the number of tiers apparently needs to be unbounded. Lieb and Drude, for example, have proposed 26 tiers for morpho-syntax alone. The phonological level is not sufficient if we consider, at a bare minimum, the phonetic transcription as well as the phonological (there is no way to test new phonological rules otherwise). But Articulatory Phonology (of Browman and Goldstein) requires at least another 10 tiers (happily, already called "tiers" in that theory), and a variety of autosegmental levels could require many more. These new tiers, however, must be searchable in a principled way across data sets, or there is no advantage to standardization. How to search across multiple tiers for tags that are near-synonyms but not exact matches is a challenge that has not really been addressed yet.

 

Multiple Time Series: When most of these data sets are described, they assume that there is one time series, the audio recording. Many recordings now include at least one other, the video image, which is a time series of a different degree of complexity. But, at least in phonetics, there are many other possible time series, such as an associated pressure reading, jaw displacement based on a strain gauge, a nasal accelerometer reading, etc. If we look at electromagnetic midsagittal articulometry, then we have upwards of 30 associated time series just for the raw data. Then if we consider that there are many derived signals that may be of interest, such as fundamental frequency, lip aperture, and velocity of tongue dorsum movement (to name a few), the range of time series also expands in the same way the tiers do. It is not clear that XML handles these very well, and it may be that this will force us to consider some relatively drastic modifications.
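
To make the problem concrete, here is a minimal sketch of several time series that share one clock but differ in sampling rate. It is only an illustration under stated assumptions: the class name, field names, sampling rates, and sample values are all hypothetical, and nothing here is a format the workshop proposed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TimeSeries:
    """One recorded or derived signal (hypothetical structure)."""
    name: str
    sample_rate_hz: float
    samples: List[float]

    def duration(self) -> float:
        """Length of the signal in seconds."""
        return len(self.samples) / self.sample_rate_hz

    def value_at(self, t: float) -> float:
        """Return the sample in effect at time t (seconds from the start)."""
        if t < 0:
            raise IndexError(f"time {t} is before the start of {self.name}")
        index = int(t * self.sample_rate_hz)
        if index >= len(self.samples):
            raise IndexError(f"time {t} is past the end of {self.name}")
        return self.samples[index]


def snapshot(series: List[TimeSeries], t: float) -> Dict[str, float]:
    """Read every series at the same moment on the shared clock."""
    return {s.name: s.value_at(t) for s in series}


# Made-up data: an "audio" signal at 4 Hz and a "jaw" signal at 2 Hz.
audio = TimeSeries("audio", 4.0, [0.0, 0.1, 0.2, 0.3])
jaw = TimeSeries("jaw", 2.0, [5.0, 6.0])
reading = snapshot([audio, jaw], 0.5)
```

Each added signal, raw or derived, is simply one more entry in the list, so the number of series can grow the way the number of tiers does. Whether such a structure can be expressed comfortably in XML is exactly the open question.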

 

The Language Codes working group decided to form a consortium to recommend language names and codes. The consortium will work with SIL on improving coverage, and will also work with the Unicode and ISO groups so that the recommendations can be adopted. ISO apparently wants the full coverage of languages to use a four-letter code, since they have already used their three-letter code for the woefully inadequate 200-language list. My own recommendation was that we could accommodate that by having the first character indicate whether the language is Ancient (Axxx), Basic (Bxxx), or Constructed (Cxxx). This would allow us to deal with Hittite and Klingon in a principled way, and would let the SIL codes remain essentially three letters long. In addition, the consortium will work on a way of implementing alternate trees for language families, since these are not agreed upon in general.
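
The prefix recommendation can be sketched in a few lines. This is a hedged illustration of the idea only: the function names and the dictionary are hypothetical, the scheme was a suggestion rather than an adopted standard, and the three-letter codes used below are placeholders chosen for the example.

```python
# Category prefixes following the Axxx / Bxxx / Cxxx suggestion in the text.
CATEGORY_PREFIXES = {"ancient": "A", "basic": "B", "constructed": "C"}
PREFIX_TO_CATEGORY = {prefix: name for name, prefix in CATEGORY_PREFIXES.items()}


def make_code(category, three_letter):
    """Build a four-letter code from a category name and a three-letter code."""
    if category not in CATEGORY_PREFIXES:
        raise ValueError(f"unknown category: {category!r}")
    if len(three_letter) != 3 or not three_letter.isalpha():
        raise ValueError(f"expected three letters, got {three_letter!r}")
    return CATEGORY_PREFIXES[category] + three_letter.lower()


def split_code(code):
    """Split a four-letter code into (category name, three-letter code)."""
    if len(code) != 4 or not code.isalpha():
        raise ValueError(f"expected four letters, got {code!r}")
    prefix, rest = code[0].upper(), code[1:].lower()
    if prefix not in PREFIX_TO_CATEGORY:
        raise ValueError(f"unknown category prefix: {prefix!r}")
    return PREFIX_TO_CATEGORY[prefix], rest


# Placeholder three-letter codes, for illustration only.
hittite = make_code("ancient", "hit")
klingon = make_code("constructed", "tlh")
```

Because the last three letters are carried over unchanged, an existing three-letter code can be recovered from the four-letter form with `split_code`.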

 

The Metadata working group agreed to start with the OLAC set, and to coordinate with other groups working on this problem. The group felt that detail needed to be added to the "type.linguistic" field, and that it would probably take a continuation of this working group in order to make progress on that issue.

The Markup working group came to one early decision, that XML would be the markup language. It later became less clear (at least to me) that this will be a fully acceptable decision, but at least it was a decision. It was also agreed that a basic set of lists of markup tags will be generated. The group will start with additions to and modifications of the set proposed for EUROTYP, which is being published in July 2001. This is for morphosyntax only, and generating an adequate list for other areas will be a larger project. Will these ever be agreed upon? It seems unlikely, so it is probably necessary to have search engines that can combine across a variety of tags so that one researcher's "punctual" can be equated with another's "punctative", or not. Tool development was seen as essential, since none of these markup schemes will be used unless they are easy to implement. The need for a 'confidence level' indication was also pointed out. Although a global scale could be used, it would be more helpful to have detailed indications (e.g., 'segments are correct but tone was mostly ignored'). A listserv will be implemented to continue discussion of these issues, and an advisory board (or boards) should be set up.
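
The kind of search described above, where two tags are equated or not at the user's choice, might look something like the following. This is a rough sketch under assumptions: the equivalence table, the record layout, and the function names are all hypothetical, and a real tool would need a much richer notion of near-synonymy.

```python
# Hypothetical equivalence sets: each set groups tags that a search
# may optionally treat as the same.
EQUIVALENCE_SETS = [
    {"punctual", "punctative"},
]


def expand_tag(tag, equate=True):
    """Return the set of tags to look for when searching for `tag`."""
    if equate:
        for group in EQUIVALENCE_SETS:
            if tag in group:
                return set(group)
    return {tag}


def search(records, tag, equate=True):
    """Return the records carrying `tag` or, if `equate` is set, a near-synonym."""
    wanted = expand_tag(tag, equate)
    return [record for record in records if wanted & set(record["tags"])]


# Made-up records from two researchers who use different tags.
records = [
    {"id": 1, "tags": ["punctual"]},
    {"id": 2, "tags": ["punctative"]},
    {"id": 3, "tags": ["durative"]},
]
loose = [r["id"] for r in search(records, "punctual")]
strict = [r["id"] for r in search(records, "punctual", equate=False)]
```

The `equate` flag is the "or not" of the paragraph above: the researcher decides at search time whether the two tags count as the same thing.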

 

Target dates:

 

Language Codes:

------By July: Set up listserv. (DONE.)

------By August 31: Establish Consortium.

------Before Sept. Unicode meeting: Write reaction to Unicode scheme.

------By Sept.: Contact ISO about four-letter scheme.

------By Oct.: Establish consortium feedback area within Linguist List.

------By Feb. 2002: Finalize initial decisions.

 

Metadata:

------By Sept.: Harvest data from existing databases.

------By Sept.: Make formal rec. to OLAC for including "dict" and "genre".

------By Oct.: Make list of existing tools.

------In Dec.: Present Emeld experience to OLAC in Philadelphia.

 

Markup:

------By July: Set up listserv. (DONE.)

------By Aug.: Send out EUROTYP tags for comment by community.

------By Aug.: Send out request for phonetic and phonological tags.

------By end of Oct.: Collate EUROTYP responses.

------By Feb. 2002: Publish initial list of morphosyntactic tags.