Final Report, Language Digitization Workshop
D. H. Whalen
The workshop was quite successful and led to some very useful yet achievable goals, which are listed under the target dates at the end of this report. First, though, it became apparent that there were several paradoxes in what linguists wanted from the web:
1 We wanted to use extant tools, but we also wanted to be able to satisfy specific needs.
2 We wanted to make the data available, but feared that inadequate attention would be paid to the limitations of the data.
3 We wanted to be able to describe the data in just the way we wanted, but we did not want to program it.
4 We know that all languages are the same, yet we know that all languages are different.
Resolving these paradoxes will not be easy (especially the last one, since it is a genuine fact of the field).
The three working groups focused on aspects of language that will have to be included in the markup on the web. The language codes are complicated not because linguists insist, but because that is the way the world is--there are lots of languages. The markup and metadata fields are more under our control, and will be complicated because linguists have not agreed on which descriptive tools are necessary.
The Trail of Tiers: When linguistic markup first became an issue, it seemed that three tiers would be needed: a phonological transcription, a literal, morpheme-level translation, and a free translation. This is not sufficient for many kinds of work, and the number of tiers seems to need to be unbounded. Lieb and Drude, for example, have proposed 26 tiers for morpho-syntax alone. A phonological tier alone is not sufficient either: at a bare minimum, we need the phonetic transcription as well as the phonological one, since there is otherwise no way to test new phonological rules. Articulatory Phonology (of Browman and Goldstein) requires at least another 10 tiers (happily, already called "tiers" in that theory), and a variety of autosegmental levels could require many more. These new tiers, however, must be searchable in a principled way across data sets, or there is no advantage to standardization. How to search across multiple tiers for tags that may be near-synonyms but not exact matches is a challenge that has not really been addressed yet.
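To make the search problem concrete, here is a minimal sketch in Python of one way an open-ended set of tiers could be searched with near-synonym tags. Nothing here was proposed at the workshop: the Annotation record, the tier names, and the SYNONYM_SETS table are all assumptions made for illustration, and the equivalences would in practice have to be supplied and vetted by researchers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One tagged interval on one tier (hypothetical record layout)."""
    tier: str     # free-form tier name, so the number of tiers is unbounded
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    tag: str


# Hypothetical, hand-maintained sets of tags treated as near-synonyms.
SYNONYM_SETS = [
    {"punctual", "punctative"},
]


def expand(tag, synonym_sets=SYNONYM_SETS):
    """Return the tag together with any near-synonyms declared for it."""
    result = {tag}
    for group in synonym_sets:
        if tag in group:
            result |= group
    return result


def search(annotations, tag, tiers=None, use_synonyms=True):
    """Find annotations carrying the tag, optionally restricted to some tiers.

    With use_synonyms=False only exact matches are returned, so the
    researcher decides whether near-synonyms are equated or not.
    """
    wanted = expand(tag) if use_synonyms else {tag}
    return [
        a for a in annotations
        if a.tag in wanted and (tiers is None or a.tier in tiers)
    ]
```

A search for "punctual" over two data sets that use different tier names and different spellings of the tag would then return hits from both, unless the caller asks for exact matches only.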
Multiple Time Series: Most descriptions of these data sets assume that there is one time series, the audio recording. Many recordings now include at least one other, the video image, which is a time series of a different degree of complexity. But, at least in phonetics, there are many other possible time series, such as an associated pressure reading, jaw displacement based on a strain gauge, a nasal accelerometer reading, etc. If we look at electromagnetic midsagittal articulometry, then we have upwards of 30 associated time series just for the raw data. If we then consider the many derived signals that may be of interest, such as fundamental frequency, lip aperture, and velocity of tongue dorsum movement (to name a few), the range of time series expands in the same way the tiers do. It is not clear that XML handles these very well, and this may force us to consider some relatively drastic modifications.
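As an illustration of what handling several time series involves, the following Python sketch stores each signal with its own sampling rate and start time, reads all of them at a common instant, and computes one derived signal. The TimeSeries class, the signal names, and the sampling rates are assumptions chosen for the example; they do not describe any existing tool or any format discussed at the workshop.

```python
class TimeSeries:
    """Uniformly sampled signal; sample i is taken at start + i / rate_hz."""

    def __init__(self, name, rate_hz, samples, start=0.0):
        self.name = name
        self.rate_hz = rate_hz
        self.samples = list(samples)
        self.start = start  # seconds, on a timeline shared by all signals

    def value_at(self, t):
        """Linearly interpolate the signal at time t (in seconds)."""
        pos = (t - self.start) * self.rate_hz
        last = len(self.samples) - 1
        if pos < 0 or pos > last:
            raise ValueError(f"{self.name}: no data at t={t}")
        i = int(pos)
        if i >= last:
            return float(self.samples[last])
        frac = pos - i
        return self.samples[i] * (1 - frac) + self.samples[i + 1] * frac

    def derivative(self, name):
        """Derived signal: first difference per second (e.g. velocity).

        Each difference is placed midway between the two samples it uses.
        """
        diffs = [
            (b - a) * self.rate_hz
            for a, b in zip(self.samples, self.samples[1:])
        ]
        return TimeSeries(name, self.rate_hz, diffs,
                          start=self.start + 0.5 / self.rate_hz)


def snapshot(series, t):
    """Read every signal at the same instant, whatever its sampling rate."""
    return {s.name: s.value_at(t) for s in series}
```

For example, an audio channel sampled at 1000 Hz and a jaw displacement channel sampled at 100 Hz can be read together with snapshot([audio, jaw], 0.015), and jaw.derivative("jaw_velocity") adds a derived series to the same collection. The point of the sketch is only that the signals share a timeline, not a sampling grid.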
The Language Codes working group decided to form a consortium to recommend language names and codes. The consortium will work with SIL on improving coverage, and will also work with the Unicode and ISO groups to make this work. ISO apparently wants the full coverage of languages to use a four-letter code, since they have already used their three-letter code for the woefully inadequate 200-language list. My own recommendation was that we could accommodate that by having the first character indicate whether the language is Ancient (Axxx), Basic (Bxxx), or Constructed (Cxxx). This would allow us to deal with Hittite and Klingon in a principled way, and would let the SIL codes remain essentially three letters long. In addition, the consortium will work on a way of implementing alternate trees for language families, since these are not agreed upon in general.
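The prefix recommendation can be sketched in a few lines of Python. This is only an illustration of the Axxx/Bxxx/Cxxx idea as described above: the function names are invented here, the letter case follows the "Axxx" pattern in the text as an assumption, and the three-letter stems used in the usage note are placeholders rather than codes assigned by SIL or ISO.

```python
# Category prefixes as recommended in the report.
PREFIXES = {"ancient": "A", "basic": "B", "constructed": "C"}
CATEGORIES = {letter: name for name, letter in PREFIXES.items()}


def to_four_letter(stem, category):
    """Prefix a three-letter stem with its category letter."""
    if len(stem) != 3 or not (stem.isascii() and stem.isalpha()):
        raise ValueError(f"expected a three-letter stem, got {stem!r}")
    return PREFIXES[category] + stem.lower()


def from_four_letter(code):
    """Split a four-letter code into (category, three-letter stem)."""
    if len(code) != 4 or code[0] not in CATEGORIES:
        raise ValueError(f"not a recognized four-letter code: {code!r}")
    return CATEGORIES[code[0]], code[1:]
```

With placeholder stems, to_four_letter("hit", "ancient") gives "Ahit" and from_four_letter("Ctlh") gives ("constructed", "tlh"). Because the stem is carried over unchanged, any existing three-letter code survives inside the longer one.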
The Metadata working group agreed to start with the OLAC set, and to coordinate with other groups working on this problem. The group felt that detail needed to be added to the "type.linguistic" field, and that it would probably take a continuation of this working group in order to make progress on that issue.
The Markup working group came to one early decision, that XML would be the markup language. It later became less clear (at least to me) that this will be a fully acceptable decision, but at least it was a decision. It was also agreed that a basic set of lists of markup tags will be generated. The group will start with additions to and modifications of the set proposed for EUROTYP, which is being published in July 2001. That set is for morphosyntax only, and generating an adequate list for other areas will be a larger project. Will these ever be agreed upon? It seems unlikely, so it is probably necessary to have search engines that can combine across a variety of tags so that one researcher's "punctual" can be equated with another's "punctative", or not. Tool development was seen as essential, since none of these markup schemes will be used unless they are easy to implement. The need for a 'confidence level' indication was also pointed out. Although a global scale could be used, it would be more helpful to have detailed indications (e.g., 'segments are correct but tone was mostly ignored'). A listserv will be implemented to continue discussion of these issues, and an advisory board (or boards) should be set up.
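One possible shape for the 'confidence level' indication is sketched below in Python: a global rating plus optional ratings for individual aspects of the transcription, with a free-text note. The three-point scale, the field names, and the aspect labels are assumptions made for this example; the working group did not settle on any of them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical three-point scale, ordered from least to most reliable.
LEVELS = ("low", "medium", "high")


@dataclass
class Confidence:
    """Global rating with optional per-aspect detail (illustrative only)."""
    overall: str
    by_aspect: Dict[str, str] = field(default_factory=dict)
    note: str = ""

    def level(self, aspect: Optional[str] = None) -> str:
        """Rating for one aspect, falling back to the global rating."""
        if aspect is None:
            return self.overall
        return self.by_aspect.get(aspect, self.overall)

    def at_least(self, minimum: str, aspect: Optional[str] = None) -> bool:
        """True if the rating for the aspect meets the required minimum."""
        return LEVELS.index(self.level(aspect)) >= LEVELS.index(minimum)
```

A record such as Confidence("medium", {"segments": "high", "tone": "low"}, note="segments are correct but tone was mostly ignored") would let a researcher working on segments keep the data while one working on tone filters it out, which a single global rating cannot express.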
Target dates:
Language Codes:
------By July: Set up listserv. (DONE.)
------By August 31: Establish Consortium.
------Before Sept. Unicode meeting: Write reaction to Unicode scheme.
------By Sept.: Contact ISO about four-letter scheme.
------By Oct.: Establish consortium feedback area within Linguist List.
------By Feb. 2002: Finalize initial decisions.
Metadata:
------By Sept.: Harvest data from existing databases.
------By Sept.: Make formal recommendation to OLAC for including "dict" and "genre".
------By Oct.: Make list of existing tools.
------In Dec.: Present Emeld experience to OLAC in Philadelphia.
Markup:
------By July: Set up listserv. (DONE.)
------By Aug.: Send out EUROTYP tags for comment by community.
------By Aug.: Send out request for phonetic and phonological tags.
------By end of Oct.: Collate EUROTYP responses.
------By Feb. 2002: Publish initial list of morphosyntactic tags.