Pino, Marta: SGML Encoding two large Spanish corpora with the TEI scheme: design and technical aspects of textual markup

[Mirrored from: http://www.cs.vassar.edu/~ide/DL96/pino.txt]

Marta Pino. Computational Linguistics Department. Instituto de Lexicografía. Real Academia Española. E-mail: mpino@crea.rae.es

1. Introduction

The Lexicographic Institute of the Royal Spanish Academy is compiling two large corpora: a reference corpus of modern Spanish, called CREA (Corpus de Referencia del Español Actual), and a historical corpus, known as CORDE (Corpus Diacrónico del Español).

CREA is a monitor corpus that covers the last 25 years of the language. Once it is completely compiled, it will cover all the varieties of Spanish language use from 1975 to 2000. It will contain 200 million words of running text, providing an empirical basis for lexicographic and grammatical research. At the present stage it has 8 million words, partially encoded.

CORDE is a corpus of 80 million words that covers the rest of the history of Spanish, from its origins to 1975. Since CREA is a monitor corpus, it will periodically pass its oldest texts to CORDE.

Both corpora are being encoded and morphosyntactically tagged. In the future, syntactic and pragmatic information may also be associated with the texts.

The aim of this paper is to show the main principles of the encoding scheme applied to these corpora, focusing on some particular encoding problems and their TEI or non-TEI solutions.

2. Overall structure of the encoding scheme for CREA and CORDE

2.1. Structural classification of the TEI.2 documents

Within each corpus there are two main types of TEI.2 documents, which correspond to the difference between unitary texts (UT) and composite texts (CT). On the one hand, there are autonomous textual units, like single books or any other object published independently.
These are called "unitary texts", and correspond to TEI.2 documents whose structure consists of a header and a text, the latter subdivided into front, body and back. On the other hand, there are texts that, although they are also independent objects, have a more complex structure made up of different and relatively independent texts, like newspapers, magazines or anthologies. Since these texts are made of texts, they are "composite texts", and give rise to TEI.2 elements that consist of a header and a text, which in turn includes a front and a group of texts, each with its own structure of front, body and back. In this paper, we will use the term "nested texts" (NT) to refer to the texts included in a composite text, to differentiate them from the unitary and the composite ones.

2.2. General structure of each corpus as an SGML document

In order to convert a corpus into an SGML document, it has been necessary to add certain types of markup to the texts, and also to associate the corpus with an SGML declaration and a document type definition (DTD). Figure 1 shows the way the texts are organized in the corpus. There is a header for the whole corpus and a series of TEI.2 elements. Some of these TEI.2 documents are unitary, and some of them are composite.

Fig. 1. Overall structure of a corpus as an SGML document (the SGML tags of the figure are not preserved in this copy; only the bracketed labels remain):

[instance of unitary text] ...
[instance of composite text] ...
    [instance of component text] ...
    [instance of component text] ...

As the figure shows, the SGML document has another component apart from the corpus. It consists, on the one hand, of an SGML declaration, which contains certain technical details concerning the variety of SGML parameters selected, and some other codes especially useful for text interchange. On the other hand, there is a DTD associated with the corpus, which is the document where all the SGML elements and entities used in the texts are declared.
Here are some examples of these two aspects:

Fig. 2. Fragment of the TEI SGML Declaration adopted for CREA and CORDE: [content not preserved in this copy]

Fig. 3. Fragment of the DTD written for the two corpora: [content not preserved in this copy]

The corpus itself consists of a TEI header and a series of TEI.2 documents, which are also divided into TEI header and text (unitary TEI.2 documents) or into TEI header and group of texts (composite TEI.2 documents). The next sections will analyse different encoding aspects of the corpus.

2.3. Types of references associated to the corpus

As figure 1 showed, some elements have an identifier attribute. The function of this information is to differentiate parts of the corpus by means of a unique code. But not all the references used in the corpus are of this type. We have distinguished four main types of references, which correspond to the next four subsections.

2.3.1. Identification references

The first type of reference is the one that identifies TEI.2 documents or nested texts within a corpus. It is a unique code assigned to each text (nested, composite or unitary) with the aim of making it recognizable within the two corpora. The codes must also differ between the two corpora, so that there is no collision when a text from CREA passes to CORDE.
These references are constructed by the design department of the corpora, and consist of the following parts:

In CREA, the code for unitary or composite, but not journalistic, texts specifies the corpus name, the medium, the superfield and thematic area, and the number within the thematic area, followed by an optional second thematic area:

CR.L.1.01.001
CR.L.1.01.001.2.01

The code of composite journalistic texts indicates the name of the corpus, the medium, the title and date of the publication, and the number within this title:

CR.P.PA1995.001

The code for an analytic text within a composite journalistic text is like the last one, with the addition of a number:

CR.P.PA1995.001.001

In CORDE, the system is the same for journalistic texts, but not for the rest. For these, the code specifies the name of the corpus, the age or period the text belongs to, the genre and subgenre, the number within the subgenre, and an optional second genre or subgenre:

CO.M.11.A11.001
CO.M.11.A11.001.12.A12

2.3.2. Internal location references

The second type of reference provides a code that serves to find a fragment of text within the corpus. Every part of the texts will have this information, so that any example or fragment extracted from the corpus can be traced back by any researcher. These references consist of the following data:

In unitary or composite non-journalistic texts, the reference specifies the name of the text, a number assigned to this name, and the page of the example:

ade001:56   El adefesio, by Rafael Alberti.
cas001:80   Castilla, by Azorín.

In composite journalistic texts, the code specifies the name of the publication, the year, and the number within the year:

vo1995.001   (first issue of 1995 of the newspaper La Voz de Galicia included in the corpus)

Nested texts always add a number before the page indication:

vo1995.001:78

2.3.3. References of the TEI header

Some parts of the TEI header also have a code.
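Before turning to the header, the identification codes of section 2.3.1 can be illustrated with a small sketch in Python. It is not the tool used by the design department: the function name, the field names and the test used to recognise a journalistic code (a field made of letters followed by a four-digit year, as in "PA1995") are assumptions inferred from the example codes given above.

```python
import re

# Shape of the "title + year" field of journalistic codes, e.g. "PA1995".
# This test, like the field names below, is an assumption of the sketch.
TITLE_YEAR = re.compile(r"^([A-Za-z]+)(\d{4})$")


def parse_identification_code(code):
    """Split a CREA or CORDE identification code into named fields."""
    parts = code.split(".")
    corpus = parts[0]
    if corpus not in ("CR", "CO"):
        raise ValueError("unknown corpus prefix in " + code)

    title_year = TITLE_YEAR.match(parts[2]) if len(parts) > 2 else None
    if title_year:
        # Journalistic texts: corpus, medium, title+year, number and,
        # for a nested (analytic) text, one more number.
        if len(parts) not in (4, 5):
            raise ValueError("unexpected number of fields in " + code)
        fields = {
            "corpus": corpus,
            "medium": parts[1],
            "title": title_year.group(1),
            "year": int(title_year.group(2)),
            "number": parts[3],
        }
        if len(parts) == 5:
            fields["nested_number"] = parts[4]
        return fields

    # Other texts: five fields, plus an optional second classification pair.
    if len(parts) not in (5, 7):
        raise ValueError("unexpected number of fields in " + code)
    if corpus == "CR":
        names = ("medium", "superfield", "thematic_area")
    else:
        names = ("period", "genre", "subgenre")
    fields = {"corpus": corpus}
    fields.update(zip(names, parts[1:4]))
    fields["number"] = parts[4]
    if len(parts) == 7:
        fields["second_" + names[1]] = parts[5]
        fields["second_" + names[2]] = parts[6]
    return fields


if __name__ == "__main__":
    for example in ("CR.L.1.01.001.2.01", "CR.P.PA1995.001.001",
                    "CO.M.11.A11.001"):
        print(example, parse_identification_code(example))
```

Because the corpus prefix is always the first field, a check of this kind also makes the requirement of section 2.3.1 explicit: a CREA code and a CORDE code can never be equal.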
The header code is intended to link the header to the texts of the corpus it corresponds to. Since the TEI header can be presented as an independent file, a code is needed to link the bibliographical information to the texts. These are the references contained within the TEI headers of the TEI.2 documents:

Within the element teiHeader, the reference starts with "th", followed by the code of the text. Examples:

thade001 thcas001 thvo1995.001 thvo1995.001.001

Within the element fileDesc, the reference starts with "fd", followed by the code of the text. Examples:

fdade001 fdcas001 fdvo1995.001 fdvo1995.001.001

Within the element sourceDesc, the reference starts with "sd", followed by the code of the text. Examples:

sdade001 sdcas001 sdvo1995.001 sdvo1995.001.001

Within the element encodingDesc, the reference starts with "ed", followed by the code of the text. Examples:

edade001 edcas001 edvo1995.001 edvo1995.001.001

Within the element revisionDesc, the reference starts with "rd", followed by the code of the text. This part is not yet developed, since no text has been revised so far. Examples:

rdade001 rdcas001 rdvo1995.001 rdvo1995.001.001

This aspect will be developed below in more detail.

2.3.4. References of structural or non-structural textual markup

Some elements that occur within the texts also have identifiers or number attributes: pb, hand, note.

3. Classification of the texts

3.1. Text typologies in CREA and CORDE

There are several taxonomies that classify the texts of the two corpora according to different parameters. The corpus CREA has three taxonomies, called "crea", "medio" and "oral". The corpus CORDE has four: "corde", "medio", "modal" and "epoc". The taxonomy "crea" of the corpus CREA classifies texts into superfields and thematic areas (see figure below). The taxonomy "corde" of the corpus CORDE also classifies the texts, but with other criteria: genre and subgenre, instead of thematic area.
The most important taxonomies are "crea" and "corde". The taxonomy "oral", which differentiates types of oral texts, is not yet developed. All three agree with the design principles of the corpora, since the categories are the basis for sampling and organization of the texts.

Fig. 4. Example of the taxonomy "crea":

Ciencias y tecnología
    Biología
    Veterinaria
    Ecología
    Tecnología
    Física
    Agricultura, ganadería, pesca
    Meteorología
    Redes de comunicación
    Geología
    Química
    Informática
Ciencias sociales, creencias y pensamientos
    Religión
    Lingüística, Lenguaje
    Historia
    Sociología
    Literatura
    Memorias, testimonios
    Erotismo, sexología
    Psicología
    Ética
    Geografía
    Problemática social
    Civilización, etnología
    Antropología
    Mitología
    Folklore
    Educación
    Mujer

Fig. 5. Example of the taxonomy "corde":

Prosa
    Prosa lírica
    Prosa narrativa
        Prosa narrativa breve
            Relato breve tradicional
            Relato breve culto
        Prosa narrativa extensa
            Relato extenso novela y otras formas similares
            Relato extenso diálogo y miscelánea
        Otros
[other categories]

The other taxonomies differentiate fewer categories. Thus "medio" differentiates four categories in CREA and only three in CORDE; "modal" is exclusive to CORDE and opposes verse to prose; "epoc" is used only in CORDE, and differentiates Middle Ages, Golden Age and Contemporary texts.

The classification codes assigned to each text are declared in an element of the profileDesc of the TEI header, called textClass. This is the way the categories are declared:

Fig. 6. [content not preserved in this copy]

As the figure shows, the element textClass contains several catRef elements, each of them specified by scheme and target. These two attributes associate a value (target) with a particular taxonomy (scheme). There are other classification systems applied to the corpus, like the ISBN, the ISSN or the Spanish "Depósito Legal". These values are declared within the sourceDesc, in an element called idno. This element has a type attribute, which specifies the category. Example:

Fig. 7 (the tags are not preserved in this copy; only the element contents remain):

Code ISBN
Code ISSN
Code Depósito Legal

3.2. Movement of texts from CREA to CORDE

As said before, CREA is a monitor corpus that periodically passes its oldest texts to CORDE. The need to move texts from one corpus to the other makes it necessary to define a system that converts the old codes of a text into the new ones. The system we have designed consists of associating a CORDE code with each CREA text in an idno element, as the figure shows:

Fig. 8 (the tags are not preserved in this copy): CORDE code assigned to a CREA text

This assignment is made at the same time as the TEI header of the text is written. This ensures that the movement of texts can be made automatically. The only thing a program must do is to change the identifying reference of the text into the value this idno element used to have. The result is a text with a new identifying reference and a new idno value, which is the original identifying reference in the corpus CREA.

4. Bibliographical information included in the TEI header and indexed in the data base (COSMAS 2.0)

4.1. Elements of the TEI header structure defined in the TEI Guidelines

4.1.1. TEI header of the corpus

Some general information is declared within the TEI header associated to the whole corpus. The main items are distributed among the following elements: fileDesc, encodingDesc, and revisionDesc. The fileDesc indicates the title, the responsible party, the edition number, the publication status and the extent of the corpus. The encodingDesc informs about the aims of the corpus as a project, the sampling principles, the editorial principles on correction, quotation, hyphenation, segmentation and interpretation, and finally about the taxonomies used to classify texts. The revisionDesc, which has not been developed yet, will inform about revisions of the corpus.

4.1.2. TEI header of each text

Each text has its own TEI header, as a way to introduce some bibliographical information concerning the electronic texts and their source editions.
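Since each text carries its own header, the movement of texts described in section 3.2 can be sketched on a single text record. The following Python fragment is only an illustration: the dictionary layout, the function name and the idno type "ccrea" used to keep the former CREA reference are hypothetical, and the pairing of the two example codes is invented for the demonstration.

```python
def move_to_corde(text):
    """Return a copy of a CREA text record relabelled for CORDE.

    `text` is a hypothetical record with two keys:
      "id"   - the identification reference of the text
      "idno" - a dict mapping idno types to values; a CREA text is
               assumed to carry its future CORDE code under "ccorde"
    """
    idnos = dict(text["idno"])          # work on a copy of the idno values
    corde_code = idnos.pop("ccorde")    # code prepared when the header was written
    idnos["ccrea"] = text["id"]         # keep the original CREA reference
    return dict(text, id=corde_code, idno=idnos)


if __name__ == "__main__":
    crea_text = {
        "id": "CR.L.1.01.001",
        "idno": {"ccorde": "CO.M.11.A11.001"},
    }
    print(move_to_corde(crea_text))
```

The swap needs no information beyond the header itself, which is why assigning the CORDE code when the header is written makes the later movement automatic.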
The fileDesc includes information about the electronic text, such as title, edition number and extent, and also information about the source edition of the text, that is to say, the paper or electronic version of the text included in the corpus.

In unitary texts, the source description gives the title, the author, the editor, the publisher, the place and date of publication, the year of the first edition, the number of pages, some classification codes, such as isbn, dl and ccorde (for CREA texts), and some data concerning the series a text may belong to. There is also a field for notes.

In composite texts, several source descriptions are needed: one for the whole composite text (monogr), and one for each nested text (analytic followed by a monogr element). Within each source description of a composite text, the data provided are more or less the same as for a unitary text, although there are some changes. The identifiers differentiate the several source descriptions and link them to the part of the text they correspond to (see above).

The encodingDesc declares the tags actually used in the text, shows the location references that this text will have, and describes the profile of the text, that is to say, some creation details, the language used in it, the classes the text belongs to according to the taxonomies defined in the TEI header, the list of hands that take part in the text, and some other details if the text is spoken.

The revisionDesc will describe the changes made to the text, once the corpus is compiled and revision starts.

4.2. Elements added to the TEI header structure

All the elements described follow the TEI scheme. However, it has been necessary to modify the Guidelines at certain points, in order to include some additional data which are considered important for the purposes of the corpus.
First of all, some data about the origin, the country and the sex of the author, provided they are known, have been added by means of an idno element within the sourceDesc, as the next figure shows:

Fig. 9 (the tags are not preserved in this copy): (Spanish or Latin-american??)

Secondly, the date of nested texts is considered important, particularly when there is a long chronological distance between nested texts belonging to the same composite text:

Fig. 10
<date>Date of the nested text</date>

When it is preferable to classify nested texts by their own date, instead of taking the date of the composite text, nested texts are treated as independent texts belonging to the same collection.

Another unsolved problem in the TEI is the indexing of the date of creation of a text when it is the traditional reference date for it. The solution for CREA and CORDE is to interpret date n=1.0 as the date of the first edition or the date of creation of the text, depending on which is the reference date. In this way only one original date (date n=1.0) and one date of the source edition (date n=x.y) are indexed, and the two may or may not coincide. Any other explanation can be given within the element creation or in a note.

5. Types of structural information included in the texts

The main problem of the structural markup of the texts is that there are very different types of text within the corpus. This is the reason why the DTD of these corpora does not choose only one possible structure, such as prose, verse or drama, but introduces elements from all of them, making many different combinations acceptable. This encoding scheme is similar to the one found in TEI Lite, although it adds some elements not included in that DTD.

The main structural division of a text is the unit div, numbered from 1 to 7 to indicate the division level. The element text may or may not have divisions, depending on its internal organization. It is always necessary to adapt to the original structure of the text.

6. Types of non-structural information included in the texts

A text of CORDE can be in prose or in verse. It will select s or l units depending on this condition. In CREA this problem does not exist, since there is no verse. Drama and spoken texts can also be found, which are quite special from the structural point of view. And, of course, there can be very different mixtures, like prose within verse or written parts in spoken texts. In these cases, a particular division of a text changes the elements used in other divisions to meet the requirements of the new text. As can be seen, text modalities are not treated as watertight compartments.

Basically, these are the main non-structural elements found in the texts of the corpora:

p and s in prose texts;
l in verse texts;
u and s in spoken texts;
sp, p and s in drama texts.

Apart from these elements, there are other kinds of non-structural markup within the texts. First of all, there are some elements used to separate highlighted parts of a text from the rest. There have been some changes on this point: at the beginning all highlighted expressions were interpreted, and now some of them are only treated as emphatic (emph) elements. This change has been motivated by the need to finish the first stage of these corpora by the end of 1997. In the next stage, some of the emph elements will be differentiated according to the previous encoding proposal.

The categories for highlighted text distinguished now are the following: cit, quote, emph, q.

Other SGML and TEI elements found in the texts of the corpora are these: abbr, note and anchor, list, table, formula, caption, corr, sic, add, del, restore, gap, supplied.

It has been necessary to change the TEI scheme for the encoding of some tables, since it proved complicated and slow. Some tools automatically convert tables into the TEI scheme.
However, it is very common to find complicated tables, especially in old texts, that do not fit the typical grid. The correct markup of these special cases requires too much time and effort. Consequently, a small change has been introduced: the element table can have plain text as content.

Apart from the tags we have described, there are others that can be found only in certain types of text. Thus, in spoken texts, there are these special tags:

u, for each utterance;
pause, for pauses;
vocal, for expressions that are not lexical units, but communicate something;
kinesic, for gestures or movements of the participant;
event, for noises or other non-communicative events.

In dramatic texts, there can also be a cast list before the play itself. This unit requires tags such as castList, castItem, castGroup, role, and roleDesc. In the text of the play, elements such as sp, speaker and stage are very common.

7. Conversion of the text to a plain text format

7.1. Treatment of hard, soft and end-of-line hyphens

In the texts of the corpora, end-of-line hyphens are suppressed. Accordingly, a line like this one:

Fig. 11
¿Habrá en este comentario una crítica velada a mi apariencia?, pensó Onofre Bouvila al oír lo que decía el señor Braulio. Aunque la actitud cordial del fondista parecía desmentir esta suposición, la susceptibilidad de Onofre Bouvila estaba plena-
mente justificada

will be encoded like this:

Fig. 12
¿Habrá en este comentario una crítica velada a mi apariencia?, pensó Onofre Bouvila al oír lo que decía el señor Braulio. Aunque la actitud cordial del fondista parecía desmentir esta suposición, la susceptibilidad de Onofre Bouvila estaba plenamente justificada.

Hard hyphens, such as the ones found in compound words, intervals or dates, are preserved as normal hyphens. Therefore, expressions like the following will appear in the corpus this way:

Fig. 13
Compound words:
Original text: El vestíbulo era pequeño: sólo cabían allí un mostrador de madera clara con su escribanía de latón y su libro-registro.
Encoded text: El vestíbulo era pequeño: sólo cabían allí un mostrador de madera clara con su escribanía de latón y su libro-registro.
Dates: 2-5-95
Intervals: pages 2-20

Soft hyphens used in direct speech or thought are replaced by the tag q. Example:

Fig. 14 Original text:
Es este barrio ruin lo que nos obliga a poner unos precios muy por debajo de la categoría del establecimiento -se lamentó.

Fig. 15 Encoded text (the q tags are not preserved in this copy):
Es este barrio ruin lo que nos obliga a poner unos precios muy por debajo de la categoría del establecimiento se lamentó.

There are also soft hyphens that correspond to list items. These are replaced by the tag item. The rest of the soft hyphens, which indicate the beginning of a parenthetical comment, are replaced by low hyphens followed by a space, so that they can be distinguished from hard hyphens.

7.2. Treatment of quotation marks, italics, bold face, small capitals, capitals or underlined characters

Any highlighted piece of text will be encoded with the tag emph, unless it is a quotation (quote, cit) or direct speech (q). This implies that the quotation marks will be suppressed, since the information is preserved by other means. No difference will be made between types of rendition. There is an attribute available for this purpose, but it requires manual intervention to some extent, since there are several typographical combinations. Examples (the tags are not preserved in this copy):

Rosa, sueño de nadie bajo tantos párpados, escribe Rilke.
Todo joven es un parvenu de la fisiología.
Me interesa menos el habla del conjunto de la población que lo que podríamos llamar escribidura particular de ese pequeño sector de hombres públicos.
Todos han venido esta tarde

There is a program that introduces the tag emph wherever there is any special typographical rendition.
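The behaviour of such a program can be sketched as follows. This is a minimal illustration in Python, not the program used for the corpora: the input format, a list of (text, styles) runs in which styles is a set of typographic flags delivered by an earlier stage, is an assumption, and so is the choice of the highlighted word in the usage example.

```python
def tag_emph(runs):
    """Wrap every run that has a special rendition in an emph element.

    `runs` is a hypothetical list of (text, styles) pairs; `styles` is a
    set of flags such as {"italic"} or {"bold"}. The kind of rendition is
    deliberately ignored: every highlighted run becomes emph.
    """
    pieces = []
    for text, styles in runs:
        if styles:
            pieces.append("<emph>" + text + "</emph>")
        else:
            pieces.append(text)
    return "".join(pieces)


if __name__ == "__main__":
    # Which word was highlighted in the printed source is a guess.
    runs = [
        ("Todo joven es un ", set()),
        ("parvenu", {"italic"}),
        (" de la fisiología.", set()),
    ]
    print(tag_emph(runs))
```

Ignoring the kind of rendition keeps the automatic pass simple and leaves the finer distinctions to the human reviser.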
The manual correction of the text allows the replacement of emph by quote, cit or q whenever necessary. As will be shown below, two copies of the text are saved: an already revised copy is saved in a non-ISO 646 character set, without SGML markup, and the copy that will form part of the corpus is saved as fully encoded 7-bit text.

7.3. Treatment of characters not included in the set ISO 646

Characters not belonging to ISO 646 are not valid for interchange. For this reason, it is necessary to convert them into another format. The standard writing system declaration useful for Spanish is the entity set known as "Latin-1", described in ISO 8859-1. In the corpora, an automatic conversion is always made from the special Spanish orthographical signs to the standard SGML entities. This process operates once the text has been revised.

8. Technical issues: mechanism to introduce markup in the texts: people, hardware, software

The CREA and CORDE research group has 14 people working in the text input department. This section is in charge of the input and encoding of the texts of the corpus.

The compiling process has several stages. First of all, there is a conversion of medium, from paper to electronic format, by means of OCR. The second stage is the automatic introduction of s, p, pb and emph tags. The third step requires more human than machine work, since it is the revision and correction of the texts after the first two processes. After the correction, in the fourth stage, all the non-ASCII characters are automatically converted into their ISO 8859-1 (Latin-1) entities, so that the text can be exported to an SGML editor and interpreted according to the DTD of the corpus. In the fifth stage, new bibliographic, structural and non-structural markup is introduced within the SGML editor. Once this is finished, the SGML text is validated and stored. After this process, there is an automatic tagging of the texts, which is followed by a new validation.
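Two of the automatic conversions mentioned so far, the suppression of end-of-line hyphens (section 7.1) and the replacement of non-ASCII characters by entities (section 7.3), can be sketched in Python as follows. The function names are illustrative. The entity names are borrowed from Python's HTML entity table, on the assumption that its Latin-1 names (such as oacute or ntilde) coincide with the SGML entity names used in the corpora.

```python
import re
from html.entities import codepoint2name


def join_eol_hyphens(text):
    """Suppress end-of-line hyphens by joining the two halves of the word.

    Only a hyphen directly followed by a line break is removed, so hyphens
    inside a line ("libro-registro", "2-5-95") are left untouched. A hard
    hyphen that happens to fall at the end of a line would also be removed;
    telling the two cases apart is left to manual revision.
    """
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def to_sgml_entities(text):
    """Replace every non-ASCII character by an entity reference."""
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128:
            pieces.append(char)                      # 7-bit characters stay as they are
        elif code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            pieces.append("&#" + str(code) + ";")    # fallback: numeric reference
    return "".join(pieces)


if __name__ == "__main__":
    line = "Bouvila estaba plena-\nmente justificada"
    print(to_sgml_entities(join_eol_hyphens(line)))
    print(to_sgml_entities("esta suposición"))
```

Running the hyphen step before the entity step matters in this sketch, because the regular expression looks for word characters on both sides of the line break and an entity reference would hide them.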
Syntactic analysis will be added with a parser, just before the last validation of the text.

9. Conclusion

This paper has tried to show that the TEI scheme is very suitable for encoding large amounts of electronic text, as in the case of the Spanish corpora. The TEI provides encoding solutions for many different types of application, but it is almost impossible to use the whole tag set in a particular text or collection of texts. The use of the TEI requires a thorough analysis of its principles and the selection of a reduced tag set, according to the purposes of the text that is going to be encoded.

Some aspects of the encoding principles of the large Spanish corpora have been described, such as the structure of the textual documents, the classification and reference systems, the bibliographical information stored and processed, the internal tags, the movement of texts from one corpus to the other, and the conversion of texts to 7-bit ASCII format. All the solutions to encoding problems have followed TEI principles to some extent, although some of the problems are not treated in the TEI Guidelines.

It has been necessary to develop some tools to make the markup process easier. All the tags that can be reduced to formal rules are being introduced automatically. The selection of tags should always take into account the cost in time and effort of introducing them in the texts. A balance between information retrieval possibilities and markup effort should be found. Similarly, a good SGML parser and data base should be used to edit, store and retrieve encoded information. If all these conditions are fulfilled, the TEI scheme is a good standard basis for the edition and interchange of electronic texts.

10. References

Bryan, M. (1988). SGML: An author's guide. New York: Addison-Wesley.

Burnage, G., Dunlop, D. (1993). "Encoding the British National Corpus", in J. Aarts, P. de Haan and N. Oostdijk (eds.), English Language Corpora: Design, Analysis and Exploitation, Amsterdam: Rodopi.

Burnard, L., Sperberg-McQueen, C. M. (1995). TEI Lite: An Introduction to Text Encoding for Interchange. Document No: TEI U 5, Groningen: Groningen University.

Burnard, L. (1992). "The Text Encoding Initiative: a progress report", in G. Leitner (ed.), New Directions in English Language Corpora. Methodology, Results, Software Developments, Berlin: Mouton de Gruyter.

Burnard, L. (1995). "The Text Encoding Initiative: an overview", in Leech, G., Thomas, J. (eds.), Spoken English on Computer: Transcription, Markup and Applications, Harlow: Longman.

Burnard, L. (1987). "CAFS: a new solution to an old problem", in W. Meijs (ed.), Corpus Linguistics and Beyond. Proceedings of the Seventh International Conference on English Language Research on Computerized Corpora, Amsterdam: Rodopi.

Cover, R. (1991). "The progress of SGML (Standard Generalized Markup Language): extracts from a comprehensive bibliography", Literary & Linguistic Computing, 6/3, 197-209.

Goldfarb, C.F. (1990). The SGML Handbook. Oxford: Clarendon Press.

Ide, N., Véronis, J. (1994). "Corpus Encoding", EAGLES Document EAG-CSG/IR-T2.1, in EAGLES Interim Report.

Ide, N., Véronis, J. (1995). The Text Encoding Initiative: Background and contexts. Computers and the Humanities, 29, 1-3.

Johansson, S. (1994). "Continuity and change in the encoding of computer corpora", in N. Oostdijk, P. De Haan (eds.), Corpus-based Research into Language, Amsterdam/Atlanta: Rodopi.

Johansson, S. (1993). "Some aspects of the recommendations of the Text Encoding Initiative, with special reference to the encoding of language corpora", in M. Kytö, M. Rissanen, S. Wright (eds.), Corpora Across the Centuries. Proceedings of the First International Colloquium on English Diachronic Corpora, 25-27 March 1993.

Pino, M. (1996). Manual de codificación textual para los corpus CREA y CORDE. Normas de marcación en SGML según las recomendaciones de la TEI. Versión 1.0. Internal document of the Lexicographic Institute of the Spanish Royal Academy.

Pino, M. (1996). 'Document Type Definition' para los corpus CREA y CORDE. Versión 1.0. Internal document of the Lexicographic Institute of the Spanish Royal Academy.

Sperberg-McQueen, C.M., Burnard, L. (eds.) (1994). Guidelines for Electronic Text Encoding and Interchange. TEI-P3. Chicago / Oxford: Text Encoding Initiative.

Van Herwijnen, E. (1994). Practical SGML. Boston: Kluwer Academic Publishers.