SGML/XML: Academic Applications. Contents.
- TEI: Text Encoding Initiative
- Oxford Text Archive (OTA)
- MARC (MAchine Readable Cataloging) and SGML/XML
- University of Virginia Electronic Text Center
- The Electronic Archive of Early American Fiction (UVA)
- University of Michigan - Humanities Text Initiative (HTI)
- The HTI American Verse Project
- HTI - Middle English Compendium
- Making of America (MOA) Project - University of Michigan and Cornell University
- Model Editions Partnership: Historical Editions in the Digital Age
- American Memory Project, Library of Congress
- Brown University Scholarly Technology Group (STG)
- The Brown University Women Writers Project
- Midrash Pirqe Rabbi Eliezer Electronic Text Editing Project
- University of Cincinnati College of Law, Center for Electronic Text in the Law
- British National Corpus Project (BNC)
- Linguistic Data Consortium (LDC)
- IATH - Institute for Advanced Technology in the Humanities, University of Virginia at Charlottesville
- IATH: Piers Plowman Database (Hoyt Duggan)
- IATH: Rossetti Archive (Jerome J. McGann)
- IATH: William Blake Archive
- National Institute of Japanese Literature
- Japanese Text Initiative (University of Virginia and the University of Pittsburgh)
- CETH: Center for Electronic Texts in the Humanities
- Electronic Text Centre (ETC), University of New Brunswick Libraries
- Les Presses de l'Université de Montréal
- The Canterbury Tales Project
- University of Pittsburgh Electronic Text Project
- Georgetown University: Labyrinth Medieval Studies and Peirce Projects
- Project Opéra (Outils pour les documents électroniques, recherche et applications)
- MULTEXT (Multilingual Text Tools and Corpora) and MULTEXT-EAST (Multilingual Texts and Corpora for Eastern and Central European Languages)
- EAGLES Initiative (Expert Advisory Group for Language Engineering Standards)
- Corpus Encoding Standard (CES)
- European Corpus Initiative (ECI)
- Centro Ricerche Informatica e Letteratura (CRILet)
- Thesaurus Musicarum Italicarum
- Language Technology Group (LTG), Human Communication Research Centre (HCRC), University of Edinburgh
- The HCRC Map Task Corpus
- Lingua Parallel Concordancing Project
- University of Waterloo Centre for the New OED and Text Research
- University of Waterloo English Department - Technical Writing Course Using SGML
- Indiana University: LETRS Services
- Indiana University: Victorian Women Writers Project
- Encoded Archival Description (EAD) and Finding Aids Projects
- Berkeley Digital Library SunSITE [Formerly: Berkeley Finding Aid Project]
- Berkeley Art Museum/Pacific Film Archive
- Electronic Binding Project (EBIND) - UC Berkeley Digital Page Imaging and SGML
- American Heritage Virtual Archive Project
- Yale University Library EAD Finding Aids Project
- University of Iowa Library, Iowa Women's Archives
- California Heritage Digital Image Access Project
- Harvard/Radcliffe Digital Finding Aids Project (DFAP)
- Research Libraries Group (RLG) FAST Track: Finding Aids SGML Training
- Cheshire II Project and SGML (UC Berkeley)
- UCSD Archives Finding Aids Database (UC, San Diego)
- Durham University Library - EAD Finding Aids
- Consortium for Interchange of Museum Information (CIMI)
- Project AQUARELLE
- Project Silfide (Serveur Interactif pour la Langue Française, son Identité, sa Diffusion et son Étude)
- Network of Literary Archives (NOLA)
- Duke University: Special Collections Library, SGML Finding Aids
- University of Warwick Modern Records Centre, Finding Aids Project
- Project ELSA (Electronic Library SGML Applications)
- UCLA - InfoUCLA Project, including ICADD SGML
- Representative Poetry Project: University of Toronto
- Stanford University - Academic Information Resources (AIR) and Academic Text Service (ATS)
- Project PREMIUM (PRoduction of Electronic Materials through International and Uniform Methods)
- CELT (Corpus of Electronic Texts) - University College Cork [was:CURIA Project]
- Chadwyck-Healey: English Poetry Full-Text Database, Patrologia Latina [other full-text databases]
- Cambridge University Press Electronic Editions
- The Electronic Arden Shakespeare: Texts and Sources for Shakespeare Studies
- University of North Carolina at Chapel Hill. Documenting The American South: The Southern Experience in 19th Century America
- English-Norwegian Parallel Corpus Project
- ETAP - Uppsala University Parallel Corpus Project
- Electronic Thesis and Dissertation Project
- Electronic Theses and Dissertations: Additional Materials
- Princeton University: The Charrette Project
- RIDDLE Project - Rapid Information Display and Dissemination in a Library Environment
- Department of Computer Science and Information Systems University of Jyväskylä
- Katholieke Universiteit Leuven - Document Architectures Research Unit
- Electronic Thesaurus Linguae Latinae
- University of Helsinki - Document Management Research Group
- Project ELVYN: Implementing an Electronic Version of a Journal
- HyperLib: Hypertext Interfaces to Library Information Systems
- Electronic New Testament Manuscript Project
- UMI (University Microfilms International)
- Perseus Project
- University of Bergen (Wittgenstein Archives)
- The Orlando Project: An Integrated History of Women's Writing in the British Isles
- British Women Romantic Poets Project
- SBL Seminar on Electronic Standards for Biblical Language Texts
- Hebrew Syntax Encoding Initiative
- Archivio Testuale Multimediale (ARTEM) Project
- Université de Montréal (EBSI-GRDS)
- GATE (General Architecture for Text Engineering) Project [Sheffield]
- The LEGEBIDUNA Project (Universidad de Deusto)
- SETIS: Electronic Texts at the University of Sydney Library
- OUCS
- Falch Research Projects
TEI: Text Encoding Initiative
[CR: 20010413] [Table of Contents]
Description
[June 30, 1999] On the XML version of the TEI DTD and TEI events after 1999-06, see "Text Encoding Initiative (TEI) - XML for TEI Lite."
The TEI (Text Encoding Initiative) has developed an SGML encoding for a wide range of document types in the domain of humanities computing. The Text Encoding Initiative is an international research project sponsored by the Association for Computing in the Humanities (ACH), the Association for Literary and Linguistic Computing (ALLC), and the Association for Computational Linguistics (ACL). Funding has been provided in part by the US National Endowment for the Humanities, Directorate XIII of the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Sciences and Humanities Research Council of Canada. The TEI ("P3") Guidelines were published in May 1994, after six years of development involving many hundreds of scholars from different academic disciplines worldwide. They are available in print copy, in searchable/linked format on CDROM (see also the announcement), or on the Internet in plain text format. An overview of the TEI's origins and goals is given in "Text Encoding for Information Interchange. An Introduction to the Text Encoding Initiative" (TEI Document no TEI J31, by Lou Burnard, July 1995). See also: "An Introduction to the Text Encoding Initiative" (TEI EDW26, by Lou Burnard), available on the OTA FTP server, from the UICVM Listserver, or from the SIL WWW server.
The authoritative FTP site for TEI DTDs, Writing System declarations, and documentation is TEI: FTP to UIC. The TEI P3 DTDs are also stored on the OTA FTP server. Encoding guidelines have been published as Guidelines for Electronic Text Encoding and Interchange. TEI P3, May 1994, edited by Michael Sperberg-McQueen and Lou Burnard. See the bibliographic reference for full details. The current draft of the Guidelines is thus sometimes identified as "P3" ("P2" and "P1" represent earlier drafts). Instructions for ordering the P3 Guidelines are available here, or by requesting the file 'P3ORDER DOC' from the UICVM LISTSERVer, using syntax described below. The TEI FTP server also contains a number of resources relating to the production and maintenance of the TEI Guidelines: TEI working papers, TEI organization & personnel, TEI introductions & tutorials, TEILITE introduction & DTDs, Model Editions Partnership (a TEI application), Explanation of TEI tagset for producing TEI P3, Information on TEI P1, P2, and P3, TEI & SGML resources, etc. Information on the literate programming style used to produce the TEI P3 DTDs and documentation is found in the subdirectory ftp://ftp-tei.uic.edu/pub/tei/odd/; [local archive copy].
Chapter 2 of the TEI Guidelines "A Gentle Introduction to SGML" is one of the best SGML introductions. It is available from the UIC TEI Web server, or from Oxford: http://sable.ox.ac.uk/ota/teip3sg/. It was translated into Russian by Boris Tobotras: HTML or SGML format, [local archive copy].
Canonical files for the Guidelines and other TEI research documents may also be obtained through the official mail server: listserv@listserv.brown.edu. To get a complete file listing of TEI materials, send the command INDEX TEI-L in the body of an email message to the LISTSERVer at this address. To subscribe to the TEI-L discussion forum, send the command SUBSCRIBE TEI-L YOUR-NAME to the LISTSERVer (where 'YOUR-NAME' is your personal name).
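As a rough illustration of the LISTSERV commands above, the following sketch builds (but does not send) the two email messages with Python's standard `email` library. Only the server address and the command strings come from the text; the sender address, the personal name, and the helper name `listserv_request` are hypothetical.

```python
from email.message import EmailMessage

LISTSERV_ADDRESS = "listserv@listserv.brown.edu"  # address given in the text


def listserv_request(command: str, sender: str) -> EmailMessage:
    """Build a message whose body is a single LISTSERV command."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = LISTSERV_ADDRESS
    # The command goes in the body of the message, not in the subject line.
    msg.set_content(command)
    return msg


# Hypothetical sender and personal name, for illustration only.
index_msg = listserv_request("INDEX TEI-L", "reader@example.org")
subscribe_msg = listserv_request("SUBSCRIBE TEI-L Jane Reader", "reader@example.org")
print(index_msg.get_content().strip())
print(subscribe_msg.get_content().strip())
```

Actually sending the messages would require a mail transport (for example `smtplib`), which is left out here on purpose.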
Courses and seminars on TEI encoding are offered periodically at various universities. For 1997, see the announcement from Lou Burnard for TESS: The Text Encoding Summer School, sponsored by The Humanities Computing Unit at Oxford. The course will be held at Oxford University, 8 - 11 July, 1997.
Conference entry for the TEI 10th Anniversary User Conference, November 14 - 16, 1997. Brown University, Providence, Rhode Island, USA. See: General Information
TEI and the MLA. See the announcement from Charles Faulhaber (University of California, Berkeley) for the publication of the MLA (Modern Language Association of America) draft Guidelines for Electronic Scholarly Editions. Highlights from the draft: "B. Encoding norms. It is preferable to use the implementation of Standard Generalized Markup Language (SGML) specifically devised for coding electronic texts, the Text Encoding Initiative (TEI). The choice of an alternate standard should be fully justified and explained. C. The text itself should be essentially self-describing, which means that the computer file which embodies it should contain a header with essential meta-data. The Guidelines for Electronic Text Encoding and Interchange (TEI P3), edited by C.M. Sperberg-McQueen and Lou Burnard (1994) offer detailed descriptions of the sorts of information that should be provided for the source document as well as the electronic text itself." See: Guidelines for Electronic Scholarly Editions; [archive copy, August 12, 1997]
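To make the "self-describing" requirement more concrete, the sketch below embeds a minimal TEI-style document in Python and reads the title and source description back out of its header. The sample content is invented, and the element names follow the P3 header layout as it is commonly documented, so they should be checked against the Guidelines. The sample is written as well-formed XML so that the standard library can parse it; an actual P3 file would be SGML validated against the TEI DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample: a header carrying the metadata, followed by the text itself.
SAMPLE = """\
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Poem: an electronic transcription</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a hypothetical 1850 first edition.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The text of the work goes here.</p>
    </body>
  </text>
</TEI.2>
"""

root = ET.fromstring(SAMPLE)
# The header describes both the electronic file and its source document.
title = root.findtext("teiHeader/fileDesc/titleStmt/title")
source = root.findtext("teiHeader/fileDesc/sourceDesc/p")
print(title)
print(source)
```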
TEI Monographs and Journal Special Issues. Several monographs and journal special issues have been dedicated to the Text Encoding Initiative's encoding guidelines. See, for example:
- See [the bibliographic reference for]: Ide, Nancy; Véronis, Jean, (editors, with a volume preface by Charles F. Goldfarb and volume bibliography by Robin C. Cover). The Text Encoding Initiative: Background and Context. Dordrecht, Netherlands: Kluwer Academic Publishers, [August] 1995. Extent: vi + 242 pages. ISBN: 0-7923-3689-5 (hardbound); 0-7923-3704-2 (paperback). Also published as a three-part special issue of CHUM.
- TEXT Technology special issue, edited by . Details: Electronic Texts and the Text Encoding Initiative. A Special Issue [5.3] of 'TEXT Technology: The Journal of Computer Text Processing'. Madison, SD: College of Liberal Arts, Dakota State University, [George M. and Merrill D. Hunter Electronic Publishing Center], Autumn, 1995. ISSN: 1053-900X.
- Announcement from Nancy Ide for a special issue of Cahiers GUTenberg dedicated to the Text Encoding Initiative. Number 24 of Cahiers GUTenberg is a 251-page issue containing eleven articles on TEI, all in French. The full text of the issue is now available at the following web site: http://www.univ-rennes1.fr/pub/GUTenberg/publications/. Table of Contents, local archive copy.
- Bibliographic entry for the volume edited by Daniel I. Greenstein. Modelling Historical Data: Towards a Standard for Encoding and Exchanging Machine-Readable Texts. Halbgraue Reihe zur Historischen Fachinformatik, Serie A, Historische Quellenkunden, edited by Manfred Thaller, Band (A) 11. St. Katharinen: [Published for the Max-Planck-Institut für Geschichte, Göttingen by] Scripta Mercaturae Verlag, 1991. Extent: iv + 223 pages. ISBN: 3-928134-45-0.
TEI Lite
TEI Lite is a subset of the full TEI DTD. Researchers who may have been put off initially by the TEI's elaborate use of parameter entities and driver files (necessary to initialize a full TEI setup) should return now [July 1995] to have a look at the much simpler TEI Lite. It is a "small but usable subset of the TEI main DTD" that avoids some of the complexities in the full TEI DTD. The documentation for TEI Lite (in SGML and HTML format) is superb, making the TEI accessible to a much wider audience. Whereas the official P3 reference manual weighs in at 1200 pages, TEI Lite is nicely presented in a 200K HTML document.
From the document which describes TEI Lite, cited below: "This document provides an introduction to the recommendations of the Text Encoding Initiative (TEI), by describing a manageable subset of the full TEI encoding scheme. The scheme documented here can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different computer systems. It is also fully compatible with the full TEI scheme, as defined by TEI document P3, Guidelines for Electronic Text Encoding and Interchange, published in Chicago and Oxford in May 1994."
- See: TEI Tutorials and Introductory Materials.
- "TEI Lite: An Introduction to Text Encoding for Interchange"
- TEI FPIs - URL for the TEI DTD public identifiers [ftp://ftp-tei.uic.edu/pub/tei/p3/dtd/fpi/]. The main driver file is: "-//TEI P3//DTD Main Document Type 1995-09//EN".
- (Provisional, May 1996): set of TEI FPIs (formal public identifiers). See also the announcement; or mirror collection
- TEI Lite DTD from UIC (USA)
- TEI Lite DTD from OTA (UK)
- "Bare Bones TEI: A Very Very Small Subset of the TEI Encoding Scheme"
- TEI Lite Introduction in SGML format
- Database entry for the Text Encoding Initiative (TEI) in the XML section
- Patrice Bonhomme described a private "XML release of the TEI Lite DTD" on December 01, 1997. "...not an official release of the TEI Lite [XML DTD]. A lot of things remains to be done. . ."
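The formal public identifiers (FPIs) listed above follow a fixed shape defined by SGML: an owner prefix, an owner name, a public text class with a description, and a language code, separated by `//`. The sketch below splits the main driver FPI quoted in the list into those parts. It handles only this simple shape (not ISO-owned identifiers or display versions), and the names `FPI` and `parse_fpi` are hypothetical helpers introduced for illustration.

```python
from typing import NamedTuple


class FPI(NamedTuple):
    registered: bool   # "+" marks a registered owner, "-" an unregistered one
    owner: str
    text_class: str    # for example DTD, ENTITIES, ELEMENTS
    description: str
    language: str


def parse_fpi(fpi: str) -> FPI:
    """Split an FPI of the form PREFIX//OWNER//CLASS DESCRIPTION//LANG."""
    parts = fpi.split("//")
    if len(parts) != 4 or parts[0] not in ("+", "-"):
        raise ValueError(f"unsupported FPI shape: {fpi!r}")
    prefix, owner, text_id, language = parts
    # The first word of the text identifier is the public text class.
    text_class, _, description = text_id.partition(" ")
    return FPI(prefix == "+", owner, text_class, description, language)


main = parse_fpi("-//TEI P3//DTD Main Document Type 1995-09//EN")
print(main)
```

A catalog file, such as the SGML Open catalog mentioned elsewhere on this page, is what maps identifiers like this one to actual files on a local system.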
TEI: Primary WWW/FTP Sites
[CR: 19970207]
- The authoritative WWW site is http://www.uic.edu/orgs/tei/
- The authoritative FTP site is TEI: FTP to UIC
- A mirror site is: TEI: FTP to Exeter
- A mirror site is: TEI: FTP to Oslo
- A mirror site is: TEI: FTP to Chiba Univ.
- Provisional list of [possible] additional pointers from the BNC Project, TEI Page
- TEI-L and TEI-TECH shadow archive, (2244 messages about 1328 subjects). Created by Arjan Loeffen.
Useful links and prettier views of TEI documents
- TEI Application Page: Project Descriptions [added August 16, 1996]
- "TEI Lite" (smaller model): see ftp://ftp-tei.uic.edu/pub/tei/lite/lite14.dtd or ftp://ota.ox.ac.uk/pub/ota/TEI/dtd/teilite.dtd.
- Interactively search and browse the Guidelines for Electronic Text Encoding and Interchange (TEI P3), supported by the Humanities Text Initiative, University of Michigan. "The revised implementation provides slightly more elegent browsing capabilities and filtering of the text from SGML to HTML. A major feature has been added -- the ability to quickly lookup in Part 7 the description of an element, a parameter entity, or a element class. Links to the elements a particular tag may occur within or contain are provided as links at the bottom of the description. Other searches, including boolean and proximity, are also available."
- TEI: Browse P3 at UVA (Guidelines for Electronic Text Encoding and Interchange)
- TEI: TEI DTD (CETH's WWW server)
- TEI P3 filelist in HTML by Lars Aronsson
- A searchable version of P3 provided by Electronic Book Technologies
- TEI WAIS Search: Text Encoding Initiative (or perhaps: TEI WAIS Search), by Janne Himanka, University of Oulu
- TEI: P3 Files (Index of /TEI/)
- TEI (Information Sheet, CURIA)
- TEI Extended Pointers (Links) tutorial, by Lou Burnard. See also chapter 14 of the TEI Guidelines
- Guidelines for Text Mark-up at the (UVA) Electronic Text Center -- Based upon TEI SGML, and TEI-Lite
- "Textual Criticism and the Text Encoding Initiative", C. M. Sperberg-McQueen, Modern Language Association 1994, session sponsored by the Emerging Technologies Committee. [mirror copy]
- "The Description of Electronic Texts: The Text Encoding Initiative and SGML", Susan Hockey, Presented at Library of Congress Seminar on Cataloguing Digital Documents, October 12-14, 1994. [mirror copy, text only]
- SGML and TEI Resources (CETH)
- CETH Humanities Electronic Resources Center TEI Pilot Projects
- XML Version of the TEI DTD: On April 07, 1998, Lou Burnard wrote: "The TEI has recently chartered a workgroup on architectural issues (chaired by Frank Tompa), and one of its specific charges is precisely the development of an XML version of the full TEI dtd." [TEI-L posting] See also the preliminary unofficial work on an XML version of TEI Lite DTD.
TEI DTD and SoftQuad's Author/Editor
[CR: 19970617]
For handling of the TEI DTD subsets referenced by parameter entities:
- For use with RulesBuilder or mkrls (the batch version of Rulesbuilder) with the TEI DTD - from Liam Quin [June 1997]
- P3 (February 1995) modified for use with RulesBuilder - ftp://ftp.pitt.edu/dept/slavic/sgml/tei2rb.dtd (David J. Birnbaum); [mirror copy]
- Instructions for RB on the TEI site; see also TEI-L Discussion on Normalizing the TEI DTD
- Hints from Arjan Loeffen (from CTS)
- Hints from Peter Flynn
- Hints from Gregory J. Murphy
- Hints from Lou Burnard [June 1997: "use Richard Light's normalizer, NORMDTD"]
- Hints from Richard Light
- Hints from Carol Mah
TEI (Lite) DTD and Panorama [version 1.5], with Netscape
- Help file from CETH [mirror copy, September 1996]
- Posting by Wendell Piez, CETH
- See similarly: Notes on Panorama for users of the EAD DTD [principles applicable to other complex DTDs; mirror copy]
- See also TEI-L Discussion on Normalizing the TEI DTD
TEI DTD and WordPerfect (SGML Edition)
[CR: 19970718]
TEI and Other Software
[CR: 19981216]
In addition to the comments on TEI DTD configuration with specific software products (above), note the following TEI support tools and facilities:
- "Ebenezer's software suite for TEI." See the announcement from Kevin Russell (Linguistics, University of Manitoba) for a package of files and installation instructions for editing TEI documents with Emacs and PSGML. See the URL: http://www.umanitoba.ca/faculties/arts/linguistics/russell/ebenezer.htm. The package, entitled "Ebenezer's software suite for TEI," includes "the program files for Emacs, Lennart Staflin's PSGML package, James Clark's Jade engine and SP parser, the official files for the TEI (DTDs, entity files, WSDs), the catalogue files for making all of the above run hopefully transparently, and a short tutorial."
- [October 09, 1998] Apropos of managing DTD fragments, designing modularized DTDs, DTD subsetting, namespaces, (etc.), readers will be interested to survey Lou Burnard's Web page entitled The Pizza Chef: a TEI Tag Set Selector, recently referenced in an announcement. Lou Burnard (European editor for the Text Encoding Initiative Guidelines) has created the tool to help users design their own TEI-conformant document type definition. The TEI DTD itself is very large, but its modular construction and heavy use of 'classes' (defined in parameter entities) allow the user to select desired tag sets for a project and thus 'make up their very own view of the TEI DTD, including their own modifications and restrictions.' The Pizza Chef tool "allows you to select the TEI tagsets you want from a menu, and also to pick out individual elements for inclusion, exclusion, or modification. You can then download a customized DTD subset, or a completely compiled (i.e., non parameterized) DTD for use by e.g., SoftQuad's Rulesbuilder." Another strategy for subsetting large (complex, overly-general) DTDs uses architectural processing; see the abstract for the paper to be presented by Gary Simons at the November Markup Technologies '98 Conference
- [December 09, 1997] TEItools from Boris Tobotras, as described in a posting to TEI-L. "TEItools denotes my collection of scripts for transforming documents written in SGML to various output format. I'm in process of writing it now, and currently it is able to produce HTML, LaTeX2e, RTF, PS and PDF." See also the TEItools user guide (under development) and the local database entry for TEItools.
- Possibly useful SGML Open CATALOG for use with the TEI DTD and psgml/emacs [nsgmls], supplied by David Birnbaum. In this connection, note the URL for the TEI DTD public identifiers (FPIs), and the documentation.
- tei2latex, tei2html: [October 23, 1997] Announcement from Jean-Daniel Fekete (Ecole des Mines de Nantes) for the availability of TEI2LATEX and TEI2HTML version 0.2: 'Two Perl5 Programs to Translate TEI Lite Documents into LaTeX2e and HTML documents.' TEI2HTML can now split a TEI Lite document into several linked HTML subdocuments. See the main entry for tei2latex: TEILITE to LaTeX2e, or FTP: ftp://ftp.emn.fr/incoming/fekete/tei2latex-0.2.tar.gz. [Previously: Announcement from Jean-Daniel Fekete (Universite de Paris-Sud) for tei2latex version 0.1: "tei2latex is a Perl5 Program to Translate TEI Lite Documents into LaTeX2e documents..." See also announcement on TEI site.]
- FTP tei2latex (version 0.1c, July 25, 1996); [archive copy]
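The tag-set selection that the Pizza Chef automates can also be pictured by hand. The sketch below assembles a document type declaration whose internal subset switches on chosen tag sets through parameter entities, which is the modular mechanism described in the list above. The entity names (`TEI.prose`, `TEI.linking`) and the system identifier `tei2.dtd` reflect common P3 usage but are assumptions here and should be checked against the Guidelines; the helper name `tei_doctype` is hypothetical.

```python
# Main driver FPI as quoted in the TEI Lite section of this page.
MAIN_FPI = "-//TEI P3//DTD Main Document Type 1995-09//EN"


def tei_doctype(tagsets, system_id="tei2.dtd"):
    """Return a DOCTYPE declaration that marks each named tag set as INCLUDE."""
    switches = "\n".join(
        f"  <!ENTITY % {name} 'INCLUDE'>" for name in tagsets
    )
    return (
        f'<!DOCTYPE TEI.2 PUBLIC "{MAIN_FPI}" "{system_id}" [\n'
        f"{switches}\n"
        "]>"
    )


# Assumed entity names: one base tag set plus one additional tag set.
doctype = tei_doctype(["TEI.prose", "TEI.linking"])
print(doctype)
```

Some tools, such as those that need a "completely compiled" DTD, cannot process this parameterized form directly, which is why normalizers are recommended in the notes above.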
Oxford Text Archive (OTA)
[CR: 19961022] [Table of Contents]
For some twenty years the Oxford Text Archive has been collecting electronic texts, and has sponsored extensive research involving the use of SGML in an academic setting. The Archive is "a facility provided by Oxford University Computing Services and forms part of the Humanities Computing Unit . . . serving the interests of the academic community by providing low-cost archival and dissemination facilities for electronic texts."
"The Archive contains electronic versions of literary works by many major authors in Greek, Latin, English and a dozen or more other languages. It contains collections and corpora of unpublished materials prepared by field workers in linguistics. It contains electronic versions of some standard reference works. It has copies of texts and corpora prepared by individual scholars and major research projects worldwide. The total size of the Archive exceeds a gigabyte and there are over 2000 titles in its catalogue." [from the "General Information" page]
"All texts which are publicly available from the Archive's FTP server are first converted to a standard format. This format conforms to the recommendations of the Text Encoding Initiative (TEI), and is therefore an application of ISO 8879, Standard Generalized Mark Up Language (SGML). A catalog of electronic texts in the Archive is available in SGML format. OTA is also the authoritative FTP site for a significant corpus of literary texts encoded in (TEI) SGML by members of the Oxford Text Archive project and by others. An October 1996 snapshot of the OTA file listings provided here illustrates the range of texts in SGML format available to the academic public via FTP; (compare: snapshot of date: December 2, 1994).
Links:
- Home Page [alias: http://www-tei.uic.edu/orgs/tei/app/ox01.html]
- Link on the TEI Project Descriptions Page
- Text Archive Shortlist [mirror copy]
- also the mirroring of texts in the Oxford Text Archive at UMich (announcement)
- Ordering a text from the Oxford Text Archive
- Information on the TEI and SGML from the Oxford Text Archive
- The Oxford Text Archive public ftp service
- FTP: ftp://ota.ox.ac.uk/pub/ota/public
Addresses:
The Oxford Text Archive
Oxford University Computing Services
13 Banbury Road
Oxford OX2 6NN
UK
Tel: +44 01865 273238
FAX: +44 01865 273275
Email: archive@sable.ox.ac.uk
University of Virginia Electronic Text Center
[CR: 19980104] [Table of Contents]
The University of Virginia has pioneered a number of highly successful uses of (TEI) SGML in delivering online electronic texts, including structured-text searches. "Since 1992, the Electronic Text Center at the University of Virginia has combined an on-line archive of thousands of SGML-encoded electronic texts (some of which are publicly available) with a library-based Center housing hardware and software suitable for the creation and analysis of text. Through ongoing training sessions and support of individual teaching and research projects, the Center is building a diverse user community locally, and providing a model for similar enterprises at other institutions." The Text Center, in cooperation with the Bibliographical Society of the University of Virginia, is making Studies in Bibliography [On-Line] freely accessible on the Internet, based upon TEI-SGML encoding of the (ca. 1000) articles.
Many of the texts in UVA's Electronic Text Center are indexed with Open Text's PAT search engine. Some of the materials are available only to UVA's institutional members (OED 2nd edition, English Poetry Full-Text Database, Patrologia Latina, Old English Corpus, Shakespeare, French and Latin Collections). Other online texts are searchable by any researchers via the Internet, including a Middle English corpus (see bibliography), Michigan Early Modern Materials, King James Bible, and others.
The Institute for Advanced Technology in the Humanities at the University of Virginia in Charlottesville uses SGML in many of its text projects, and has developed some SGML (aware) software in this connection.
See further explanation through exploration of the following links:
- Announcement: Access to SGML Textual Analysis Resources via Open Text's PAT search engine
- University of Virginia Library Web
- Announcement from David M. Seaman for the first release of a [UVa] searchable and browseable version of the EAD tag library: see the online EAD Tag Library
- Electronic Text Center Home Page -- University of Virginia
- The University of Virginia Electronic Text Library
- Announcement for Studies in Bibliography [On-Line] freely accessible on the Internet. See also the URL for the Bibliographical Society
- The Modern English Text Collection at UVA - [provisional link, February '95] Texts tagged in TEI SGML format, filtered from TEI-SGML to HTML "on-the-fly" as you request the text. Some 550 texts were online as of March 20, 1995. The Perl scripts were developed by Jeff Herrin and David Seaman; see Online Texts Available from Virginia explanatory notes.
- Middle English Search
- GOPHER- UVA, Etexts
- David Seaman (Interview)
- David Seaman: "Gateways..."
- Query the RSV Bible
- Query the KJV Bible
- The WWW-to-PAT Gateway
- TEI2MARC program
- John Price-Wilkin. "Using the World-Wide Web to Deliver Complex Documents..." Obtain via email from LISTSERV@UHUPVM1.UH.EDU ( GET PRICEWIL PRV5N3 F=MAIL ). Also on GOPHER at info.lib.uh.edu:70.
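The on-the-fly filtering of TEI-encoded texts to HTML mentioned in the list above was done with Perl scripts that are not reproduced here. The following Python sketch only illustrates the general idea of renaming elements while preserving text. The element mapping and the sample passage are hypothetical, the sample is well-formed XML rather than SGML, and this is not the Center's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI-style element names to HTML element names.
TAG_MAP = {"body": "div", "head": "h2", "p": "p", "l": "div", "hi": "em"}


def to_html(elem: ET.Element) -> ET.Element:
    """Recursively copy an element tree, renaming elements via TAG_MAP."""
    # Elements without a mapping fall back to a neutral <span>.
    out = ET.Element(TAG_MAP.get(elem.tag, "span"))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_html(child))
    return out


sample = ET.fromstring(
    "<body><head>Chapter 1</head><p>It was a <hi>dark</hi> night.</p></body>"
)
html = ET.tostring(to_html(sample), encoding="unicode")
print(html)
```

A production filter would also have to handle attributes, character entities, and SGML features such as omitted end tags, none of which this sketch attempts.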
David Seaman, Coordinator
Electronic Text Center
Alderman Library
University of Virginia
Charlottesville, Virginia 22903
Tel: 804-924-3230
FAX: 804-924-1431
Email: etext@virginia.edu
WWW: http://www.lib.virginia.edu/etext/ETC.html
The Electronic Archive of Early American Fiction (UVA)
[CR: 19980105] [Table of Contents]
The University of Virginia Library has received a grant from the Andrew W. Mellon Foundation for $400,000 for a two-year project (1996-1998) involving digitizing and delivering electronic texts of rare books. "Two versions of each text will be made available: a TEI-conformant SGML-tagged text and color images of the pages of the first editions--a total of 118,000 pages. The project will conclude in 1998 with an economic study of usage of the e-texts compared with usage of the original rare books."
The encoding uses TEI/SGML: "As the texts are created, standard SGML markup is added to record the physical and structural characteristics of the text: title-page layout, pagination, paragraphs, verse lines, italics, accented letters, etc." The larger project goal is "to create electronic texts of rare books and to compare the usage and costs of electronic texts and of original paper texts of rare books. As part of the study, 582 first editions of the most important novels and short stories will be digitized and put on the World Wide Web. . . The project will focus on e-texts of a well-defined and comprehensive collection of early American fiction derived from the two standard bibliographies of American fiction. Specific outcomes expected from the project are: (1) electronic texts and images on the World Wide Web of 582 seminal volumes in early American literature; (2) a model process, exportable to other libraries, for creating e-texts of rare books; (3) measurement and analysis of usage and costs of the e-texts and of the originals on which they are based; (4) two written reports: (a) presenting this project as a model for the creation of images and SGML-tagged ASCII texts of rare books in research libraries; (b) on the usage and costs of e-texts of rare books; (5) presentations of the results of this project at national or international conferences."
Links:
- Announcement, HTML format [mirror copy]
- Announcement for the Project, by David Seaman [text version]
- EAF Home Page
- The Electronic Archive of Early American Fiction (1775-1850) presented by David Seaman at ACH-ALLC '97
- From the Chronicle of Higher Education's Academe Today - Project Description
- "The Electronic Archive of Early American Fiction at the University of Virginia"; [local archive copy]
- The title list
- Project Proposal
- Producing SGML-Tagged ASCII Texts. Details on the SGML aspects of the EAF Project; [mirror copy]
- Authors Included in the American Fiction Project
Addresses:
Contact:
David Seaman
Tel: +1 (804) 924-3230
[See the main UVA entry for other address details]
University of Michigan - Humanities Text Initiative (HTI)
[CR: 19980821] [Table of Contents]
"The Humanities Text Initiative (HTI) is a project of the University of Michigan Libraries, the UM Press, and the School of Library and Information Studies, with support from the College of Literature, Science & Arts. Special thanks to ITD for providing bridge equipment. The HTI is responsible for creating and maintaining new textual collections, primarily in SGML. The initial focus of the project will be in Middle English materials and American verse, as well as recent publications of the UM Press. The HTI is also available to assist faculty and students using SGML and in particular the Text Encoding Initiative Guidelines for publishing. For more information or assistance, send e-mail to hti@umich.edu or call 761-4760." [from the HTI Home Page]
Under the direction of John Price-Wilkin, the Humanities Text Initiative at the University of Michigan is developing a set of online text resources, some of which employ SGML encoding as a basis for search and retrieval. Currently available to the public via WWW browser (as an interface to PAT): TEI Guidelines for Electronic Text Encoding and Interchange (P3), Middle English Collection, Revised Standard Version of the Bible, King James Version of the Bible, Michigan Early Modern English Works (16 MB of SGML-tagged text). The center also supports other reference material restricted to the University of Michigan: OED 2nd edition, English Poetry Database, Old English Corpus, Migne's Patrologia Latina, Modern English Works. Many of the tools mirror (functionally) the resources at UVA, previously developed by John Price-Wilkin.
Texts structured in SGML are searchable by the PAT "SGML" software (from Open Text), and user interfaces to PAT are provided on the Internet using WWW forms and line-mode access. Short segments of text in a hit list reveal the SGML tags, but linking to the full text from the concordance (hit list) presents the document in a formatted appearance. Very subtle queries are possible: proximity specifiers, extended set of relational and Boolean operators, context-unit specifiers (for collocations), etc. Try these links:
- Humanities Text Initiative
- WWW-to-PAT Gateway (Umich version) -- see how the WWW-to-PAT gateway software is implemented using a Common Gateway Interface (CGI)
- Presidential Initiative Fund: "Beyond the Traditional Codex: A 'Collaboratory' for the Humanities"
- Recent PACS Review article on WWW and PAT; or obtain the article via GOPHER. Or, send the following e-mail message to listserv@uhupvm1.uh.edu: GET PRICEWIL PRV5N7 F=MAIL.
- Umich: Search TEI Guidelines for Electronic Text Encoding and Interchange. Note the improved searching and browsing interface, October 01, 1997.
- Middle English Collection
- Public Domain Modern English Search ("This is an experimental service providing dynamic access to a large and growing collection of texts in SGML using the TEI "pocket" DTD for markup. A preliminary stylesheet and navigator are also available for the TEI DTD.")
- Search Religious Texts at HTI
- Search the RSV (simple search)
- Search the RSV Bible (with Booleans, etc.)
- Search the KJV Bible
- Search the Book of Mormon (with Booleans, etc.)
- HTI SGML Server Program (SSP) [providing online support for selected SGML-encoded text and reference collections to other academic institutions]
- "Options for Presentation of Multilingual Text: Use of the Unicode Standard", by Janet C. Erickson (March 14, 1997); [mirror copy]
- See: http://dns.hti.umich.edu/htistaff/pubs/1997/ejshaw.01/: "OCR and SGML Mark-up of Documents from the Making of America Project. Report on a Directed Field Experience at Humanities Text Initiative." By Elizabeth Shaw, December, 1996; [mirror copy]
- Announcement for the HTI Middle English Compendium. See the main page.
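The proximity queries described above for the PAT interface can be illustrated with a small sketch. This is not PAT's query language or its implementation; the function names `tokenize` and `proximity_hits` and the sample line are hypothetical, chosen only to show what "two words within N words of each other" means in practice.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def proximity_hits(text, first, second, window):
    """Return (i, j) token positions where `first` and `second`
    occur within `window` words of each other, in either order."""
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [j for j, t in enumerate(tokens) if t == second]
    return [(i, j) for i in first_pos for j in second_pos
            if i != j and abs(i - j) <= window]

# Hypothetical sample text, not taken from any HTI collection file.
sample = ("Whan that Aprille with his shoures soote "
          "the droghte of March hath perced to the roote")
hits = proximity_hits(sample, "aprille", "march", 8)
```

A production engine would work from a prebuilt index rather than rescanning the text, but the matching condition is the same.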
The HTI American Verse Project
[CR: 19980115] [Table of Contents]
Summary: "The American Verse Project is a collaboration between the University of Michigan Humanities Text Initiative (HTI) and the University of Michigan Press. The project is assembling an electronic archive of volumes of American verse. Most of the archive is made up of 19th century poetry, although a few early 20th century texts are included. The full text of each volume is being converted into digital form and coded in Standard Generalized Mark-up Language (SGML) using the TEI Guidelines. . . The collection is made accessible in SGML, dynamically rendered HTML, and as a searchable database. As with all of the other Humanities Text Initiative resources, simple word and phrase searches are supported, as well as proximity searches, and searches for verses or paragraphs containing two or three phrases. The project uses an unusual model for rights for a project involving a University Press: most uses are without practical restrictions and cost, but the texts are available for sale to other publishers and agencies who wish to provide access to the texts from their own system."
". . .second goal of the project is to provide a service to scholars by advancing their ability to use Web documents in their work. Currently, the Internet does not have well-established mechanisms for authors seeking to integrate complete texts, or parts of texts, into their scholarship. The TEI Guidelines provide clearly defined ways of linking from one SGML document to portions of another; however, no one has yet set up a Web server to accept this sort of linking. The HTI proposes to explore this as part of the American Verse project. This will allow, for example, someone writing about Dickinson to embed links in his or her electronic text pointing the reader to various poems, stanzas, or lines from volumes that are part of the project without having to replicate the material within his or her own document as is currently the case. The evidence of scholarship would remain in this central archival server, rather than be replicated on a number of different scholars' machines."
Links:
- The original announcement, by John Price-Wilkin
- Updated announcement, February 1996
- [January 15, 1998] Announcement from Christina Powell for additions to the American Verse Collection: "The Humanities Text Initiative at the University of Michigan is pleased to announce the addition of 35 new texts to the American Verse Project. Works by little-known women and African-American authors not contained in other electronic text collections have been added, as have works by well-known authors such as Emily Dickinson."
- Announcement from Chris Powell of the Humanities Text Initiative at the University of Michigan for a major addition of fifteen new texts to the collection of SGML-encoded works in the American Verse Project. April 1997.
- HTI American Verse Project, Home Page
- HTI American Verse Project: Description
- Critical Applications of the HTI American Verse Collection
- Browse American Verse Texts
- Email: hti-info@umich.edu
HTI - Middle English Compendium
[CR: 19980821] [Table of Contents]
The Middle English Compendium is a project under the direction of the University of Michigan Digital Library Production Service, funded by a grant from the National Endowment for the Humanities. "The Compendium provides access to and interconnectivity among three resources: an electronic version of the Middle English Dictionary, a HyperBibliography of Middle English prose and verse based on the MED bibliographies, and a full-text Corpus of Middle English Prose and Verse. The MED and the Corpus are encoded in SGML using the Text Encoding Initiative Guidelines. The first installment (currently online) includes 1,073 HyperBibliography entries covering 1,526 copies of Middle English texts, 15,940 MED entries covering M-U (more than one-third of the projected complete print MED), and 42 searchable texts in the Corpus."
References:
- MEC Project Description - [local archive copy]
- About the Corpus of Middle English Prose and Verse
- Announcement from Christina Powell for the HTI Middle English Compendium.
Making of America (MOA) Project - University of Michigan and Cornell University
[CR: 20010309] [Table of Contents]
"Making of America (MOA) is a digital library of primary sources in American social history from the antebellum period through reconstruction. The collection is particularly strong in the subject areas of education, psychology, American history, sociology, religion, and science and technology. The collection contains approximately 1,600 books and 50,000 journal articles with 19th century imprints. The project represents a major collaborative endeavor in preservation and electronic access to historical texts." [from the Home Page]
The MOA supports SGML-based Access Systems: "We hope that users of the system will appreciate some of the functionality developed through UM's nearly eight years of experience with deploying SGML-based access and delivery systems. Attractive, easily navigated displays of results showing the number of occurrences per page are combined with displays of the page image, circumventing many of the problems encountered when relying on OCR alone. As we have opportunities to "clean up" and more richly encode OCR'd texts, the system will begin to show dynamically-rendered HTML with links to the page images. The mechanisms used for the MOA system will be provided to participants in the UM's SGML Server Program." [from the announcement].
References:
- See the XML DTD work: "The Making of America II Project."
- Making of America Project Home Page
- [March 29, 1999] Announcement from Maria S. Bonn on the 'revised' MOA facilities, Spring 1999.
- The announcement for Making of America Project, from John Price-Wilkin (March 19, 1997)
- [July 1997] Announcement from John Price-Wilkin for the addition of several hundred new volumes to the University of Michigan's Making of America site: http://www.umdl.umich.edu/moa/, "bringing to the total number of books to 1,402. That's an average of 258 pages per volume, and a total size of 742Mb of searchable text. This represents a significant body of materials for research, 85% of the size of the English Poetry Database, now accessible freely on the Internet. Nearly 200 more monographic titles will be added in the coming months, bringing the size of the monographic portion to nearly 1Gb."
- About Making of America Project
- Browse the MOA Document Collection
- Searching the MOA Database
- Advanced Searching of the MOA Database
- "Making of America. Online Searching and Page Presentation at the University of Michigan." By Elizabeth J. Shaw and Sarr Blumson. In D-Lib Magazine (July/August 1997). See especially the document section 'OCR and SGML Encoding': "automated processes were developed that (1) process the raw text files to remove non ASCII characters and clean up the text; (2) take bibliographic meta-data about the document contained in a file prepared by NMI and insert it into a TEI conformant header (see TEI Guidelines for Electronic Text Encoding and Interchange); (3) concatenate all of the document pages into a single SGML file that includes encoding that marks the content into gross divisions within front, body and back matter, page breaks and retains references to non-text images."
- "Just-in-time Conversion, Just-in-case Collections. Effectively leveraging rich document formats for the WWW." By John Price-Wilkin [Head, Digital Library Production Service, University of Michigan.] In D-Lib Magazine (May 1997). "The University of Michigan's Digital Library Production Service (DLPS) has developed substantial experience with dynamic generation of Web-specific derivatives from non-HTML sources based on several key projects and consideration of how users work with key resources. This article is based on DLPS's experience and resultant policies and practices that guide present and future projects. The DLPS currently offers dozens of collections, including more than 2,000,000 pages of SGML-encoded text and more than 2,000,000 pages of material using TIFF page images. All of the material in these collections is offered through the WWW, and nearly all of it is presented in Web-accessible formats through real-time transformations of the source material."
- HTI SGML Server Program (SSP) [providing online support for selected SGML-encoded text and reference collections to other academic institutions]
- Cornell Digital Library (CDL)
- Making of America: Frequently Asked Questions, from Cornell; [mirror copy, March 1997]
- Help : How to Use MOA resources
- See also: http://dns.hti.umich.edu/htistaff/pubs/1997/ejshaw.01/: "OCR and SGML Mark-up of Documents from the Making of America Project. Report on a Directed Field Experience at Humanities Text Initiative." By Elizabeth Shaw, December, 1996; [mirror copy]
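The three automated steps quoted above from the Shaw and Blumson article (clean the raw OCR text, place bibliographic metadata in a header, concatenate pages into one file with page breaks) can be sketched roughly as follows. This is a minimal illustration, not the MOA production code: the function names, the metadata dictionary, and the abbreviated header (which omits elements a conformant TEI header would require) are all assumptions.

```python
from xml.sax.saxutils import escape

def clean_page(raw):
    """Drop non-ASCII characters and trim trailing whitespace on each line."""
    ascii_only = raw.encode("ascii", errors="ignore").decode("ascii")
    return "\n".join(line.rstrip() for line in ascii_only.splitlines())

def build_header(meta):
    """Wrap bibliographic metadata in an abbreviated, TEI-style header."""
    return (
        "<teiHeader><fileDesc><titleStmt>"
        f"<title>{escape(meta['title'])}</title>"
        f"<author>{escape(meta['author'])}</author>"
        "</titleStmt></fileDesc></teiHeader>"
    )

def assemble(meta, pages):
    """Concatenate cleaned pages into one document, marking page breaks."""
    parts = ["<TEI.2>", build_header(meta), "<text><body>"]
    for number, raw in enumerate(pages, start=1):
        parts.append(f'<pb n="{number}">')  # SGML-style empty element
        parts.append(f"<p>{escape(clean_page(raw))}</p>")
    parts.append("</body></text></TEI.2>")
    return "\n".join(parts)

# Hypothetical metadata and OCR output, for illustration only.
meta = {"title": "Sample Volume", "author": "Unknown"}
pages = ["First page \u2014 with OCR noise\u00a0", "Second page & more"]
document = assemble(meta, pages)
```

The sketch treats each page as a single paragraph; the real process also marked gross divisions within front, body, and back matter, which would need structural cues that this example does not attempt to model.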
Addresses:
Making of America Project
University of Michigan Digital Library
Email: moa-feedback@umich.edu
Email: John Price-Wilkin
Model Editions Partnership: Historical Editions in the Digital Age
[CR: 19990902] [Table of Contents]
"Project Description: "The Model Editions Partnership is a consortium of seven historical editions which has joined forces with leaders of the Text Encoding Initiative and the Center for Electronic Text in the Humanities. The participants are now developing a prospectus setting forth editorial guidelines for publishing historical documents in electronic form. Later they will create a series of SGML demonstration models."
"Electronic editions should use standard non-proprietary formats (markup) for the representation of text, images, and other material. Standard formats, such as SGML for example, are essential if editions are to remain usable despite rapid changes in computer hardware and software. Publicly-controlled standards are essential if editions are to be used with a wide variety of hardware and software. International and national standards issued by recognized standards bodies should be preferred to de facto standards because such organizations guarantee standards based on a consensus of all interested parties. At the current time, this means use of a markup design like the Text Encoding Initiative Guidelines formulated under the Standard Generalized Markup Language architecture, adopted in 1986 by the Organization for International Standardization (ISO 8879). Relevant standards for images and other material have yet to be selected for the Partnership models." [from the Prospectus]
"Text encoded under the Standard Generalized Markup Language has become the de facto standard for creating electronic text. We will use the Text Encoding Initiative's markup to create an SGML archive for samples from each edition. From the archive, we will create both CD-ROM and Internet models." [from the Work Plan]
[September 02, 1999] As of September 1999, the MEP Web site hosted seven mini-editions. "Four of the experimental mini-editions are based on full-text searchable document transcriptions; two are based on document images; and one is based on both images and text." These include: (1) Documentary History of the First Federal Congress, (2) Documentary History of the Ratification of the Constitution and the Bill of Rights, (3) Papers of Henry Laurens, (4) Abraham Lincoln Legal Papers, (5) Papers of General Nathanael Greene, (6) Margaret Sanger Papers, (7) Papers of Elizabeth Cady Stanton and Susan B. Anthony. "The DynaText and DynaWeb software from Inso has been used to present the mini-editions; this software allows users to construct powerful searches or to use a series of built-in search forms. The mini-editions can be searched using the full range of standard search tools -- wildcards, proximity searching and Boolean searching." DynaText also "has built-in support for search of tagged documents with hierarchical structures, such as HTML and XML. By permitting searches of words and phrases inside particular tags, as well as words in documents, DynaText allows users to efficiently target their searches, resulting in more relevant, focused matches."
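The idea of searching for words "inside particular tags" can be shown with a short sketch. It uses Python's standard `xml.etree.ElementTree` rather than DynaText, whose query interface is not reproduced here; the sample document and the function name `search_in_tag` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; real MEP documents use TEI markup.
SAMPLE = """<doc>
  <head>Letter to Congress</head>
  <p>The petition was read before Congress in March.</p>
  <note>Congress adjourned the following week.</note>
</doc>"""

def search_in_tag(xml_text, tag, word):
    """Return the text of every <tag> element that contains `word`
    (case-insensitive substring match)."""
    root = ET.fromstring(xml_text)
    needle = word.lower()
    hits = []
    for element in root.iter(tag):
        content = "".join(element.itertext())
        if needle in content.lower():
            hits.append(content.strip())
    return hits

note_hits = search_in_tag(SAMPLE, "note", "congress")
```

Restricting the search to `note` returns only the annotation, even though the same word also appears in the heading and the paragraph; that narrowing is what makes tag-aware search more focused than a plain full-text match.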
References:
- MEP Home Page
- "Markup Guidelines for Documentary Editions." Prepared by David R. Chesnutt, Susan M. Hockey, and C. M. Sperberg-McQueen. Updated 4-July-1999 or later. [local archive copy]
- MEP Participants
- "A Prospectus for Electronic Historical Editions" [mirror copy, text only; July 1996]
- The Model Editions Partnership, David Chesnutt, d-lib [mirror copy, January 1996]
- "The Model Editions Partnership -- Towards a National Database." By David Chesnutt. ACH-ALLC '97 Presentation.
- "The Model Editions Partnership. 'Smart Text' and Beyond." By David R. Chesnutt. D-Lib Magazine July/August 1997 [ISSN 1082-9873].
- Model Editions Partnership - Site Reports, from C. M. Sperberg-McQueen
- MEP Work Plan [mirror copy, January 1996]
- CETH: Model Editions Partnership [mirror copy, January 1996]
American Memory Project, Library of Congress
[CR: 19970806] [Table of Contents]
"American Memory consists of collections of primary source and archival material relating to American culture and history. These historical collections are the Library of Congress's key contribution to the national digital library. Most of these offerings are from the unparalleled special collections of the Library of Congress."
"The elements in each historical collection include digital reproductions of items, a finding aid, and various accompaniments. The finding aid may consist of a catalog (a database of bibliographic records) or take the form of a register (a hierachical listing or directory)."
The principal standard for text encoding in the American Memory project is SGML, sometimes in TEI-SGML. See the Library of Congress - EAD Finding Aid Pilot Project main entry for other technical information, or "American Memory pilot--seed of a universally available Library".
Woman Suffrage Collection
One of the collections of American Memory is the Woman Suffrage Collection. "The NAWSA collection consists of 165 books, pamphlets and other artifacts documenting the suffrage campaign. They are a subset of the Library's larger collection donated by Carrie Chapman Catt, longtime president of the National American Woman Suffrage Association, in November of 1938. The collection includes works from the libraries of other members and officers of the organization including: Elizabeth Cady Stanton, Susan B. Anthony, Lucy Stone, Alice Stone Blackwell, Julia Ward Howe, Elizabeth Smith Miller, Mary A. Livermore."
Texts are prepared in SGML. See Woman Suffrage Collection: Technical Note on Texts [mirror]: "This full text collection provides researchers with an SGML-encoded (Standard Generalized Markup Language) version of the full text in addition to an HTML-encoded version of the same text. . .Images of the pages and illustrations can be accessed by a viewer launched from Panorama."
WPA Life Histories
Another collection of American Memory is: Life History Manuscripts from the Folklore Project, WPA Federal Writers' Project, 1936 - 1940. "These life histories were written by the staff of the Folklore Project of the Federal Writers' Project for the U.S. Works Progress (later Work Projects) Administration (WPA) from 1936-1940. The Library of Congress collection includes 2,900 documents representing the work of over 300 writers from 24 states. Typically 2,000-15,000 words in length, the documents consist of drafts and revisions, varying in form from narrative to dialogue to report to case history. The histories describe the informant's family education, income, occupation, political views, religion and mores, medical needs, diet and miscellaneous observations. Pseudonyms are often substituted for individuals and places named in the narrative texts."
The texts are encoded in SGML, as explained in WPA Life Histories--Editor's and Technical Notes [mirror copy]: "When initially transcribed, these texts were marked up in Standard Generalized Markup Language (SGML). The American Memory SGML markup scheme conforms to the guidelines of the Text Encoding Initiative (TEI), the work of a consortium of scholarly institutions. Since this Internet presentation employs the conventions of the World Wide Web, the SGML markup has been simplified and reprocessed to create documents in HyperText Markup Language (HTML). In the final version, SGML markup will be utilized. Interested persons may obtain the American Memory SGML document type definition (DTD) and related information by file transfer protocol (ftp) from the Library of Congress server."
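The "simplified and reprocessed" step, in which richer markup is reduced to HTML for Web delivery, can be sketched as a tag-renaming pass. This is only an illustration under stated assumptions: the input is well-formed XML rather than SGML, the element names are TEI-like but not taken from the American Memory DTD, and `TAG_MAP` and `to_html` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed mapping from TEI-like element names to HTML tags.
TAG_MAP = {"div": "div", "head": "h2", "p": "p", "hi": "em"}

def to_html(element):
    """Recursively render an element. Mapped tags are renamed;
    unmapped tags are dropped but their text content is kept."""
    inner = escape(element.text or "")
    for child in element:
        inner += to_html(child)
        inner += escape(child.tail or "")
    html_tag = TAG_MAP.get(element.tag)
    if html_tag is None:
        return inner
    return f"<{html_tag}>{inner}</{html_tag}>"

# Hypothetical source fragment.
source = ("<text><div><head>Life History</head>"
          "<p>She said it was a <hi>hard</hi> winter.</p></div></text>")
page = to_html(ET.fromstring(source))
```

The conversion is lossy by design: any distinction that has no HTML counterpart disappears, which is one reason the notes quoted above anticipate serving the SGML itself in a later version.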
African American Pamphlets
A third collection using SGML encoding is African-American Pamphlets from the Daniel A. P. Murray Collection, 1880-1920, Rare Book and Special Collections Division, Library of Congress. "The Daniel A. P. Murray Pamphlet Collection presents a panoramic and eclectic review of African-American history and culture, spanning almost one hundred years from the early nineteenth through the early twentieth centuries, with the bulk of the material published between 1875 and 1900. Among the authors represented are Frederick Douglass, Booker T. Washington, Ida B. Wells-Barnett, Benjamin W. Arnett, Alexander Crummel, and Emanuel Love."
The document African American Pamphlets: Technical Note on Texts explains the use of SGML in the text encoding.
Links for American Memory
- American Memory Home Page
- American Memory White Papers
- See also: Library of Congress - Encoded Archival Description (EAD) - Finding Aids Project
- Technical Papers and Other Information About American Memory
- RFP96-18 "Digital Images from Original Documents, Text Conversion and SGML-Encoding." Section C Description/Specification/Work Statement - documents the use of SGML [mirror copy]
- Carl Fleischhauer, Coordinator, American Memory, ELEMENTS OF DIGITAL ARCHIVAL COLLECTIONS: TECHNICAL OVERVIEW AND FORMAT DESCRIPTION (October 27, 1994) [mirror, partial links only]
- Carl Fleischhauer, "Frameworks and Finding Aids: Organizing Digital Archival Collections" [mirror, partial links only]
- Lapeyre and Usdin, "TEI and the American Memory Project at the Library of Congress" [mirror copy]
- Electronic Text Workshop Proceedings, June 9-10, 1992
Brown University Scholarly Technology Group (STG)
[CR: 19970902] [Table of Contents]
Under the guidance of Allen Renear (Director), the Brown University Scholarly Technology Group (STG) "supports the development and use of advanced information technology in academic research, teaching, and scholarly communication. STG pursues this mission by exploring new technologies and practices, developing specialized tools and techniques, and providing consulting and project management services to academic projects. STG focuses on four related areas: (1) educational applications of hypertext and hypermedia; (2) SGML textbase development; (3) networked scholarly communication; (4) electronic curriculum and collaborative learning environments."
STG's SGML Textbase Development is an example of the technology focus: "STG is committed to open, high-function standards for data representation. Most important among these are SGML (the Standard Generalized Markup Language, a meta-grammar for developing encoding systems for textual data), and two SGML-based encoding systems: HTML (Hypertext Markup Language, used in World Wide Web) and TEI (Text Encoding Initiative Guidelines). Among STG's consultants are internationally active experts in SGML and TEI, and one of its affiliated projects, the Women Writers Project, is among the world's leading SGML/TEI databases."
Links:
- Home Page, Index
- The Scholarly Technology Group, Description [August 1997; December 1996 mirror copy]
- Scholarly Technology Group staff members
- STG Projects
- SGML & HTML Document Validation Service: enter the URL or the document. "This service validates SGML (and therefore HTML) documents using nsgmls, a part of the SP SGML parser. It is currently set up to validate against DTD's (SGML rules) for most HTML dialects, as well as some of the SGML document types we use here at STG."
- STG DynaWeb Server w/ DynaWeb 3.01, and 3.1, from Inso [formerly: Electronic Book Technologies, Inc.]
Addresses:
Scholarly Technology Group
Computing and Information Services
Box 1885
Brown University
Providence, RI 02912
USA
Tel: 401-863-7312
Fax: 401-863-9313
WWW: http://www.stg.brown.edu/
Email: info@stg.brown.edu
The Brown University Women Writers Project
[CR: 19990329] [Table of Contents]
The Women Writers Project is creating a full-text database of women's writing in English from the period 1330-1830. Texts are encoded in TEI SGML, as explained in the following excerpt from the online overview. "The WWP is developing its encoding system in close cooperation with the international Text Encoding Initiative, of which it is a leading affiliated project. Members of the WWP participate in TEI activities in various ways and participate in research on text encoding and computing methodology. The use of the TEI encoding guidelines ensures not only a very high level of encoding sophistication and sensitivity to scholarly needs, but also, because the TEI Guidelines conform to international standards (namely ISO 8879:1986 SGML), the resulting WWP textbase is entirely free of hardware and software dependencies. Creating this textbase and developing derived products also involves the WWP in related research and scholarship on the application of information technology to humanities research and teaching -- particularly literary text encoding, textbase development, computer-based publishing and textual editing, and computer-supported collaborative work (CSCW) in the humanities."
[March 29, 1999] In March 1999, Julia Flanders of the Brown University Women Writers Project posted an announcement indicating that the WWP textbase is now freely available online in a beta-test version. The Women Writers Project textbase "is a collection of pre-Victorian women's writing in English. The initial publication will include over 200 texts from the period 1450-1830, with 50-100 more being added in the first year. The texts cover a huge range of genres and topics, and represent an unparalleled resource for the study of women's writing and history, and of English literature generally." Features of the system include: "(1) The texts are richly encoded in SGML, using the full TEI Guidelines. The transcription preserves the text of the original document in full, including all front and back matter, with original pagination, typography, spelling, and rendition. Title pages, signatures, catchwords, and other bibliographic details are transcribed in full. (2) The textbase will be published over the web using Inso's DynaWeb software, giving the user full access to the SGML tagging for searching and navigation. (3) Varied style sheets will allow the user to view the text with its original typography and errors intact, or in a corrected and regularized form. (4) Users may search the entire textbase or individual texts for words and phrases, either on their own or within specified contexts, using the SGML markup. Users may also search for sets of texts which meet certain criteria such as date, genre, place of publication, and so forth. (5) The primary source material will be accompanied by topic essays and biographical information for each author."
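The choice between viewing a text "with its original typography and errors intact, or in a corrected and regularized form" depends on markup that records both readings. The sketch below assumes TEI P3-style `<orig reg="...">` and `<sic corr="...">` elements in well-formed XML; it is not the WWP's style sheet mechanism, and the sample line and the function name `render` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical line: one old spelling and one printer's error,
# each carrying its modern or corrected reading in an attribute.
SAMPLE = ('<l>The <orig reg="love">loue</orig> of '
          '<sic corr="the">teh</sic> world</l>')

def render(element, regularize):
    """Flatten an element to plain text, choosing either the
    transcribed reading or the regularized/corrected one."""
    if regularize and element.tag == "orig" and "reg" in element.attrib:
        return element.attrib["reg"]
    if regularize and element.tag == "sic" and "corr" in element.attrib:
        return element.attrib["corr"]
    text = element.text or ""
    for child in element:
        text += render(child, regularize) + (child.tail or "")
    return text

line = ET.fromstring(SAMPLE)
original_view = render(line, regularize=False)
regularized_view = render(line, regularize=True)
```

Because both readings live in the same encoded file, the two views are generated from one source rather than maintained as separate editions.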
Links:
- The Brown University Women Writers Project Home Page
- An Overview of the Brown Women Writers Project
- Research and Encoding
- WWP History
- Women Writers Project Staff and Advisors
- Women Writers Project Text List
- Resources
- Newsletter
- Related web sites for the study of women's writing
- WWP Internal Reference and Documentation Collection
- Renaissance Women Online Collection
- "Putting Renaissance Women Online", by Paul Caton, Brown University Women Writers Project. See the ICCH Conference entry.
Addresses:
Women Writers Project
Box 1841
Brown University
Providence, RI 02912
USA
Tel: +1 (401) 863-3619
FAX: +1 (401) 863-9313
Email: WWP@brown.edu
Midrash Pirqe Rabbi Eliezer Electronic Text Editing Project
[CR: 19970428] [Table of Contents]
[Site under construction, by editor Lewis Barth]
The project addresses ". . .the process of creating a manual for encoding an electronic edition of Pirqe Rabbi Eliezer (Pirqe R. El.), the Chapters of Rabbi Eliezer. Pirqe R. El. is a midrashic retelling of significant aspects of the biblical narrative, from the creation story through the Book of Esther. . . The initial goal of this project was to create a critical edition of Pirqe Rabbi Eliezer. The goal has now expanded to include electronic publication of all Pirqe R. El. manuscripts and fragments in two forms: digital facsimiles and transcriptions with hypertext links. There are two reasons for this: 1) the quantity of textual material and 2) recent hypotheses regarding the development of medieval Hebrew manuscripts which argue that each manuscript of a work is a completely new literary creation. . .[We conclude] SGML/TEI markup is particularly useful for scripturally based text, i.e., texts from the vast literatures of Judaism, Christianity and Islam which frequently cite biblical or koranic verses. There are numerous genres in these religious literatures (exegetical works, homilies, scriptural essays, dialogues, legal texts, liturgical texts, religious poetry, etc.). They all have in common the citation of texts sacred to a religious community, the frequent mention of characters, places and institutions found in such texts, plus references to later individuals, places and institutions. In addition, these texts are often macaronic, i.e., they contain more than one human language." [extract of ACH paper, below; provisional]
Links:
- Project Home Page
- THE HEBREW CHARACTER REPRESENTATION TABLE IN SGML
- Mirror copy of ACH Paper, HTML format
- INDEX OF HUC MS. 75: DIGITIZED COPY [sample images of mss in .GIF format]
- HUC First Edition - a complete digitization of the first edition of Pirq. Rabbi Eliezer, Constantinople, 1514, from the HUC Klau Library
- Link on the TEI Project Descriptions Page
- See provisionally: "Electronic Edition of the Midrash Pirqe Rabbi Eliezer: Creating an Encoding Manual," by Lewis M. Barth, pp. 34-36 in ACH-ALLC '96. Abstracts. University of Bergen, June 25-29, 1996.
- Or the PDF source for ACH-ALLC paper: http://www.hd.uib.no/barth1.pdf [mirror copy]
Addresses:
Pirqe Rabbi Eliezer Electronic Text Editing Project
Attention: Lewis M. Barth
Hebrew Union College - Jewish Institute of Religion
3077 University Avenue
Los Angeles, California 90007-3796
Office: (213) 749-3424
Office FAX: (213) 749-1192
Internet: lbarth@mizar.usc.edu
University of Cincinnati College of Law, Center for Electronic Text in the Law
[CR: 20001002] [Table of Contents]
"CETL currently produces two text databases that can be accessed from the Internet. The first is the University of Cincinnati's portion of DIANA, a unique database of human rights materials. In cooperation with a numbe r of other North American law school libraries, CETL offers through the DIANA database a comprehensive source of human rights documents to researchers and activists around the wor ld and supports the work of the Urban Morgan Institute for Human Rights, an institution affiliated with the College of Law. The documents that the University of Cincinnati contributes to DIANA are Standard Generalized Markup Language (SGML) versions of United Nations human rights documents, historic United Nations material and documents from the Organization of African Unity. Putting these documents into SGML optimizes them for users' needs,b ecause SGML allows maximum access to the information they contain in a variety of formats across computing platforms."
"The second database, the Securities Lawyer's Deskbook, provides electronic acc ess from the Internet to the text of the Securities Act of 1933 and the Securities Exchange Act of 1934, together with the rules and forms necessary for compliance with these statutes. The existence of this database aids practitioners and scholars and su pports the work of the College's Center for Corporate Law."
CETL makes use of DynaWeb (from EBT) for management and delivery of documents from an SGML database. Documents themselves use the (abridged) TEI Header for bibliographic control. By clicking on the "TEI" icon for a given document, the SGML version is sent by DynaWeb instead of the HTML version. DynaWeb is Web server software that, in addition to supporting standard HTTPD Web server protocol, "converts DynaText electronic books stored in SGML into HTML on-the-fly for rapid navigating and searching by any Web browser. . . This product effectively shields publishers from the evolving HTML standards by allowing them to store and manage documents in the stable SGML format, and subsequently re-target the information to the latest version of HTML with minimal incremental effort."
The Legal Electronic Text Consortium is associated with DIANA: it is "comprised of a number [thirteen (13) as of August 30, 1996] of academic and research law libraries whose common goal is to further the digitization of legal materials through research and cooperative development. . . Members of LETC are currently engaged in a number of cooperative projects. These include the building of the DIANA database of human rights documents and the development of legal extensions to the Text Encoding Initiative SGML document type definition."
- CETL Home Page
- Center for Electronic Text in the Law
- Center for Electronic Text in the Law: Text Production [Provides a detailed presentation on methods used by CETL for the creation and delivery of XML- and SGML-encoded documents, including a survey of the specific software tools used.]
- CETL Staff
- From Nicholas D. Finke, Director of the Center for Electronic Text in the Law: a posting to TEI-L describing other projects which use SGML in the field of law.
- DIANA Collections using DynaWeb
- Sample documents (Charter of the United Nations), using the DynaWeb interface
- About DIANA
- Legal Electronic Text Consortium
- More on DynaWeb
Project addresses:
Center for Electronic Text in the Law
University of Cincinnati College of Law Library
Clifton and Calhoun Streets
P.O. Box 210142
Cincinnati, OH 45221-0142
Tel: (513) 556-0103
FAX: (513) 556-6265
Email: cetl@law.uc.edu
British National Corpus Project (BNC)
Description 2007-11:
"The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written...
The latest edition [as of 2007-11] is the BNC XML Edition, released in 2007. The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age, region and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins.
The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.
Work on building the corpus began in 1991, and was completed in 1994. No new texts have been added after the completion of the project but the corpus was slightly revised prior to the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007). Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words, one million spoken) and the BNC Baby (four one-million word samples from four different genres)..."
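As an illustration of the word-level part-of-speech encoding described above, the following is a minimal sketch in Python. The sample sentence is invented, and the element and attribute names (`s`, `w`, `c`, `c5`, `hw`, `pos`) are assumptions modeled on the general style of the XML edition rather than quotations from the BNC documentation; consult the Reference Guide for the actual scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence in the style of a TEI-encoded, CLAWS-tagged corpus.
# Element and attribute names are assumptions for illustration only.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def tagged_words(sentence_xml):
    """Return (word, word-class code) pairs for every <w> element."""
    root = ET.fromstring(sentence_xml)
    return [(w.text.strip(), w.get("c5")) for w in root.iter("w")]


pairs = tagged_words(SAMPLE)
for word, code in pairs:
    print(word, code)
```

Because the tags sit on individual words, a consumer can recover either the plain text or the tagged sequence from the same file without a separate annotation layer.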
Description vintage-1995:
"... the BNC is a very large (100 million words) corpus of modern English, both spoken and written, produced by an academic/industrial consortium led by Oxford University Press, involving Longman UK Ltd, Chambers/Larousse, Oxford University Computing Services, the University of Lancaster and the British Library. Production of the corpus was funded by the commercial partners and by the UK Government, under the DTI/SERC Joint Framework for Information Technology." [...] "At the last count, the corpus contained 104 million words, totalling about 1.6 gigabytes of disk space. The corpus is automatically segmented into orthographic sentence units, and each word in the corpus is automatically assigned a word class (part of speech) code by the CLAWS software developed at the University of Lancaster. The corpus is encoded according to the TEI (Text Encoding Initiative)'s Guidelines, using the ISO standard SGML to represent this and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI conformant header file." [from the FAQ, November 1994]
"The [encoding] format used by the BNC is called the Corpus Document Interchange Format (CDIF for short) and is fully documented in the CDIF Reference Manual. An article by Dominic Dunlop and Gavin Burnage titled Encoding the British National Corpus, written while the BNC was being developed, describes the scheme and its use within the project in some detail. CDIF is an application of SGML (ISO 8879: Standard Generalized Markup Language) and can therefore be used with any SGML-compliant software. SGML is a widely used international standard format for which many public domain and commercial utilities already exist; new software is also coming on the market very rapidly. CDIF is formally defined by an SGML Document Type Definition (DTD)." [from the Encoding description, March 1995]
References:
- BNC XML Edition
- Reference Guide for the British National Corpus (XML Edition). February 2007 or later.
- Using BNC with Xaira/SARA
- BNC web site index
- BNC web site home page
Earlier references:
- See the BNC FAQ document [(mirror July 1995)]
- See the slightly earlier BNC FAQ document
- Connect to the British National Corpus WWW server
- Linguistic tagging of the British National Corpus
- BNC Text Encoding procedures, mirrored here
- See the article of Dominic Dunlop in the CHUM Special Issue on TEI
- Dunlop, Dominic; Burnage, Gavin. "Encoding the British National Corpus." A description published in [pages ??-?? of] English Language Corpora: Design, Analysis and Exploitation: Papers from the 13th International Conference on English Language Research on Computerized Corpora [Nijmegen 1992], edited by Jan Aarts, Pieter de Haan and Nelleke Oostdijk. The document may be available from the BNC WWW server, or see this mirror copy.
- CDIF Reference Manual (Corpus Document Interchange Format)
- Dunlop, Dominic. "The Relationship Between the TEI.2 Header and the BNC Corpus and Text Headers." Technical Report TGCW34. September 4, 1992. 37 pages. Compares the BNC Text Header with the TEI (P2) Header (also available from UICVM Listserver ['APBNW1 PS'] and other TEI sites)
- SARA (SGML-Aware Retrieval Application) - Overview [mirror copy, March 04, 1996]
- SARA [mirror copy]
- SARA Protocol, BNC reference manual (SGML- Panorama or PDF)
Addresses:
British National Corpus
Oxford University Computing Services
13 Banbury Road
Oxford OX2 6NN
TEL: +44 (1865) 273 280
FAX: +44 (1865) 273 275
Email: natcorp@oucs.ox.ac.uk
Email: natcorp@vax.ox.ac.uk
Email: dominic@natcorp.ox.ac.uk (Dominic Dunlop)
Linguistic Data Consortium (LDC)
[CR: 19981002]
"The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from ARPA, and is partly supported by grant IRI-9528587 from the National Science Foundation."
"The best formatting mechanism for text is Standard Generalized Markup Language (SGML); it is widely and commonly used (more so than SPHERE: the HyperText Markup Language (HTML), which is the format used throughout the World Wide Web, is actually one instance of SGML usage), it can be kept quite simple, there is free software available to support its use, and it is adaptable to a wide range of languages and uses. It includes the notion of a "Document Type Definition" (DTD), which provides a clear and complete specification of the markup used in a given collection of text. The LDC does not require that a fully functional DTD be supplied, or that the SGML tagging of a text collection be fully compliant to a given set of conventions (e.g. those developed by the Text Encoding Initiative, TEI); what is essential is that the markup be clear, consistent, and correctly applied, so that it can be "parsed" according to a finite set of rules."
Several of the text corpora included in distributions from the LDC use SGML encoding. For example, with respect to the Association for Computational Linguistics Data Collection Initiative (ACL/DCI), 620 MB: "The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879). SGML provides a labelled bracketing of the text, with labels permitted to have associated feature-value pairs. Eventually, ACL/DCI will be furnished with tags conformant to the Text Encoding Initiative standards. Because of time constraints, the files in this initial release are not so conformant, and thus are likely to be re-released eventually in a conformant state. The ACL/DCI welcomes help in establishing "proper" SGML coding for all of its collection."
Or, with respect to the United Nations Parallel Text Corpus (English, French, Spanish) [Catalog number LDC93T4A; set of three compact discs]: "In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material, and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table."
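The kind of filter mentioned above can be sketched in a few lines of Python. This is not the script shipped on the discs; it is a minimal illustration that assumes simple tags with no `>` inside attribute values, does not expand entity references, and uses invented tag names in the sample.

```python
import re

# Matches a start tag, end tag, or declaration in its simplest form.
TAG = re.compile(r"<[^>]+>")


def strip_sgml(raw: bytes) -> str:
    """Decode ISO 8859-1 bytes, drop markup, and keep non-empty text lines."""
    text = raw.decode("latin-1")
    text = TAG.sub("", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical sample; the tag names are illustrative, not the corpus DTD.
sample = "<DOC>\n<TEXT>\n<p>La séance est ouverte.</p>\n</TEXT>\n</DOC>\n".encode("latin-1")
plain = strip_sgml(sample)
print(plain)
```

Decoding as Latin-1 before filtering keeps the accented letters in the upper half of the character table intact in the plain-text output.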
With respect to the European Corpus Initiative Multilingua Corpus I [Catalog number LDC94T5]: "Most of the data is marked up in TEI-compliant SGML -- see mci.edt for discussion, and the bin and src directories for tools to assist in processing and accessing the data. The top-level file mci.sgm provides an SGML way in to the corpus as a whole, or for selected parts of it -- again see mci.edt for further instructions."
[Re: TIPSTER format] "The format uses a labelled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). The SGML DTD's used for verification at NIST are included on the CDs. All five different datasets have their major structures identical for easier reading, but have different minor structures. The philosophy in the formatting both at the University of Pennsylvania and at NIST has been to preserve as much of the original structure as possible, but to provide enough consistency to allow simple decoding of the data."
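A labelled bracketing of this kind is simple enough to decode without a full SGML parser, which is the point of the consistency goal quoted above. The following Python sketch assumes tag names (`DOC`, `DOCNO`, `HL`, `TEXT`) and document identifiers chosen for illustration; the DTDs on the CDs define the real structures.

```python
import re

# One <DOC>...</DOC> bracket per document; DOTALL lets "." span newlines.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc, name):
    """Return the stripped content of the first <name>...</name> pair, or None."""
    m = re.search(rf"<{name}>(.*?)</{name}>", doc, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_docs(data):
    """Split a data file into documents and pull out two labelled fields."""
    return [
        {"docno": field(d, "DOCNO"), "text": field(d, "TEXT")}
        for d in DOC_RE.findall(data)
    ]


# Hypothetical data file with two documents that differ in minor structure.
SAMPLE = """<DOC>
<DOCNO> X-0001 </DOCNO>
<HL> First headline </HL>
<TEXT>
Body of the first article.
</TEXT>
</DOC>
<DOC>
<DOCNO> X-0002 </DOCNO>
<TEXT>
Body of the second article.
</TEXT>
</DOC>
"""

docs = parse_docs(SAMPLE)
for d in docs:
    print(d["docno"], "->", d["text"])
```

Because the major structures are shared, the same function handles both documents even though only the first carries a headline field.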
Links:
- Linguistic Data Consortium Home Page
- LDC SGML encoding: "Through a now refined process the Language Analysis Center is able to produce a final digitized text of approximately 8,000 entries complete with SGML tags, in the span of one month. . . All dictionaries are fully compliant with the latest version of the SGML and TEI guidelines. A Document Type Definition (DTD) is used to describe the structure of tags for each dictionary. It is a fluid document and is delivered with the final version of the project."
- [October 01, 1998] The Linguistic Data Consortium located at the University of Pennsylvania has announced the release of a new text corpus in the JURIS (Justice Department Retrieval and Inquiry System) collection, from the U.S. Department of Justice. The new two-CD-ROM JURIS set contains a total of 694,667 document units in 1664 individual text files, with text data ranging from the 1700's to the early 1990's. Examples come from Case Law, Executive Orders, Treaties and other International Agreements, Federal Regulations, Administrative Law, Department of Justice Briefs, Freedom of Information Act, Indian Law, Statutory Law, Immigration and Naturalization Law, Tax Law, etc. As with much of the LDC corpus material, these documents are structured in SGML: "The text files are all formatted using a set of SGML tags to mark document boundaries, and to mark major structural features within documents. As with file organization, the markup is derived from the document structures as provided by the Justice Department."
- [October 01, 1998] Also released by LDC is a corpus of "1997 Mandarin Broadcast News Speech and Transcripts." These data are encoded using "SGML tagging to identify story boundaries, speaker turn boundaries, and phrasal pauses; these tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish Hub-4 collections."
- [April 21, 1998] Announcement from the Linguistic Data Consortium for the release of a new SGML-encoded speech corpus. The 1996 Broadcast News Speech Corpus "contains a total of 104 hours of broadcasts from ABC, CNN, and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts" (including programs such as ABC Nightline, ABC World Nightly News, CNN Headline News, CSPAN Washington Journal, NPR All Things Considered, NPR Marketplace, and others). "The released version of the transcripts is in SGML format, and there is accompanying documentation, and an SGML DTD file, included with the transcription release."
- About the Linguistic Data Consortium
- Catalog: Speech Corpora
- Catalog: Text Corpora
- The LDC as publisher and distributor of speech corpora; [mirror copy]
- Catalog: Corpora Available from The Linguistic Data Consortium
- Sample news release: Spanish News Corpus [March 1996] "The presentation of text data in these collections is modeled on the TIPSTER corpus. Within each data file, SGML tagging is used (1) to mark article boundaries, (2) to delimit the text portion within each article, and (3) to label various pieces of information about the article that are external to the text content (e.g. headlines, bylines, and so on)."
- FTP site: ftp.cis.upenn.edu/pub/ldc
- Sample Announcement: "European Language Newspaper Text" ("...that has been marked using SGML - 65 million words of French"); see also the description on the LDC server
Addresses:
Linguistic Data Consortium
3615 Market Street
Suite 200
Philadelphia, PA 19104-2608
Phone: 215-898-0464
FAX: 215-573-2175
Email: ldc@ldc.upenn.edu [General Information]
Email: online-service@ldc.upenn.edu [LDC-Online]
IATH - Institute for Advanced Technology in the Humanities, University of Virginia at Charlottesville
[CR: 20000225]
IATH (Institute for Advanced Technology in the Humanities) at the University of Virginia at Charlottesville sponsors text analysis as part of its broad goal "to explore and expand the potential of information technology as a tool for humanities research." Several IATH projects use SGML encoding in the preparation of electronic scholarly text editions, and software developed under IATH auspices has occasionally been released as well. Projects having a significant SGML emphasis have included the Rossetti Archive, the Piers Plowman database, the Walt Whitman Hypertext Archive, the Blake Archive, and others. Structured searches of SGML documents are supported for several collections, including Dante's Inferno, Blake Illuminated Books, The Greek Manumissions Project, and others.
The Institute for Advanced Technology in the Humanities has developed some SGML-aware software in connection with its digital library projects. For example, Inote (An Image Annotation Program in Java) "can automatically identify lines or columns of text for annotation, and [the authors] are working on SGML utilities that will allow a user to connect SGML transcriptions and annotated images." MU: Web-Based SGML Markup "is a set of Perl programs which, in combination with a Web server, allow one person or a group of people to create and modify SGML files, using standard forms-capable Web browsers as the editing interface. MU supports multiple editing sessions through lock-files, and it builds its forms from simple ascii-text tag templates. MU is distributed with a sample template for the TEI-lite DTD." Also, in early 1995, IATH announced (pre-release) Babble: A Synoptic Text Viewer. "Babble, under development by Robert Bingler, is an SGML-capable synoptic text tool that can display multiple texts in parallel windows. It uses Unicode, an ISO 16-bit character set standard, which allows multilingual texts, using mixed character sets, to be displayed simultaneously. Babble also allows users to search for strings in text or in tags, and to link open texts for scrolling and searching."
IATH links:
- The IATH Home Page
- IATH: Marking Up Digital Images; [local archive copy]
- Inote: An Image Annotation Program - 'we are working on SGML utilities that will allow a user to connect SGML transcriptions and annotated images.'
- Babble: A Synoptic Unicode Browser - SGML-capable
- MU: Web-Based SGML Markup
- Iteach: Software for Distance Learning - 'Iteach is a Java-based toolkit for real-time, networked instruction. It includes an html/xml-compliant text editor with basic text-formatting features, group chat, a whiteboard, and a calculator. Text and image annotation tools will be added soon.'
- SGML at the Institute (IATH) - Overview of SGML
- Research reports
- General reports
- IATH Perl Scripts for Working with SGML Files
- IATH Perl Script: Create/Add SGML Line Tag Attributes.
IATH: Piers Plowman Database (Hoyt Duggan)
[CR: 19980105]
Included in the IATH archive is the Piers Plowman database, demonstrating the work of Hoyt Duggan. The Piers database uses TEI SGML in the encoding of text-critical information. See:
- Piers Table of Contents.
- Creating an Electronic Archive of Piers Plowman
- the main directory for the Piers database.
- Piers Plowman: Archive Goals
- Update on the electronic Piers Plowman project (December 14, 1994) by Hoyt Duggan
IATH: Rossetti Archive (Jerome J. McGann)
The Rossetti Archive at IATH features the writings and pictures of Rossetti encoded in SGML. "The Rossetti Archive is a hypermedia environment for studying the works of the Pre-Raphaelite poet and painter D[ante] G[abriel] Rossetti (1828-1882). The archive is a structured database holding digitized images of Rossetti's works in their original documentary forms. Rossetti's poetical manuscripts, early printed texts - including proofs and first editions - as well as his drawings and paintings are stored in the archive, in full color as needed. The materials are marked up for electronic search and analysis, and they are supplied with full scholarly annotations and notes. . . A key feature of the structure of the Archive is its SGML markup (SGML= Standard Generalized Markup Language). This is a formal marking scheme that establishes a set of conceptual categories of information that are determined to be especially important for study purposes. The documents in the Archive are all SGML marked to allow the documents to be searched and analyzed for the marked up features and categories. Thus, all of the pictorial documents are marked for a full physical description of the picture (e.g., medium, dimensions, frame, etc.) or a full treatment of its production and transmission history. Similar formal categories are established for searching and analyzing the Archive's other documents (printed texts, manuscripts, proofs, etc.). The Archive has a search engine (Pat/Lector) for executing the analytic operations made possible by the SGML markup scheme." (from the home page)
See:
- The Rossetti Archive
- The Rossetti Hypermedia Archive: Introduction
- Rossetti Archive DTD (and tag documentation)
- SGML: Rossetti Archive DTD
- Index of /rossetti/
- Searchable Index of rossetti database (WAIS)
- SGML: The Rossetti Archive and Image-Based Electronic Editing, by J. McGann
IATH: William Blake Archive
[CR: 20000225]
[February 25, 2000] TEI-Encoded Edition of Erdman's Complete Poetry and Prose of William Blake. Matt Kirschenbaum recently posted an announcement which reports on a significant milestone reached in the Blake Archive. "The editors of the William Blake Archive are very pleased to announce the publication of our searchable SGML-encoded electronic edition of David V. Erdman's Complete Poetry and Prose of William Blake. The Blake Archive's electronic Erdman is tagged in SGML using the Text Encoding Initiative DTD and is presented online using Inso's DynaWeb software. But we should note that Erdman's edition is an extraordinarily rich and complex textual artifact in its own right, and encoding and rendering it has proven a substantial technical challenge. The addition of the electronic Erdman means that the site is now inclusive of an even greater range of Blake's work than the approximately 3000 digital i

