Cover Pages: Perseus Project

[October 26, 2000] "The Perseus Project is an evolving digital library of resources for the study of the ancient world and beyond." The project began in 1985 under the direction of Gregory Crane, producing a heterogeneous collection of materials, textual and visual, on the Archaic and Classical Greek world. Since then, the Perseus Project has published two CD-ROMs and created the on-line Perseus Digital Library. The Perseus Digital Library Project has encoded several thousand documents of early Greek and Latin using SGML/XML markup -- "hundreds of megabytes of SGML texts, with 375,000 explicit links between resources." It is one of the most elaborate and successful digital library projects ever designed.

References:

Perseus Project web site
"Knowledge Management in the Perseus Digital Library." By Jeffrey A. Rydberg-Cox, Robert F. Chavez, David A. Smith, Anne Mahoney, and Gregory R. Crane. In Ariadne [ISSN: 1361-3200] Issue 25 (September 2000). "The Perseus digital library is a heterogeneous collection of texts and images pertaining to the Archaic and Classical Greek world, late Republican and early Imperial Rome, the English Renaissance, and 19th Century London. The texts are integrated with morphological analysis tools, student and advanced lexica, and sophisticated searching tools that allow users to find all of the inflected instantiations of a particular lexical form. The current corpus of Greek texts contains approximately four million words by thirty-three different authors. Most of the texts were written in the fifth and fourth centuries B.C.E., with some written as late as the second century C.E. The corpus of Latin texts contains approximately one million five hundred thousand words mostly written by authors from the republican and early imperial periods. The digital library also contains more than 30,000 images, 1000 maps, and a comprehensive catalog of sculpture. Collections of English language literature from the Renaissance and the 19th century will be added in the fall of 2000. In developing this collection of SGML and now XML documents, we have benefited from the generality and abstraction of structured markup which has allowed us to deliver our content smoothly on a variety of platforms. The vast majority of our documents are tagged according to the guidelines established by the Text Encoding Initiative (TEI). While we have had a great deal of success with these guidelines, other digitization projects have found other DTDs more useful for their purposes. 
As XML becomes more widely used, more and more specifications for different subject fields and application domains are being created by various industries and user communities; a well known and extensive list of XML applications includes a wide variety of markup standards for different domains ranging from genealogy to astronomy. Customized DTDs ease the encoding of individual documents and often allow scholars to align their tags with the intellectual conventions of their field. At the same time, they can raise barriers to both basic and advanced applications within a digital library. . . One of the challenges in building this type of [digital library] system is the ability to apply these sorts of tools in a scalable manner to a large number of documents tagged according to different levels of specificity, tagging conventions, and document type definitions (DTDs). To address this challenge, we have developed a generalizable toolset to manage XML and SGML documents of varying DTDs for the Perseus Digital Library. These tools extract structural and descriptive metadata from these documents, deliver well formed document fragments on demand to a text display system, and can be extended with other modules that support the sort of advanced applications required to unlock the potential of a digital library." [cache]
"Generalizing the Perseus XML Document Manager." By Anne Mahoney, Jeffrey A. Rydberg-Cox, and Clifford E. Wulfman. To be presented at the Workshop on Web-Based Language Documentation and Description, December 12-15, 2000, Institute for Research in Cognitive Science (IRCS), University of Pennsylvania, Philadelphia, Pennsylvania, USA.
"Designing Documents to Enhance the Performance of Digital Libraries." By Gregory Crane. In D-Lib Magazine Volume 6 Number 7/8 (July/August 2000). "In tagging texts, we begin with the basic document structure: chapters, sections, headers, notes, blockquotes, etc. We have only begun the process of identifying individual bibliographic citations and linking these to formal bibliographic records for author and work. We have tagged most foreign language quotations, letters, extracts of poetry, etc. by hand. Two other levels of information are added to the documents. The boundary between these levels is flexible but the general distinction is clear. When we can identify particular semantic classes with reasonable reliability, we encode this information as tags within the SGML/XML files. The Perseus XML Document manager processes the tagged texts and images. A linked GIS manages the geospatial data. Many operations are performed on the data, the most important of which establish automatic connections between different and otherwise isolated parts of the collection."
"Management of XML documents in an integrated digital library." By David Smith, Anne Mahoney, and Jeffrey A. Rydberg-Cox (Perseus Project). Presented at Extreme Markup Languages 2000, Friday, August 18, 2000. Published in the proceedings volume, pages 219-224. "Using a variety of DTDs and markup practices eases the coding of individual documents and often achieves a better fit with their intellectual structures, but it can raise barriers to resource discovery within a digital library. We describe a generalized toolset developed by the Perseus Project to manage XML documents in the context of a large, heterogeneous digital library. The system manages multiple DTDs by creating a partial mapping between elements in a DTD and abstract structural elements. The tools then extract and index structural metadata from these documents in order to deliver document fragments on demand, manage document layout, and support linguistic and conceptual analysis such as feature extraction... One of the greatest challenges in building and maintaining a large, heterogeneous DL (digital library) is the necessity of managing documents with widely varying encodings and markup practices. Although the World Wide Web has demonstrated the power of simple links among simple documents, the benefits of more highly structured markup have long been understood. The Perseus digital library project has developed a generalizable toolset to manage XML (Extensible Markup Language) documents of varying DTDs (Document Type Definitions); to extract structural and descriptive metadata from these documents and deliver document fragments on demand; and to support other tools that analyze linguistic and conceptual features and manage document layout. 
In over ten years of creating and managing SGML and now XML data, we have been greatly helped by the generality and abstraction of structured markup, which has allowed us to deliver our content smoothly on a variety of platforms, from standalone CD-ROMs, to custom client-server software, to the World Wide Web. In digitizing historical and scholarly documents, we have also come to appreciate the richness of the implicit and explicit links among printed resources. Our DL system reifies these connections and tries to meet the challenges of automatically generating hypertexts in electronic media. Most often needed in creating a rich hypertext across a digital library are models of the structure of individual documents and descriptions of their content. These models ought to be independent of the particular encodings of those documents. Use of these abstractions allows rapid development of scalable tools for display, linguistic analysis, knowledge management, and information retrieval within the DL system. We describe an engine to leverage the power of XML for this modeling task and some of its applications in building a hypertextual digital library. This document management system is the back end for a production web server that delivers over 2 million pages a week; it went into production early in March, 2000." [Extreme 2000 paper supplied by David Smith.] [cache]
[August 03, 2001] Smith, David A.; Mahoney, Anne; Rydberg-Cox, Jeffrey A. "Managing XML Documents in an Integrated Digital Library." [PROJECT REPORT] In Markup Languages: Theory & Practice 2/3 (Summer 2000) 205-214 (with 21 references). ISSN: 1099-6622. "The Perseus Project developed a generalized toolset to manage XML documents in the context of a large, heterogeneous digital library. The system manages multiple DTDs through mappings from elements in the DTD to abstract document structures. The abstraction of document metadata, both structural and descriptive, facilitates the development of application-level tools for knowledge management and document presentation. Implementation of the XML back end is discussed and applications described for cross citation retrieval, toponym extraction and plotting, automatic hypertext generation, morphology, and word co-occurrence... One of the greatest challenges in building and maintaining a large, heterogeneous DL (digital library) is the necessity of managing documents with widely varying encodings and markup practices. Although the World Wide Web has demonstrated the power of simple links among simple documents, the benefits of more highly structured markup have long been understood. The Perseus digital library project has developed a generalizable toolset to manage XML (Extensible Markup Language) documents of varying DTDs (Document Type Definitions); to extract structural and descriptive metadata from these documents and deliver document fragments on demand; and to support other tools that analyze linguistic and conceptual features and manage document layout. In over ten years of creating and managing SGML and now XML data, we have been greatly helped by the generality and abstraction of structured markup, which has allowed us to deliver our content smoothly on a variety of platforms, from standalone CDROMs, to custom client/server software, to the World Wide Web. 
In digitizing historical and scholarly documents, we have also come to appreciate the richness of the implicit and explicit links among printed resources. Our DL system reifies these connections and tries to meet the challenges of automatically generating hypertexts in electronic media. Most often needed in creating a rich hypertext across a digital library are models of the structure of individual documents and descriptions of their content. These models ought to be independent of the particular encodings of those documents. Use of these abstractions allows rapid development of scalable tools for display, linguistic analysis, knowledge management, and information retrieval within the DL system. We describe an engine to leverage the power of XML for this modeling task and some of its applications in building a hypertextual digital library. This document management system is the back end for a production web server that delivers over 2 million pages a week; it went into production early in March, 2000... We have described an XML document management system for a digital library. This system facilitates development of knowledge management applications including those for display, feature extraction, and automatic hypertext generation. Our DL system facilitates development of these and other applications because it releases the application programmer from the task of indexing collections of documents written in multiple DTDs. Because the modules scale as the DL grows, documents in the integrated DL become more valuable than those existing in isolation." A preliminary version of this article was published in connection with the Extreme 2000 conference paper, "The Management of XML Documents in an Integrated Digital Library" [PDF], or HTML, [cache]
"The Perseus Project and Beyond: How Building a Digital Library Challenges the Humanities and Technology." By Gregory Crane. In D-Lib Magazine (January 1998). "For more than ten years, the Perseus Project has been developing a digital library in the humanities. Initial work concentrated exclusively on ancient Greek culture, using this domain as a case study for a compact, densely hypertextual library on a single, but interdisciplinary, subject. Since it has achieved its initial goals with the Greek materials, however, Perseus is using the existing library to study the new possibilities (and limitations) of the electronic medium and to serve as the foundation for work in new cultural domains: Perseus has begun coverage of Roman and now Renaissance materials, with plans for expansion into other areas of the humanities as well. Our goal is not only to help traditional scholars conduct their research more effectively but, more importantly, to help humanists use the technology to redefine the relationship between their work and the broader intellectual community. Data Structuring and Conversion: This is by far the most important job that we do, since the underlying organizing of any information constrains what people can and cannot do with it. The relational databases, TEI conformant SGML texts, images, and other well defined products will outlast any given delivery system. Much of our work has gone into structuring data -- whether that data has been created for this project or derives from a preexisting source (e.g., print text, archaeological plan or drawing, existing slides). This work can range from standard database programming to elaborate analysis of preexisting data. 
The latter work can be challenging but immensely productive: we have been able to infer enough of the underlying (and thus unmarked) structure from complex reference works (e.g., a 40 mbyte Greek-English Lexicon with more than 500,000 source citations) so that the electronic version becomes, in effect, a fundamentally more useful work than its print counterpart."
"The Perseus Project: A Digital Library for the Humanities." By David A. Smith, Jeffrey Alan Rydberg-Cox, and Gregory Crane. In Literary and Linguistic Computing Volume 15, Number 1 (2000), pages 15-26.
"Introduction to Structured Markup." A two-part reference on markup, with examples from the Perseus Project. From 'The Stoa: A Consortium for Electronic Publication in the Humanities'. See (1) Marking Up a Text and (2) Markup for Philoctetes.
"Electronic Homer." By Martin Mueller. In Ariadne [ISSN: 1361-3200] Issue 25 (September 2000). ['Martin Mueller on the options for reading Homer electronically with the TLG, Perseus, and the Chicago Homer.'] "The Perseus Project provides a bilingual text-and-dictionary web site that provides access to a large chunk of classical and Hellenistic Greek texts...The text-and-dictionary environment of Perseus makes its archive much more accessible to a reader with less expertise in the discipline or the technology. Perseus offers a very special digital environment, currently unmatched for any other substantial linguistic corpus. It contains a large chunk of the surviving texts from archaic, classical, and Hellenistic Greece, many of them derived from the TLG, and all of them accompanied by English translations. Every wordform in the Perseus corpus contains its possible morphological descriptions, and through this morphological parser it is linked to the lemmata or dictionary entry forms in Liddell-Scott-Jones (LSJ), the most authoritative dictionary of ancient Greek. All the citations in the dictionary are in turn linked back to the Perseus corpus. The English equivalent of this would be a digital corpus in which any wordform in much of the literature from Chaucer to Joyce is linked directly to its lemma in the Oxford English Dictionary..."
"Explicit and Implicit Searching in the Perseus Digital Library." By Anne Mahoney (Perseus Project / Tufts University). Presented at Information Doors, a workshop held in conjunction with the ACM Hypertext and Digital Libraries conferences, May 30, 2000, San Antonio, Texas, USA. Also in HTML. [cache]
[October 26, 2000] "The Symbiosis Between Content and Technology in the Perseus Digital Library." By Gregory Crane, Brian Fuchs, Amy C. Smith, and Clifford E. Wulfman (Perseus Digital Library Project). In Cultivate Interactive Issue 2 (16-October-2000). "The Perseus Digital Library already enjoys strong affinities with many projects being developed in Europe today. Mirror sites for Perseus have been maintained in Oxford and Berlin for several years, and we have worked extensively with the Max Planck Institute for the History of Science, Berlin since 1998. Most recently, we have begun to collaborate with the Center for the Study of Ancient Documents and the Beazley Archive at Oxford University as well as with the team at Cambridge now writing a new intermediate Greek Lexicon. European collaborations are natural for us; while most of the technical research in digital libraries being done in the US is readily applicable to European efforts, the Perseus Digital Library Project is unusual in that, technology aside, its efforts to date have focused on a cultural heritage shared by the US and Europe alike. Given the magnitude of the task before us all, such US/European partnerships are essential, and we are eager to expand our ties to colleagues in Europe."
Earlier entry: Perseus Project
"Dēmos: Challenges and Lessons." By Christopher W. Blackwell. In Classics@: The Electronic Journal of the Center for Hellenic Studies of Harvard University Volume 02 (2004), edited by Christopher Blackwell and Ross Scaife. See the project website.

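Illustration: several of the abstracts above describe managing multiple DTDs by "creating a partial mapping between elements in a DTD and abstract structural elements," then indexing structural metadata so that well-formed document fragments can be delivered on demand. The following is only a minimal sketch of that general idea, not the Perseus implementation. The mapping table, the element names (`div1`, `book`, `para`), the use of an `n` attribute as a label, and the function names `index_structure` and `fetch_fragment` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical partial mappings: DTD name -> {element tag -> abstract unit}.
# Tags missing from a mapping are simply not indexed (the mapping is partial).
MAPPINGS = {
    "tei": {"div1": "chapter", "div2": "section", "p": "paragraph"},
    "custom": {"book": "chapter", "para": "paragraph"},
}


def index_structure(xml_text, dtd):
    """Return (abstract_unit, label, element) triples in document order."""
    mapping = MAPPINGS[dtd]
    root = ET.fromstring(xml_text)
    index = []
    for elem in root.iter():
        unit = mapping.get(elem.tag)
        if unit is not None:
            # Assumption: the "n" attribute carries the citation label.
            index.append((unit, elem.get("n"), elem))
    return index


def fetch_fragment(index, unit, label):
    """Serialize the first matching element as a well-formed fragment."""
    for found_unit, found_label, elem in index:
        if found_unit == unit and found_label == label:
            return ET.tostring(elem, encoding="unicode")
    return None


# Two toy documents that follow different (hypothetical) tagging conventions.
TEI_DOC = (
    '<text><div1 n="1"><head>One</head><p n="1">Alpha</p></div1>'
    '<div1 n="2"><p n="1">Beta</p></div1></text>'
)
CUSTOM_DOC = '<work><book n="1"><para n="1">Gamma</para></book></work>'

if __name__ == "__main__":
    tei_index = index_structure(TEI_DOC, "tei")
    custom_index = index_structure(CUSTOM_DOC, "custom")
    # The same abstract request works against either encoding.
    print(fetch_fragment(tei_index, "chapter", "2"))
    print(fetch_fragment(custom_index, "chapter", "1"))
```

Because callers ask for abstract units such as "chapter" rather than DTD-specific tags, display or analysis code written against the index does not need to know which DTD a given document uses; supporting a new DTD in this sketch means adding one more entry to the mapping table.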
