Making lexicography truly digital: the road to DMLex

By Michal Měchura

If your idea of lexicography is that it is an antiquarian pursuit dominated by printed dictionaries and dusty cardfile indexes, then you haven’t been paying attention. Lexicography, like everything else, has now become digital. The digital transformation has changed the discipline’s entire pipeline, from how dictionaries are made to how they are consumed by end-users. We don’t talk about “writing” dictionaries any more; we compile them semi-automatically from language corpora. And when people use a dictionary now, it is almost always in the form of a website or a mobile app. Increasingly, lexicographic products are becoming integrated into the user interfaces of other digital products such as machine translators, writing tools and general search engines.

That doesn’t mean everything is going well, though. One thing that has been holding lexicography back from making full use of the digital medium is its over-reliance on outdated data models. Traditionally, when people represented the structure of dictionary entries in databases and software, the data model they usually reached for was a top-down tree structure: a structure in which an entry consists of a headword and a list of senses, a sense consists of a definition followed by one or more example sentences, and so on. This mental model, with its hierarchy of parents and children, is easy to express in XML and other formal notations. But, at the same time, there are things in dictionaries which are difficult to represent in a tree structure, such as entry-to-entry cross-references, or complex multi-parent hierarchies of entries and subentries, senses and subsenses. This often results in overly complex XML schemas, in duplicated data, and in dictionaries which are “IT-unfriendly”: difficult to work with computationally.
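The tree model described above can be sketched as a nested structure. This is purely an illustration of the traditional parent-child approach; the field names here are invented, not taken from any real schema:

```python
# A hypothetical tree-structured dictionary entry: each node has
# exactly one parent, mirroring the traditional top-down model.
entry = {
    "headword": "bank",
    "senses": [
        {
            "definition": "a financial institution",
            "examples": ["She deposited the cheque at the bank."],
        },
        {
            "definition": "the land alongside a river",
            "examples": ["We walked along the river bank."],
        },
    ],
}

# A cross-reference from one entry to another has no natural place
# in this structure: there is no parent-child path between entries,
# which is one of the limitations the article points out.
```

The neatness of the nesting is exactly what makes the model attractive, and exactly what breaks down once relationships cut across the hierarchy.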

This is where LEXIDMA comes in: the Lexicographic Data Model and API Technical Committee at OASIS. We are a group of IT professionals in lexicography who have joined forces to produce a standardised data model for dictionaries which overcomes these difficulties. It’s been a long journey, but our brainchild DMLex (“Data Model for Lexicography”) has now reached Community Specification status, and we are very proud of what we have come up with.

Ours is obviously not the first attempt to produce a standardised data model or file format for dictionaries. Pretty much all previous attempts can be divided into two groups. On the one hand, we have file formats and XML schemas which firmly belong in the tree-structured tradition, such as the TEI Dictionaries chapter. These are very expressive but come with all the inherent limitations and over-complexity mentioned above. On the other hand, there have been attempts to make lexicographic data more IT-friendly by remodelling dictionaries as graphs or networks, such as Ontolex Lemon (a W3C Community Specification) and Lexical Markup Framework (LMF, an ISO standard). These standards often gain that IT-friendliness at the expense of expressivity elsewhere, so that lexicographers – the “content” people – have been reluctant to use them.

In other words, until DMLex came along, you had a choice between standards which the content people liked but the tech people didn’t, and standards which the tech people liked but the content people didn’t. This is a false dichotomy: it is possible to produce a data model which satisfies the needs of both camps. We believe DMLex is such a model. DMLex has taken inspiration from Ontolex Lemon and LMF in that it remodels some (though not all) aspects of dictionary structure as a graph. But, in addition to that, DMLex has a rich repertoire of content types for representing pretty much everything you will ever want to have in a digital product worthy of the name “dictionary”.
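To see what the graph-style remodelling buys you, compare the tree sketch above with one where entries are nodes with identifiers and relationships are first-class objects linking them. Again, the identifiers and field names below are hypothetical, not the actual DMLex vocabulary:

```python
# A sketch of the graph-style alternative: entries are addressable
# nodes, and relations (such as cross-references between variant
# spellings) live outside any single entry, so no entry has to
# "own" the other.
nodes = {
    "entry-colour": {"headword": "colour"},
    "entry-color": {"headword": "color"},
}
relations = [
    {"type": "variant", "source": "entry-color", "target": "entry-colour"},
]

# Looking up everything related to one entry is now a query over
# the relations, not a walk down a single parent-child hierarchy.
related = [r for r in relations if r["source"] == "entry-color"]
```

The point is not the notation but the shift in ownership: a cross-reference no longer has to be duplicated inside both entries, which addresses the duplication problem mentioned earlier.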

One more thing that makes DMLex different from previous standards in lexicography is that it is not just a file format. DMLex is first and foremost an abstract data model. The first half of the DMLex specification is devoted to describing the data model at an abstract level, independently of any specific markup language or formalism. The second half then proposes five serialisations of DMLex: in XML, in JSON, as a relational database (SQL), as a Semantic Web triplestore, and in a lesser-used markup language called NVH. So you could say that DMLex is a top-down standard: first we define the abstract model, then we “implement” it in specific markup languages and formalisms. This is more implementation-neutral and interoperable than previous standards, which tend to be tied to specific notations and metamodels such as XML or the Semantic Web.
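The idea of one abstract model with several concrete serialisations can be illustrated with a toy example: the same abstract entry rendered both as nested JSON and as flat relational rows. Neither form below follows the actual DMLex serialisations; the structure is invented purely to show the principle:

```python
import json

# One abstract entry (a plain in-memory structure)...
entry = {
    "headword": "cat",
    "senses": [
        {"definition": "a small domesticated felid"},
        {"definition": "a type of whip (historical)"},
    ],
}

# ...serialised as nested JSON...
json_form = json.dumps(entry, indent=2)

# ...and flattened into relational rows, with senses linked back to
# their entry by a foreign key, as an SQL serialisation would do.
entry_rows = [{"entry_id": 1, "headword": entry["headword"]}]
sense_rows = [
    {"sense_id": i + 1, "entry_id": 1, "definition": s["definition"]}
    for i, s in enumerate(entry["senses"])
]
```

Because the abstract model is defined first, the JSON form and the relational form carry the same information and can, in principle, be converted into one another without loss.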

Last but not least, DMLex is the outcome of a consensus between lexicographers and technicians from across the industry, uniting people from academia, from public-service institutes and from businesses in several (admittedly mostly European) countries. Getting to this level of consensus has not been easy but we are very happy with what we have produced. The centuries-old discipline of dictionary-building is now ready for the next step in its digital evolution.

About the Author
Michal Měchura is the Chair of the OASIS LEXIDMA TC and a language technologist with two decades of experience building IT solutions for lexicography, terminology and onomastics. He has worked on projects such as the National Terminology Database for Irish, the Placenames Database of Ireland and the New English–Irish Dictionary. He is the author of the open-source dictionary writing system Lexonomy and the open-source terminology management platform Terminologue.