OASIS Unstructured Information Management Architecture (UIMA) TC

The original Call for Participation for this TC may be found at http://lists.oasis-open.org/archives/uima/200610/msg00000.html.

The charter for this TC is as follows.

Name

OASIS Unstructured Information Management Architecture (UIMA) TC.

Statement of Purpose

Unstructured information represents the largest, most current and fastest-growing source of knowledge available to businesses and governments worldwide. The web is just the tip of the iceberg. Consider the troves of corporate and technical documentation ranging from best practices, research reports, medical abstracts, problem reports, customer communications and contracts to emails and voice mails. Beyond these, consider the growing number of broadcasts containing audio, video and speech. In these mounds of natural language, speech and video artifacts often lie nuggets of knowledge critical for analyzing and solving problems, detecting threats, realizing important trends and relationships, creating new opportunities or preventing disasters.

— Shaving off just seconds per call to find the right technical documentation in call-centers can save millions of dollars.

— Rapidly detecting emerging trends in problem-reports coming in from all over the globe can avoid recalls and save companies and their customers millions if not billions.

— Analyzing SEC reports can help evaluate corporate financial positions.

— Automating the analysis, segmentation and restructuring of educational content to better serve changing skill sets or new learning objectives can save many hours and can better enable just-in-time learning for critical tasks.

— Detecting otherwise unrealized drug interactions through analyzing the linkages buried in millions of medical abstracts can help prevent disaster as well as help discover new drugs or cures.

— Analyzing communications linked to terrorist networks in the form of multi-lingual text, speech or video can help uncover plots threatening national security before they happen.

These are just a few of the applications that can benefit from the exploitation of unstructured information.

Applications like these, which rely on the rapid discovery of vital knowledge, require the analysis of unstructured information. This is all the information that has NOT been carefully encoded in enterprise databases but rather exists as natural language text, speech or video. These applications rely on the rapid assignment of semantics to huge volumes of unstructured content exactly so that this content may be structured and exploited by traditional application infrastructure (e.g., database management systems, knowledgebase systems, information retrieval systems, etc.).

Unstructured information may be defined as the direct product of human communication. Examples include natural language documents, email, speech, images and video. It is information that was not specifically encoded for machines to process but rather authored by humans for humans to understand. We say it is "unstructured" because it lacks explicit semantics ("structure") required for applications to interpret the information as intended by the human author or required by the end-user application.

Unstructured information may be contrasted with the information in classic relational databases, where the intended interpretation of every data field is explicitly encoded in the database by column headings. Consider information encoded in XML as another example. In an XML document, some of the data is wrapped by tags which provide explicit semantic information about how that data should be interpreted. An XML document or a relational database may be considered semi-structured in practice, because the content of some chunk of data, a blob of text in a field labeled "description" for example, may be of interest to an application but remain without any explicit tagging; that is, without any explicit semantics or structure.

For unstructured information to be processed by traditional applications, it must first be analyzed to assign application-specific semantics to the unstructured content. Another way to say this is that the unstructured information must become "structured," where the added structure explicitly provides the semantics required by target applications to interpret the data.

One example of assigning semantics is wrapping regions of text in a text document with appropriate XML tags that identify, say, the names of organizations or products. Another is extracting elements of a document and inserting them into the appropriate fields of a relational database, or using them to create instances of concepts in a knowledgebase. Yet another is analyzing a voice stream and tagging it with information that explicitly identifies the speaker.

A simple analysis may, for example, scan each token in each document of a collection to identify names of organizations. It may wrap every occurrence it finds in a tag identifying it as an organization name and output XML that explicitly annotates each occurrence. An application that manages a database of organizations may then use the structured information produced by this document analysis to populate a relational database.
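To make this concrete, here is a minimal, hypothetical Java sketch of such an analysis. The class name, the tiny organization dictionary and the <org> tag are invented for illustration; this is not part of any UIMA specification or API.

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch only: wraps occurrences of known organization
    // names in XML tags. The dictionary stands in for a real analytic's
    // model; all names here are invented for this example.
    public class OrgAnnotatorSketch {

        private static final List<String> KNOWN_ORGS = List.of("OASIS", "IBM", "W3C");

        public static String annotate(String document) {
            String result = document;
            for (String org : KNOWN_ORGS) {
                // Word boundaries avoid tagging substrings of longer words.
                Pattern p = Pattern.compile("\\b" + Pattern.quote(org) + "\\b");
                result = p.matcher(result)
                          .replaceAll(Matcher.quoteReplacement("<org>" + org + "</org>"));
            }
            return result;
        }

        public static void main(String[] args) {
            // Prints: <org>IBM</org> contributed the framework to <org>OASIS</org>.
            System.out.println(annotate("IBM contributed the framework to OASIS."));
        }
    }

An application could then parse this XML output to populate, for example, an organizations table in a relational database.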

In general, we refer to the act of assigning semantics to a region of some unstructured content (e.g., a document) as "analysis". A software component or service that performs the analysis is an "analytic".

The semantics are captured by an analytic as structured metadata elements. So *analytics* implement operations that produce structured metadata elements describing the regions of unstructured content they analyze. The generated metadata may be represented in many different ways, including as XML tags.

We refer to systems that perform analysis on unstructured information as "Unstructured Information Management (UIM) applications."

UIM applications tend to be highly decomposable; that is, they may be broken down into many fine-grained *analytics*. Each of these performs some constituent function in an overall analysis flow.

Analytics and Analysis Frameworks

Analytics may be reused in different flows to perform different aggregate analyses. Even in our simple example above, a first and very common function in the overall process is to tokenize the document (identify each individual word). This tokenization function may be reused as the first step in many different analysis tasks for many different applications.
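As a sketch of this kind of reuse, the following hypothetical Java code chains two toy analytics, a tokenizer and an organization detector, into one flow over a shared data map. The interface and all names are invented for illustration and do not reflect the UIMA Java Framework's actual API.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical: each analytic reads and adds metadata in a shared map,
    // loosely analogous in spirit to analytics operating over shared
    // analysis data.
    interface Analytic {
        void process(Map<String, Object> data);
    }

    class Tokenizer implements Analytic {
        public void process(Map<String, Object> data) {
            String text = (String) data.get("text");
            data.put("tokens", Arrays.asList(text.split("\\s+")));
        }
    }

    class OrgDetector implements Analytic {
        @SuppressWarnings("unchecked")
        public void process(Map<String, Object> data) {
            List<String> orgs = new ArrayList<>();
            for (String tok : (List<String>) data.get("tokens")) {
                if (tok.equals("IBM") || tok.equals("OASIS")) orgs.add(tok); // toy rule
            }
            data.put("organizations", orgs);
        }
    }

    public class FlowSketch {
        public static void main(String[] args) {
            Map<String, Object> data = new LinkedHashMap<>();
            data.put("text", "IBM and OASIS collaborate");
            // The same Tokenizer could begin many different flows.
            List<Analytic> flow = List.of(new Tokenizer(), new OrgDetector());
            for (Analytic a : flow) {
                a.process(data);
            }
            System.out.println(data.get("organizations")); // prints [IBM, OASIS]
        }
    }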

Many software frameworks have been developed in support of building and integrating component analytics (e.g., Gate, Catalyst, Tipster, Mallet, Talent, Open-NLP, etc.). However, no clear standard has emerged for enabling the interoperability of analytics across modalities (text, audio, video, etc.), frameworks and programming platforms in support of developing robust and pluggable UIM applications.

The UIMA Java Framework is an implementation that arguably comes closest to addressing the breadth of these requirements. It was originally developed as part of the UIMA project at IBM Research (http://www.ibm.com/research/uima). It provides a common, object-oriented and extensible means for representing unstructured information and its metadata, a set of basic interface definitions for implementing interoperable analytics, and a Java run-time supporting the composition and deployment of Java and C++ analytics.

The UIMA Java Framework was released in late 2004 as part of the UIMA Software Developers Kit (SDK) on IBM AlphaWorks (http://www.ibm.com/alphaworks/tech/uima). The SDK is freely available and provides the tools and run-time necessary for creating, composing and deploying component analytics. These may be implemented by the developer to analyze and assign semantics to multi-modal data including, for example, combinations of text, audio and video.

In early 2006 IBM contributed the UIMA Java Framework to the open-source community through SourceForge (http://uima-framework.sourceforge.net/). The open-source project will soon be managed in a venue where IBM and non-IBM committers can participate in its collaborative development. Since the framework's posting, there have been over 8000 downloads by industry, government and academia. It has been included in IBM Information Management products and used in many solutions in areas ranging from life sciences to national security to customer relationship management.

The Need for a Standard Specification

The UIMA Java Framework is an implementation tied to a particular programming model and platform. It makes many system-level commitments based on a variety of design points. This implementation, however, suggests a more general specification for interoperability that may allow for different framework implementations and different levels of compliance, supporting interoperability for a broader range of application and programming requirements.

We propose to develop the UIMA Specification to explicitly define standard data specifications, operation types and communication protocols to facilitate interoperability of analytics at the data and services level.

This level of specification will serve a critical role in helping to facilitate lighter-weight interoperability across a broader spectrum of platforms, programming models, applications and tools for text and multi-modal analytics.

The intent is that the standard will allow different frameworks to emerge, while also allowing applications built on different implementations to have a standard means to share analysis data and services. It will lower the barrier for component and application developers to interoperate at different levels allowing a broader community to discover, reuse and compose a growing body of text and multi-modal analytics.

Scope of the TC's work

The scope of the work of the TC is to generalize from the published UIMA Java Framework implementation and produce a platform-independent specification in support of the interoperability, discovery and composition of analytics across modalities, domain models, frameworks and platforms.

Specifically, the TC is to consider an initial draft contributed by IBM in the research report, based on the UIMA project, entitled "Towards an Interoperability Standard for Text and Multi-Modal Analytics". This report should be used as a straw man to scope, develop and rationalize a formal UIMA specification.

The TC will address three primary tasks:

  1. Elements of the Specification
  2. Related Issues, Requirements and Standards
  3. Higher-Level Documentation

Elements of the Specification

The committee will be charged with evaluating, extending, modifying and refining the proposed eight (8) elements of the UIMA specification. These elements depend on other standards, including UML, EMOF, Ecore, XML Schema, XMI, OCL, WSDL and SOAP.

  1. Common Analysis Structure (CAS) Specification. Provides a simple and extensible typed model for representing analysis data as a standard object model that may be easily instantiated and manipulated in object-oriented programming systems. This element of the specification is provided as a UML model. We propose adopting the XML Metadata Interchange (XMI) specification (http://www.omg.org/docs/formal/03-05-02.pdf) to provide a standard means for representing analysis data as an XML document.
  2. Type System Language Specification. Provides a standard means for associating object model semantics with artifact metadata that complies with object modeling standards. We propose to use Ecore as the Type System language. Ecore is the modeling language used in the Eclipse Modeling Framework and is tightly aligned with the OMG's EMOF standard (http://dev.eclipse.org/viewcvs/indextools.cgi/*checkout*/org.eclipse.emf/doc/org.eclipse.emf.doc/references/overview/EMF.html). An Ecore Type System is represented as an XMI document to support the XML-based representation and interchange of Type Systems.
  3. Type System Base Model. Provides a standard and extensible set of domain-independent types generally useful for analyzing unstructured information.
  4. The Behavioral Metadata Specification. Provides a standard declarative means for describing the capabilities of analysis operations in terms of what types of CASs they can process, what elements in a CAS they can analyze, and what sorts of effects they would have on CAS contents as a result. Behavioral metadata would be used to assist in the discovery and composition of analytics based on their described function. We propose appealing to the OCL standard (http://www.omg.org/technology/documents/formal/ocl.htm) to represent behavioral metadata.
  5. Analytic Metadata Specification. Provides a standard declarative means for describing identification, configuration and behavioral information about analytics. This specification may be represented as a UML Model from which an XML Schema may be generated. It refers to the Behavioral Metadata Specification to represent an analytic's behavioral information.
  6. Aggregate Analytic Metadata Specification. Provides a standard declarative means for an aggregate analytic to: (a) refer to its constituent analytics; (b) identify a flow controller, which determines the order in which the constituent analytics of the aggregate are invoked on a CAS; and (c) define mappings to facilitate the composition of independently-developed analytics.
  7. Abstract Interfaces. Abstractly describes the interfaces to the two different types of components or services that developers may implement, namely, Analytics and Flow Controllers. These abstract interfaces may be specified with a UML model (see the sketch following this list).
  8. Service Descriptions and SOAP Bindings. Provide a standard means for implementing Analytics and Flow Controllers as web services using SOAP. This specification may be represented using WSDL (http://www.w3.org/TR/wsdl20/).
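As a rough illustration of the Abstract Interfaces element, the following hypothetical Java signatures sketch the two component types. The specification itself describes these abstractly (e.g., as a UML model); the method names and types below are invented for this sketch and are not the normative interfaces.

    // Hypothetical sketch; not the normative interfaces.
    interface CAS { /* typed analysis data, per the CAS Specification */ }

    interface Analytic {
        // Analyze a CAS, possibly adding metadata describing its content.
        Iterable<CAS> process(CAS cas);
    }

    interface FlowController {
        // Decide which constituent analytic of an aggregate should
        // receive the CAS next; null means the flow is complete.
        String nextAnalyticFor(CAS cas);
    }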

Related Issues, Requirements and Standards

In addition, the UIMA TC will be charged with providing recommendations regarding how other requirements should or should NOT be addressed by, or related to, the UIMA specification, including:

  1. CAS representations for efficient stream operations
  2. Representing and Recording Provenance Information
  3. Privacy and Security Issues
  4. General alignment with ontologies and related representational standards including OWL and RDF
  5. Facilities for mapping between metadata models (e.g., XSLT)
  6. Support for existing metadata models and their representations (VoiceXML, LegalXML, MPEG-7, etc.)
  7. Componentization, life-cycle management and related standards (e.g., OSGi)
  8. Discovery services in support of finding analytics based on identification and behavioral metadata
  9. Analytic configuration management

Higher-Level Documentation

The UIMA TC should produce higher-level documentation to help motivate and promote the UIMA specification as a standard that may include use-cases, case-studies and high-level architectural descriptions but excludes detailed formalizations.

Out of Scope

Finally, the UIMA TC will NOT address platform-dependent specifications, including the definition of programming models or object-oriented APIs, the binding of interfaces to any particular programming language, workflow engines or languages, or the implementation or integration of system middleware services to address the scalability, componentization or life-cycle management of framework implementations. The UIMA TC will NOT define any specific domain model (e.g., set of XML tags or types) for marking up unstructured information.

Deliverables

  1. Initial Use Cases — 2Q 2007
  2. The CAS Model — 3Q 2007
  3. The CAS XMI Specification — 3Q 2007
  4. The Type System Language — 3Q 2007
  5. The Type System Base Model — 3Q 2007
  6. Behavioral Metadata — 4Q 2007
  7. Analytic Metadata — 4Q 2007
  8. Aggregate Analytic Metadata — 4Q 2007
  9. Abstract Interfaces — 4Q 2007
  10. Service WSDL Descriptions — 4Q 2007
  11. Recommendations regarding related requirements — 4Q 2007
  12. Appendix: SOAP Bindings — 4Q 2007
  13. Appendix: Java Framework Compliance Notes — 4Q 2007
  14. Appendix: Design Patterns — 4Q 2007

Anticipated audience

  1. UIMA Java Framework developers
  2. Text Analysis Vendors
  3. Search and Knowledge Discovery Vendors
  4. Document Management Vendors
  5. Video and Speech Analysis Vendors
  6. Machine Translation Vendors
  7. Government Contractors
  8. US and other Government agencies
  9. R&D in Life Sciences and Bioinformatics
  10. Universities performing research in text & multi-modal analytics
  11. Publishing

Language

The TC will conduct its business in English. The TC may elect to form subcommittees that produce localized documentation of the TC's work in additional languages.
