Section 2.1: Autotaggers

Autotaggers are a group of products that take `un-tagged' data streams and tag them according to a set of rules. Two primary techniques are employed. In the first, the software builds a document `page' using whatever rules and implicit `tags' exist within the data stream, then tries to work out what that page should be in terms of the required SGML tags. For example, text that is centred, in bold and at the head of the page could be taken to be a new chapter title, and so on. Autotaggers that apply this principle are the more useful for tagging scanned material. The second technique is more akin to that of the general software converters (see section on transformation tools below) and uses whatever `hidden' tags there are within the input data stream to indicate the start and end of textual and graphical constructs. Autotaggers, in general, come into their own when large numbers of archived documents, which may or may not be available in electronic form, have to be converted.
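
The first technique can be illustrated with a minimal sketch. It is written in Python purely for illustration and does not reproduce any product described in this section: the `Line` record, its fields, the functions `guess_tag` and `tag_page`, and the element names `title` and `para` are all assumptions. The only rule encoded is the chapter-title example given above.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of a reconstructed page (hypothetical structure)."""
    text: str
    bold: bool = False
    centred: bool = False
    at_head_of_page: bool = False


def guess_tag(line: Line) -> str:
    """Guess an element name from layout cues alone."""
    # Rule from the text: centred, bold text at the head of the page
    # is taken to be a chapter title.
    if line.centred and line.bold and line.at_head_of_page:
        return "title"
    # Anything else falls back to an ordinary paragraph in this sketch.
    return "para"


def tag_page(lines):
    """Wrap each line in the element name guessed for it."""
    tagged = []
    for line in lines:
        name = guess_tag(line)
        tagged.append(f"<{name}>{line.text}</{name}>")
    return tagged


page = [
    Line("Autotaggers", bold=True, centred=True, at_head_of_page=True),
    Line("Autotaggers tag un-tagged data streams."),
]
print("\n".join(tag_page(page)))
```

A real autotagger would need many such rules, and some way of resolving conflicts between them. The sketch shows only how layout cues, rather than explicit markup, can drive the choice of tag.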


Product:
FastTAG v1.2
Associated Products:
SGML Hammer
Developer:
Avalanche Development Company, Boulder, CO, USA
UK Supplier(s):
Interleaf UK Ltd.
Price:
$3,510 (PC versions), $4,030 (others) plus 20% per annum maintenance
Platforms:
PC/DOS, PC/Windows 3.1, Sun/SPARC and Solaris, popular Unix
Description:
FastTAG 1.2 is a new version of FastTAG, Avalanche's flagship document conversion and management product. FastTAG is used to recognise different kinds of structures in documents, including headings, lists, notes, tables, and other kinds of document objects.

Features include a windowed interface; style-based recognition; support for embedded graphics and the ability to read in a compound document containing both text and graphics; the ability to direct output to more than one file at a time (e.g. graphics to one file, text to another, and tables to another); greater cross-referencing capability; and treatment of tables as a text object category that can be named and described.

Assessment:
FastTAG is used for converting documents which exist in one of its supported WP formats into some other user-defined form, which may be SGML or could equally be any other document representation language (TeX, troff, RTF, etc.).

It works by means of a “Visual Recognition Engine”, which builds in memory a representation of each page in the input file and then attempts to recognise the different objects (tables, graphics, paragraphs, lists, headers, etc.) which appear on the page. Recognition of some types of object (text objects, tables and graphics) is hard-coded into the system. Other object recognition rules are provided by the user through a declarative language (INSPEC), which allows the user to separate out different types of text objects such as paragraphs with different indents, lists, headings, etc. The user specifies how the recognised objects should be processed for output via a procedural language (LOUISE).

The great strengths of FastTAG are that:

  • It doesn't care how a particular formatted result was achieved. All that matters is what it ended up looking like. So the programmer doesn't have to take all the usual inconsistencies into account. If something looks like a list element (with a bullet and a hanging indent), then it can be classified as such no matter what combination of hard and soft returns, tabs, spaces and indents was actually used to create it.
  • It removes from the user the need to understand the internal format of the input file, which would otherwise be required if a lexical analyser-based conversion program were to be used instead.
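
The first of these strengths can be sketched in a few lines. The example below is Python, not INSPEC, and it is only a rough illustration of classifying by appearance. The function names, the set of bullet characters and the tab width of 8 are assumptions, and FastTAG's actual recognition rules are not documented here.

```python
BULLETS = ("•", "-", "*")  # assumed bullet characters


def indent_of(line: str) -> int:
    """Number of leading spaces on an already tab-expanded line."""
    return len(line) - len(line.lstrip(" "))


def looks_like_list_item(raw_lines, tabsize: int = 8) -> bool:
    """True if the lines render as a bullet followed by a hanging indent."""
    if not raw_lines:
        return False
    # Work on the rendered form: tabs become spaces, so the keystrokes
    # used to reach a given column no longer matter.
    lines = [line.expandtabs(tabsize) for line in raw_lines]
    first = lines[0].lstrip(" ")
    if not first.startswith(BULLETS):
        return False
    after_bullet = first[1:]
    gap = len(after_bullet) - len(after_bullet.lstrip(" "))
    text_column = indent_of(lines[0]) + 1 + gap
    # Hanging indent: every continuation line starts at the text column.
    return all(indent_of(line) == text_column for line in lines[1:])


typed_with_tabs = ["*\tfirst line of the item", "\tcontinuation line"]
typed_with_spaces = [
    "*" + " " * 7 + "first line of the item",
    " " * 8 + "continuation line",
]
plain_paragraph = ["An ordinary paragraph", "with no bullet."]

print(looks_like_list_item(typed_with_tabs))
print(looks_like_list_item(typed_with_spaces))
print(looks_like_list_item(plain_paragraph))
```

The two list items are typed differently but occupy the same columns once rendered, so they receive the same classification.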

It is generally a powerful and useful tool, though some restrictions are imposed both by the theoretical approach of "Visual Recognition" and by the way in which the product has been implemented.

One restriction is fairly obvious: the input file must be in one of the formats supported by the product. It currently supports most of the major WP formats, though the suppliers should be contacted for an up-to-date list. Other restrictions are imposed by functional limitations in the LOUISE language used for specifying the processing actions to be performed, and by the way that recognition of certain types of objects is hard-coded into the system. The distinction between simple tables and lists can sometimes be subtle, and if FastTAG decides that something is a table then it must be manipulated by the functions available in LOUISE for handling tables, even if the user would prefer to treat it as a list (or other text object), for which more extensive handling functions are available.

There is also no SGML parser within FastTAG, so if it is being used to create SGML files then a post-process parsing pass is also required. The accuracy of the produced SGML files is then a function of how well the LOUISE program has been written and how consistent the input file is. There are no inbuilt functions for helping the user to ensure that the produced output file is indeed valid SGML, or for keeping track of the SGML structure. In order to make structure-dependent output decisions it is necessary to implement your own stack management routines in LOUISE.
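
The kind of stack management meant here can be sketched as follows. The example is in Python rather than LOUISE, whose syntax is not reproduced in this report, and the class `ElementStack`, its methods and the element names are hypothetical. It keeps track of which elements are open; it does not validate against a DTD, so a separate parsing pass would still be needed.

```python
class ElementStack:
    """Tracks which elements are currently open while output is generated."""

    def __init__(self):
        self._open = []    # names of open elements, innermost last
        self.output = []   # markup emitted so far

    def current(self):
        """Name of the innermost open element, or None if none is open."""
        return self._open[-1] if self._open else None

    def open(self, name):
        self._open.append(name)
        self.output.append(f"<{name}>")

    def close(self):
        """Close the innermost open element and return its name."""
        name = self._open.pop()
        self.output.append(f"</{name}>")
        return name

    def close_through(self, name):
        """Close open elements up to and including the nearest `name`."""
        if name not in self._open:
            raise ValueError(f"no open <{name}> element")
        while self.close() != name:
            pass

    def text(self, data):
        self.output.append(data)


stack = ElementStack()
stack.open("chapter")
stack.open("list")
stack.open("item")
stack.text("first point")
# A recognised heading arrives: a structure-dependent decision closes
# the list (and the item inside it) before the heading is written.
if stack.current() == "item":
    stack.close_through("list")
stack.open("head")
stack.text("Next heading")
stack.close()
stack.close_through("chapter")
print("".join(stack.output))
```

Because every end tag is generated from the stack, the output is at least properly nested, which is the minimum needed before handing the file to an SGML parser.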

Error reporting is not as comprehensive as it could be, particularly for syntax errors within the LOUISE file. A common keyboard error, for example, is to type the assignment operator "=" where the relational equality operator "==" was intended, as in:

    if (current_element = "xxxx")

Such a mistake results in the error message:

    Error: Failure in markup stage (Read error on LOUISE intermediate file)

No output file or error file is created, so it can be extremely difficult to track down the source of such errors in large program files.
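
One possible workaround is to scan the program file for this particular slip before running it. The Python sketch below is not part of FastTAG. It assumes that conditions are written as `if (...)` on a single line, as in the fragment above, and that "!=", "<=" and ">=" are the only other operators containing "=", neither of which is confirmed by the product documentation. The function name `suspicious_lines` is hypothetical.

```python
import re

# A lone "=" that is not part of "==", "!=", "<=" or ">=".
LONE_EQUALS = re.compile(r"(?<![=!<>])=(?!=)")
IF_CONDITION = re.compile(r"\bif\s*\(")


def suspicious_lines(source: str):
    """Return (line number, text) pairs for `if (...)` lines with a lone "="."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = IF_CONDITION.search(line)
        # Only look for the lone "=" after the opening parenthesis.
        if match and LONE_EQUALS.search(line, match.end()):
            hits.append((number, line.strip()))
    return hits


sample = '''if (current_element = "xxxx")
if (current_element == "xxxx")
'''
for number, line in suspicious_lines(sample):
    print(f"line {number}: possible '=' for '==': {line}")
```

A check this crude will also flag an "=" inside a quoted string, so its output is a list of lines to inspect and not a list of definite errors.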

The user documentation, however, is extensive, comprising a User's Guide, Reference Guide, Tutorial Guide and Advanced Topics Guide. Although comprehensive, it is not always clear. For example, the subtle distinction between "TITLE EXISTS specifies whether the text object must or must not have a title" and "TITLE MUST EXIST specifies whether the text object is required to have a title" is beyond this user.

If the input files are in a format supported by FastTAG then its use can save a great deal of time over alternative approaches to document conversion. If the input files are in an unsupported format, such as TeX, or in a structure-based rather than layout-based markup language (such as SGML), then FastTAG is not the most suitable tool for the job, and an alternative approach based upon a lexical analyser linked to an SGML parser is likely to prove more useful and powerful.


Product:
DynaTag
Associated Products:
DynaText, DynaBase, CADLeaf Batch
Developer:
Electronic Book Technologies Inc.
UK Supplier(s):
Database Publishing Systems Ltd
Price:
n/a
Platforms:
Set-up application on MS Windows; batch converter on MS Windows and Unix.
Description:
DynaTag is a software tool to convert proprietary word processing documents into DynaText electronic documents and style sheets. DynaTag directly accepts as input the recently developed Rainbow SGML format, as well as a variety of proprietary word-processing file formats (MS Word RTF, WordPerfect RTF, Frame MIF, Interleaf ASCII). Users are provided with an intuitive point-and-click conversion set-up, and the complexities of the SGML syntax are hidden from them. CADLeaf Batch is a graphics conversion tool that converts proprietary graphics files into open, standard graphics formats. Users of CADLeaf Batch can select multiple input and output file formats, set options, save log files, and organise the translations into "jobs" that can be scheduled, providing true batch process control.
Assessment:
No assessment is available, as the product was only announced earlier this year, and no deliveries have been made.

Product:
ADEPT•PowerPaste
Associated Products:
ADEPT Series products
Developer:
ArborText, Inc. (USA)
UK Supplier(s):
Texcel UK Ltd
Price:
n/a
Platforms:
Most Unix
Description:
Used in conjunction with front-end auto-tagging software, ADEPT•PowerPaste provides a powerful yet flexible mechanism for converting legacy information in a variety of formats into fully compliant SGML documents. ADEPT•PowerPaste presents the user with side-by-side edit windows, with the source document, as output by the front-end auto-tagging software, in one window and the fully tagged destination document in the other. Through intelligent "structure aware" interaction with the window that contains the source document, successive portions of text are pasted into the window containing the destination document with complete SGML tagging added. Intelligence is built into the system to recognize the particular structures defined in the DTD, resulting in a process that is automated to a high degree.
Assessment:
No assessment has been undertaken.