Autotaggers are a group of products that take `un-tagged' data streams and tag them according to a set of rules. Two primary techniques are employed. In the first, the software builds a document `page' using whatever rules and implicit `tags' exist within the data stream, then tries to work out what that page should be in terms of the required SGML tags. For example, text that is centred, in bold and at the head of the page could be taken to be a new chapter title, and so on. Autotaggers that apply this principle are more useful for the tagging of scanned material. The second technique is more akin to the general software converters (see section on transformation tools below) and uses whatever `hidden' tags there are within the input data stream to indicate the start and end of textual and graphical constructs. Autotaggers, in general, come into their own when large numbers of archived documents, which may or may not be available in electronic form, have to be converted.
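The first technique can be illustrated with a minimal Python sketch. The `Block` fields, the `guess_tag` function, the 0.15 position threshold and the element names are all assumptions made for illustration; they do not describe the behaviour of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block as a layout analyser might describe it (hypothetical fields)."""
    text: str
    centred: bool
    bold: bool
    top_fraction: float  # vertical position: 0.0 = top of page, 1.0 = bottom


def guess_tag(block: Block) -> str:
    """Guess an SGML element name from layout cues alone."""
    # Centred, bold text near the head of the page is taken to be a chapter title.
    if block.centred and block.bold and block.top_fraction < 0.15:
        return "chapter-title"
    # Assumption: bold text elsewhere is a lower-level heading.
    if block.bold:
        return "heading"
    return "para"


def tag_block(block: Block) -> str:
    """Wrap the block's text in the guessed start and end tags."""
    tag = guess_tag(block)
    return f"<{tag}>{block.text}</{tag}>"
```

A heuristic of this kind is only as good as the consistency of the layout it is applied to, which is why such tools let the user supply their own recognition rules.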
Features include a windowed interface; style-based recognition; support for embedded graphics and the ability to read in a compound document (text and graphics); the ability to direct output to more than one file at a time, e.g. graphics to one file, text to another, and tables to another; greater cross-referencing capability; and treatment of tables as a text object category that can be named and described.
It works by means of a Visual Recognition Engine, which builds in memory a representation of each page in the input file and then attempts to recognise the different objects (tables, graphics, paragraphs, lists, headers, etc.) which appear on the page. Recognition of some types of object (text objects, tables and graphics) is hard-coded into the system. Other object recognition rules are provided by the user through a declarative language (INSPEC), which allows the user to separate out different types of text objects such as paragraphs with different indents, lists, headings, etc. The user specifies how the recognised objects should be processed for output via a procedural language (LOUISE).
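The division of labour between declarative recognition rules and procedural output handling can be sketched in Python. This is not INSPEC or LOUISE syntax; the rule table, the property names and the handler functions below are hypothetical and only mirror the two-stage idea.

```python
# Stage 1: declarative recognition rules, tried in order (first match wins).
# Each rule names an object type and the layout properties it must have.
RULES = [
    ("heading",   {"bold": True}),
    ("list-item", {"starts_with_bullet": True}),
    ("quote",     {"indent": 4}),
    ("paragraph", {}),  # catch-all: an empty condition set matches anything
]


def recognise(obj: dict) -> str:
    """Return the name of the first rule whose conditions all hold for obj."""
    for name, conditions in RULES:
        if all(obj.get(key) == value for key, value in conditions.items()):
            return name
    raise LookupError("no rule matched")


# Stage 2: procedural output handlers, one per recognised object type.
def emit_heading(obj): return f"<h>{obj['text']}</h>"
def emit_item(obj): return f"<item>{obj['text'].lstrip('*- ')}</item>"
def emit_quote(obj): return f"<quote>{obj['text']}</quote>"
def emit_para(obj): return f"<p>{obj['text']}</p>"


HANDLERS = {
    "heading": emit_heading,
    "list-item": emit_item,
    "quote": emit_quote,
    "paragraph": emit_para,
}


def convert(objects: list) -> str:
    """Recognise each object, then hand it to the matching output handler."""
    return "\n".join(HANDLERS[recognise(o)](o) for o in objects)
```

Keeping the rules as data and the output logic as code means that recognition can be tuned without touching the output routines, and vice versa.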
The great strengths of FastTAG are that:
It is generally a powerful and useful tool, though some restrictions are imposed both by the theoretical approach of "Visual Recognition" and by the way in which the product has been implemented.
One fairly obvious restriction is that the input file must be in one of the formats supported by the product. It currently supports most of the major WP formats, though the suppliers should be contacted for an up-to-date list. Other restrictions are imposed by functional limitations in the LOUISE language used for specifying the processing actions to be performed, and by the way that recognition of certain types of objects is hard-coded into the system. The distinction between simple tables and lists can sometimes be subtle, and if FastTAG decides that something is a table then it must be manipulated by the functions available in LOUISE for handling tables, even if the user would prefer to treat it as a list (or other text object), for which more extensive handling functions are available.
There is also no SGML parser within FastTAG, so if it is being used to create SGML files then a separate parsing pass is required afterwards. The accuracy of the SGML files produced is then a function of how well the LOUISE program has been written and how consistent the input file is. There are no inbuilt functions for helping the user to ensure that the output file is indeed valid SGML, or for keeping track of the SGML structure. In order to make structure-dependent output decisions it is necessary to implement your own stack management routines in LOUISE.
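The kind of stack management involved might look like the following. It is written in Python rather than LOUISE, and the class and method names are hypothetical. It tracks the currently open elements so that output decisions can depend on context and end tags are emitted in the right order; it does not replace validation against a DTD by a real SGML parser.

```python
class ElementStack:
    """Tracks currently open SGML elements while output is generated."""

    def __init__(self):
        self._open = []   # names of open elements; innermost is last
        self.output = []  # pieces of generated markup and text

    def open(self, name: str) -> None:
        self._open.append(name)
        self.output.append(f"<{name}>")

    def close(self, name: str) -> None:
        """Close `name`, first closing any elements still open inside it."""
        if name not in self._open:
            raise ValueError(f"cannot close <{name}>: it is not open")
        while True:
            top = self._open.pop()
            self.output.append(f"</{top}>")
            if top == name:
                break

    def inside(self, name: str) -> bool:
        """True if an element called `name` is open at any depth."""
        return name in self._open

    def text(self, data: str) -> None:
        self.output.append(data)

    def finish(self) -> str:
        """Close everything still open and return the generated markup."""
        while self._open:
            self.output.append(f"</{self._open.pop()}>")
        return "".join(self.output)
```

With such a structure, a handler can ask `inside("list")` before deciding whether a recognised paragraph should become a list item or a free-standing paragraph.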
Error reporting is not as comprehensive as it could be, particularly for syntax errors within the LOUISE file. As an example, a common keyboard error is to type the assignment operator "=" when the relational equality operator "==" was intended, for example in:
    if (current_element = "xxxx")

Such a mistake results in the error message

    Error: Failure in markup stage (Read error on LOUISE intermediate file)

No output file or error file is created, so it can be extremely difficult to track down the source of such errors in large program files.
The user documentation, however, is comprehensive, comprising a User's Guide, Reference Guide, Tutorial Guide and Advanced Topics Guide. Although comprehensive, the documentation is not always clear. For example, the subtle distinction between "TITLE EXISTS specifies whether the text object must or must not have a title" and "TITLE MUST EXIST specifies whether the text object is required to have a title" is beyond this user.
If the input files are in a format supported by FastTAG then its use can save a great deal of time over alternative approaches to document conversion. If the input files are in a non-supported format, such as TeX, or in a structure-based rather than layout-based markup language (such as SGML), then FastTAG is not the most suitable tool for the job, and an alternative approach based upon a lexical analyser linked to an SGML parser is likely to prove more useful and powerful.
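As a rough illustration of the lexical-analyser half of that alternative, the Python sketch below tokenises a tiny, hypothetical subset of LaTeX-style markup and maps each token to an SGML tag. The token patterns, the element names and the `to_sgml` function are assumptions chosen for the example; unrecognised commands are not handled, and the output would still need to be passed through an SGML parser for validation.

```python
import re

# Token patterns for a small, illustrative subset of LaTeX-style markup.
TOKEN_RE = re.compile(r"""
    \\section\{(?P<section>[^}]*)\}   # \section{Title}
  | \\begin\{itemize\}(?P<begin>)     # start of a list
  | \\end\{itemize\}(?P<end>)         # end of a list
  | \\item\s*(?P<item>[^\\\n]*)       # \item followed by its text
  | (?P<text>[^\\\n]+)                # a run of plain text
""", re.VERBOSE)


def tokens(source: str):
    """Yield (kind, value) pairs; newlines between tokens are skipped."""
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        value = match.group(kind).strip()
        if kind == "text" and not value:
            continue  # ignore whitespace-only text
        yield kind, value


def to_sgml(source: str) -> str:
    """Translate the token stream into SGML-style markup, one item per line."""
    out = []
    for kind, value in tokens(source):
        if kind == "section":
            out.append(f"<title>{value}</title>")
        elif kind == "begin":
            out.append("<list>")
        elif kind == "end":
            out.append("</list>")
        elif kind == "item":
            out.append(f"<item>{value}</item>")
        else:
            out.append(f"<p>{value}</p>")
    return "\n".join(out)
```

Because the input already carries explicit structural markup, no visual recognition step is needed; the conversion reduces to mapping one set of markers to another and then checking the result.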