Global Information Management Metrics Volume (GMX-V) 1.0 Specification

Draft 0.7 21/02/2006

This version:
http://www.xml-intl.com/docs/specification/GMX-V.html
Editors:
Andrzej Zydroń <azydron@xml-intl.com>
David Walters <waltersd@us.ibm.com>
Copyright © The Localization Industry Standards Association [LISA] 2005. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to LISA.

The limited permissions granted above are perpetual and will not be revoked by LISA or its successors or assigns.


Abstract

This document defines the LISA Global Information Management Metrics eXchange Volume (GMX-V) specification. The purpose of this vocabulary is to define the metrics that allow for the unambiguous sizing of a given Global Information Management task. GMX-V is one of the tripartite Global Information Management standards which encompass volume (GMX-V), complexity (GMX-C) and quality (GMX-Q).

Status of this Document

This document constitutes an initial draft for discussion.

This document and the information contained herein is provided on an "AS IS" basis and LISA DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Table of Contents

1. Introduction
2. Key Concepts
2.1. Text Unit
2.2. Canonical Form
2.3. Unicode (ISO 10646)
2.4. Word Boundaries
2.5. Verifiable and Non-Verifiable Metrics
2.6. Inline Element Transparency
2.7. White Space Characters
2.8. Words
2.9. Characters
2.10. Punctuation Characters
2.11. Inline Element Counts
2.12. Linking Inline Elements
2.13. Logographic Scripts
2.14. Qualitative Text Unit Categorization
2.15. Unqualified Text Units
2.16. Translatable Text Counts
2.17. XML Entity References
2.18. User Defined Entity References
2.19. Auto Text
2.20. Repetition Counts
2.21. Current Commercial Practice
3. Counts
3.1. Word Count Categories
3.2. Auto Text Word Count Categories
3.3. Character Count Categories
3.4. Auto Text Character Count Categories
3.5. Inline Element Count Categories
3.6. Linking Inline Element Count Categories
3.7. Text Unit Counts
3.8. Other Count Categories
3.9. Conformance
3.10. Validation
4. General Structure
4.1. Metrics Element
4.2. Stage Element
4.3. Notes Element
4.4. Count Group Element
4.5. Count Element
5. Detailed Specification
5.1. GMX-V Namespace Declaration
5.2. Elements
5.2.1. Main Metrics Element
5.2.2. Stage Element
5.2.3. Notes Element
5.2.4. Count Group Element
5.2.5. Count Elements
5.3. Attributes
5.3.1. GMX-V Attributes

Appendices

A. GMX-V Document Structure
B. GMX-V Document Type Definition and Schema
C. References
D. Glossary

1. Introduction

GMX-V addresses the issue of quantifying the workload for a given localization or translation task. This is commonly referred to as word counts. Word counts, however, do not convey the true range of possible metrics that can be used to assess the cost of localizing a document, such as the number of screen shots for a software localization project, or page counts for a document layout task. GMX-V is a more precise definition of the metrics required for billing and sizing purposes.

In defining GMX-V, care has been taken to provide a definitive and unambiguous definition that also offers the widest possible scope for achieving an adequate description of the task load for a given Global Information Management project.

Metrics fall into two categories: directly verifiable from the file contents (words and characters) and unverifiable (pages, file units, lines, etc.). Some metrics may fall into both categories, depending on the circumstances. For instance, page counts may be verifiable from the file contents under some circumstances; however, under other circumstances, only a printout for a given page format can provide the basis for a count.

GMX-V does not preclude the existence of unverifiable counts, but is concerned with defining precise rules for verifiable counts based on the file unit that is being counted.

To this end, it is proposed that GMX-V cover both word and character counts, as well as allowing for other relevant count categories that cannot be verified electronically. Character counts convey the most precise definition of a translation task, whereas word counts are the most commonly used metric in the translation industry. GMX-V should encompass both measurements, thus affording both translation suppliers and customers a choice as to which measurement most adequately reflects the translation task in question, as well as allowing for other relevant metrics.

From the implementation point of view, GMX-V is designed to co-exist as a namespace within other XML documents. The main targets will be XLIFF and Translation Web Services WSDL compliant documents, but any XML document that allows namespace extension points could host the GMX-V namespace.

GMX-V has therefore the following aims:

  1. To provide an unambiguous specification for counting words and characters for translation related tasks.
  2. To provide a rich set of qualifiers to help accurately define the actual translation workload for translation related tasks.
  3. To provide an XML notation for exchanging Global Information Management metrics for any Global Information Management task whether it entails translation activity or not.

2. Key Concepts

The following concepts are fundamental to the GMX-V specification:

2.1. Text Unit

A document is made up of a number of text units. A text unit is either a stand alone piece of text within a document, or a subdivision of a stand alone piece of text into recognizable sentences. Document metrics will be based on an accumulation of the word and character statistics of the individual text units. Any segmentation should be detailed in an SRX (Segmentation Rules eXchange format) compliant document. A separate count of text units can be maintained within the GMX-V specification (see 3.7. Text Unit Counts).

2.2. Canonical Form

A precise canonical (base) form for individual text units is required to provide an accurate and unambiguous basis for conducting metrics. Native forms of the text are often encumbered with extraneous proprietary formatting codes, which make the production of unambiguous statistics difficult.

The XLIFF (XML Localization Interchange File Format) normalized XML form using Unicode encoding of the source language text shall be used as the basis for the canonical form for GMX-V. XLIFF is an OASIS standard. This is the format in which the text for a segmented sentence or stand alone piece of text appears in the <source> element of an XLIFF file <trans-unit> element. For audit purposes an XLIFF file is required for GMX-V metrics. The GMX-V count is undertaken on the basis of this XLIFF document.

Example:

   <source>An example of the canonical form of a text unit.</source>
   

GMX-V allows metrics to be produced directly from non-XLIFF format files, as long as the format for counting is based on the XLIFF canonical form for each text unit being counted.

The canonical form does not contain any embedded formatting characters, such as those that exist in an XLIFF document extracted from an RTF file. Any such characters must be removed to produce the canonical form. In addition, any formatting characters representing a space must be converted to the standard SPACE character (U+0020).

Original XLIFF <source> element with embedded RTF codes:


   <source>The <bpt i="1" x="1">{\b </bpt>black<ept i="1">}</ept><bpt i="2" x="2">{\i </bpt>
   cat<ept i="2">}</ept> eats.</source>

   

Canonical form:


   <source>The <bpt/>black<ept/><bpt/> cat<ept/> eats.</source>
   

Sub flow text within place holder elements needs to always be preserved:


    <source>Start<bpt id="2">code<sub>Text</sub></bpt>end<ept id="2">code</ept></source>
	

Canonical form:


    <source>Start<bpt/><sub> Text </sub><bpt/> end<ept/></source>

2.3. Unicode (ISO 10646)

Unicode Version 3.2 forms the fundamental basis for the XLIFF canonical form for character encoding and for establishing word boundaries. Apart from ISO 10646, two Unicode Technical Reports are used for establishing the canonical form:

Unicode Standard Annex #29 - Text Boundaries (TR 29-9)
Unicode Standard Annex #15 - Unicode Normalization Forms (TR 15) - Normalized Form C

Unicode TR 29 establishes the word boundaries that allow for words and characters to be counted. TR 15 establishes the actual canonical form for Unicode characters themselves. Normalized Form C is the form mandated by W3C for XML documents and can normally be taken for granted during any conversion from non-Unicode encoding form to Unicode using industry standard programming libraries.

2.4. Word Boundaries

Word and character counts are governed by Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 4 Word Boundaries, which in turn relies on the Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 3 Grapheme Cluster Boundaries rules. This standard unambiguously defines words as opposed to stand alone punctuation, white space or enclosing punctuation characters. All word and character counts will be on the basis of the Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 4 Word Boundaries.

A full definition of the application of Unicode TR 29 Word Boundaries to the GMX-V specification is provided in Section 2.8.

2.5. Verifiable and Non-Verifiable Metrics

Not all GMX-V metrics can be strictly defined or verified. Verifiable metrics can be defined for an electronic document in XLIFF canonical form. Non-verifiable metrics require a mechanism such as manual counting to establish their accuracy.

Non-verifiable metrics are not subordinate in any way to verifiable metrics; it is only that they cannot be proven on the basis of a given electronic document.

2.6. Inline Element Transparency

For word and character counts, the code for any inline elements (either empty or having content) within the canonical XLIFF representation will be treated as being totally transparent, that is, they should be treated as not being present. Inline elements will be counted separately. This is detailed in Section 2.11 Inline Element Counts below.

Example:


   <source>In this <g id="g1">exa<x id="x1"/>mple</g> the in-line codes do not form 
   part of the word or character counts but are counted separately</source>
   

would be counted as:


   <source>In this example the in-line codes do not form 
   part of the word or character counts but are counted separately</source>
   
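The transparency rule can be illustrated with a minimal Python sketch. It assumes the `<source>` element is already in canonical form (native formatting codes removed) and is well-formed XML; the function name `countable_text` is hypothetical and not part of GMX-V.

```python
import xml.etree.ElementTree as ET


def countable_text(source_xml: str) -> str:
    """Return the text of a canonical <source> element, ignoring inline markup."""
    source = ET.fromstring(source_xml)
    # itertext() yields the text and tail of every node in document order,
    # so the inline element tags simply drop out of the result.
    return "".join(source.itertext())


example = '<source>In this <g id="g1">exa<x id="x1"/>mple</g> the in-line codes</source>'
print(countable_text(example))  # In this example the in-line codes
```

Word and character counts would then be taken from the returned string, while the inline elements themselves are counted separately.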

In the canonical form, any inline codes that signify a space or new line character must be automatically preceded by a space character, if a space character is not otherwise present in the canonical XLIFF form. If an inline element has spatial characteristics, then it is up to the program that is generating the XLIFF file for GMX-V purposes to insert a space before the inline element if one does not exist. GMX-V is totally agnostic regarding how the XLIFF file is created and can imply nothing regarding inline elements, other than that they have no spatial characteristics.


   <source>The HTML break element <x id="x1" ctype="x-html-br"/>represented here by the in-line "x" element was 
   not preceded by a space in the original document.</source>
   

A separate "Inline Element" count is maintained for inline elements.

2.7. White Space Characters

Unicode white space characters are not counted in GMX-V.

The following list defines white space characters:

  • Unicode space characters (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but not non-breaking space ('\u00A0', '\u2007', '\u202F').
  • '\u0009', HORIZONTAL TABULATION.
  • '\u000A', LINE FEED.
  • '\u000B', VERTICAL TABULATION.
  • '\u000C', FORM FEED.
  • '\u000D', CARRIAGE RETURN.
  • '\u001C', FILE SEPARATOR.
  • '\u001D', GROUP SEPARATOR.
  • '\u001E', RECORD SEPARATOR.
  • '\u001F', UNIT SEPARATOR.
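
The list above can be expressed as a small Python predicate. This is a sketch only: it assumes that SPACE_SEPARATOR, LINE_SEPARATOR and PARAGRAPH_SEPARATOR correspond to the Unicode general categories Zs, Zl and Zp, and the name `is_gmx_white_space` is hypothetical.

```python
import unicodedata

# Non-breaking spaces are explicitly excluded from the white space definition.
NON_BREAKING = {"\u00A0", "\u2007", "\u202F"}

# Control characters listed individually in the definition.
CONTROL_WHITE_SPACE = {
    "\u0009", "\u000A", "\u000B", "\u000C", "\u000D",
    "\u001C", "\u001D", "\u001E", "\u001F",
}


def is_gmx_white_space(ch: str) -> bool:
    """Return True if a single character is white space under the list above."""
    if ch in CONTROL_WHITE_SPACE:
        return True
    if ch in NON_BREAKING:
        return False
    # Assumption: the three separator classes map to categories Zs, Zl and Zp.
    return unicodedata.category(ch) in {"Zs", "Zl", "Zp"}
```

Under this sketch `is_gmx_white_space(" ")` is true, while `is_gmx_white_space("\u00A0")` is false because the non-breaking space is excluded.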

2.8. Words

    Words form the basic unit of counting for the GMX-V specification. The character count is also based on identified words, with the exception of scripts that do not use spaces to separate words. Word separation is described in this section.

    Words are defined according to Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 4 Word Boundaries, which in turn relies on the Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 3 Grapheme Cluster Boundaries rules. Unicode TR 29 Section 4 defines detailed Boundary Property Values and Boundary Rules which distinguish words from other grapheme clusters such as punctuation characters. These form an integral part of the GMX-V specification.

    The following example, taken from Unicode TR 29, illustrates the identification of word boundaries:

    Example 1: Word Boundaries
    The   quick   ( " brown " )   fox   can't   jump   32.3   feet ,   right ?

    Followed by the extracted words:

    Example 2: Extracted Words
    The quick brown fox can't jump 32.3 feet right

    In addition Unicode TR 29 Section 4 provides an optional rule for the apostrophe character which relates to French and Italian usage such as "l'objectif". This rule known as "Break between apostrophe and vowels (French, Italian)" must also be applied for GMX-V. Apostrophe includes U+0027 (') APOSTROPHE and U+2019 (’) RIGHT SINGLE QUOTATION MARK (curly apostrophe).

    Thai, Lao, Khmer, Myanmar, Chinese, Japanese and Korean scripts are exempt from the word rules. See Section 2.13. for details of how these scripts are treated within the GMX-V standard.

    Hyphen characters will not be treated as word break characters. Hyphens include U+002D HYPHEN-MINUS, U+2010 HYPHEN, U+058A ARMENIAN HYPHEN and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN, and will form part of the character count if they appear as part of a word as in 'Italian-American'.

    No additional tailoring of the Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 4 Word Boundaries rules is permitted in the GMX-V specification.

    Example:

       <source>This sentence has a word count of 9 words.</source>
       <source>This sentence/text unit has a word count of 11 words.</source>
       
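The two counts above can be reproduced with a rough Python approximation. This sketch is not an implementation of the Unicode TR 29 word boundary rules: it uses a single regular expression, ignores the French and Italian apostrophe rule, and does not handle scripts written without spaces. The name `count_words` is hypothetical.

```python
import re

# Approximation only: a word is a run of letters/digits, optionally continued by
# an apostrophe, hyphen or full stop followed by more letters/digits, or by a
# comma followed by digits (so that "10,000.00" stays one word).
WORD = re.compile(r"\w+(?:['\u2019\-.]\w+|,\d+)*")


def count_words(text: str) -> int:
    """Count words in the text of a canonical text unit (approximate)."""
    return len(WORD.findall(text))


print(count_words("This sentence has a word count of 9 words."))             # 9
print(count_words("This sentence/text unit has a word count of 11 words."))  # 11
```

A conforming implementation would apply the TR 29 boundary properties and rules directly; the regular expression is used here only to keep the illustration short.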

    2.9. Characters

    The character count is predicated on the word count detailed in Section 2.8 above. For Thai, Lao, Khmer, Myanmar, Chinese, Japanese, Korean and other scripts that do not use spaces between words, the character counts are based on the non-punctuation grapheme boundaries. For all other scripts the character count is based on the identifiable words. Please refer to Section 2.8 above for a detailed explanation.

    Characters are counted based on Unicode encoding according to Unicode TR 15 - using Unicode Normalization Form C.
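
The effect of Normalization Form C on a character count can be sketched as follows. The function name `count_characters` is hypothetical, and the sketch counts code points after normalization, which only approximates grapheme clusters: a sequence with no precomposed form would still count as more than one.

```python
import unicodedata


def count_characters(words):
    """Sum the lengths of already-identified words after NFC normalization."""
    total = 0
    for word in words:
        # NFC composes base letter + combining mark into one code point
        # wherever a precomposed character exists.
        total += len(unicodedata.normalize("NFC", word))
    return total


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT is 5 code points as typed,
# but 4 after normalization to Form C.
print(count_characters(["cafe\u0301"]))  # 4
```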

    2.10. Punctuation Characters

    Punctuation characters do not form any part of the word counts.

    The only exceptions are the 'hyphen' and 'apostrophe' characters if they appear within a word as in: “can’t”, “aujourd’hui” or “out-of-the-box”.

    The apostrophe character count is qualified for French and Italian as per the "Break between apostrophe and vowels (French, Italian)" rule of Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 4 Word Boundaries, as described in Section 2.8 above. In this rule the 'apostrophe' acts as a word break and is also counted as part of the character count.

    Hyphens include U+002D HYPHEN-MINUS, U+2010 HYPHEN, U+058A ARMENIAN HYPHEN and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN, and will form part of the character count if they appear as part of a word as in 'Italian-American'. Please refer to Section 2.8 above for a detailed explanation.

    A separate count will be maintained for all other punctuation characters in the document.

    2.11. Inline Element Counts

    Inline elements give an indication of the complexity of the localization task. Among inline elements, a separate count will be maintained for elements that reference other elements. An additional count of inline elements will be maintained for each of the text unit categories detailed below. Inline elements with content will be counted as two inline elements.

    Example:

    
       <source>In this <g id="g1">example</g> 
       the in-line codes do not form part of the word or character counts but constitute a 
       separate inline element count of 2, because the inline element has content.</source>
       
       <source>In this <g id="g1">exa<x id="x1"/>mple</g> 
       the in-line codes do not form part of the word or character counts but constitute a 
       separate inline element count of 3, because we have one element with content and 
       one without.</source>
       
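The counting rule in the two examples above can be sketched in Python. The sketch assumes a well-formed canonical `<source>` element and treats an inline element as having content when it has text or child elements; the name `inline_element_count` is hypothetical.

```python
import xml.etree.ElementTree as ET


def inline_element_count(source_xml: str) -> int:
    """Count inline elements: 2 for an element with content, 1 for an empty one."""
    source = ET.fromstring(source_xml)
    total = 0
    for element in source.iter():
        if element is source:
            continue  # the <source> wrapper itself is not an inline element
        has_content = bool(element.text) or len(element) > 0
        total += 2 if has_content else 1
    return total


print(inline_element_count('<source>In this <g id="g1">example</g> text</source>'))             # 2
print(inline_element_count('<source>In this <g id="g1">exa<x id="x1"/>mple</g> text</source>'))  # 3
```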

    2.12. Linking Inline Elements

    An additional count of inline elements that link to another text unit will be kept. Linked elements require additional localization effort as the linked text unit needs to be referenced as part of the translation. These elements are also counted as part of the inline element count above.

    Example:

    
       <source>In this <g id="g1" xid="t2">example</g> the in-line element references another trans-unit via the xid attribute - 
       it forms part of the inline element count as well as the linking inline element count.</source>
       
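A linking inline element count could be derived along the following lines. The sketch assumes that a link is signalled by the `xid` attribute, as in the example above, and that each linking element is counted once; the name `linking_inline_element_count` is hypothetical.

```python
import xml.etree.ElementTree as ET


def linking_inline_element_count(source_xml: str) -> int:
    """Count inline elements that reference another text unit via xid."""
    source = ET.fromstring(source_xml)
    return sum(
        1
        for element in source.iter()
        if element is not source and "xid" in element.attrib
    )


linked = '<source>In this <g id="g1" xid="t2">example</g> text</source>'
print(linking_inline_element_count(linked))  # 1
```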

    2.13. Logographic Scripts

    Scripts such as Thai, Lao, Khmer, Myanmar, Chinese, Japanese, Korean and any other scripts that do not use spaces between words are exempt from the word rules detailed in Section 2.8 above. For these languages no subdivision of grapheme boundaries into individual words will be accommodated in this version of the GMX-V specification. The Unicode TR 29 - Text Boundaries, Section 3 Grapheme Cluster Boundaries rules will still apply to distinguish text from punctuation characters for these scripts. These will be used to provide character counts for these scripts. Please refer to Section 2.8 and Section 2.9 above for a detailed explanation.

    Apart from counts based on words, all other counts are relevant for these scripts.

    2.14. Qualitative Text Unit Categorization

    A typical translatable document will contain a variety of types of text units. Some of these will require translation and some will not, while other text units will require only proofing since they have been matched against a leveraged translation memory database. Regardless of the agreement between a translation supplier and a customer, a count for overall text units, as well as translatable and non-translatable, will be provided for both word and character counts. In the following XLIFF based examples the canonical form is that of the <source> element.

    Example:

    
       <trans-unit id="t1" translate="yes">
           <source>This is an example of translatable text</source>
           <target>This is an example of translatable text</target>
       </trans-unit>
       <trans-unit id="t2" translate="no" resname="alphanumeric">
           <source>10AB1024</source>
           <target>10AB1024</target>
       </trans-unit>
       <trans-unit id="t3" translate="no" resname="punctuation">
           <source>-</source>
           <target>-</target>
       </trans-unit>
       <trans-unit id="t4" translate="yes">
           <source>matched sentence</source>
           <target state-qualifier="leveraged-tm">zdanie dopasowane</target>
       </trans-unit>
       

    Additionally, prior to sending the translation project to the translation supplier, the customer may have analyzed the source document against a translation memory in order to retrieve previously-translated segments. In this instance, an additional word count of the segments that are found in the translation memory may be provided.

    Example:

       <trans-unit id="t1" translate="yes">
           <source>This text unit has been matched against a leveraged matched database.</source>
           <target state-qualifier="leveraged-tm">To zdanie zostalo dopasowane z bazy danych.</target>
       </trans-unit>
       

    Certain text units, such as numeric or measurement-only text units, may be converted automatically by software into the target language.

    Example:

       <trans-unit id="t1" translate="no" resname="numeric" reformat="x-numeric-format">
           <source>10,000.00</source>
           <target>10.000,00</target>
       </trans-unit>
       <trans-unit id="t2" translate="no" resname="measure" reformat="x-numeric-format">
           <source>10.50 mm</source>
           <target>10,50 mm</target>
       </trans-unit>
       

    It is up to the translation supplier and customer to agree on the exact nature and type of non-translatable text units. Text unit categorization will not be mandated, merely offered in the standard as an option.

    2.15. Unqualified Text Units

    Unqualified text units are text units that require translation. Any text unit that is not qualified as per Section 2.14. and has no exact or leveraged match is deemed to be unqualified.

    This classification is important as any auto text (see Section 2.19. Auto Text) or inline linking (see Section 2.12. Linking Inline Elements) and non-linking (see Section 2.11. Inline Element Counts) counts are only applied to unqualified text units.

    2.16. Translatable Text Counts

    One of the main aims of this standard is to produce an unambiguous, industry-accepted figure for the total quantity of words and characters within an electronic document. For translation tasks, the minimum required by the specification is a volume metric for words and characters that includes the following:

    Total Word Count
    This includes all words as defined in Section 2.8.
    Total Character Count
    This is the character count for all words as defined in Section 2.9.

    Over and above this minimum conformance for translation tasks (see 3.9. Conformance) required by the specification, it is up to the customer and supplier to decide how the other count categories will be applied to the translation task at hand. The specification provides a flexible and comprehensive vocabulary for customizing this calculation, but does not attempt to mandate any one given solution.

    2.17. XML Entity References

    The built-in XML entity reference characters &lt; &gt; &amp; &quot; and &apos; will be counted as a single character for character counting purposes.
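
A standard XML parser already resolves the built-in entity references, so counting on the parsed text gives one character per reference. The following Python sketch illustrates this; the name `resolved_text` is hypothetical.

```python
import xml.etree.ElementTree as ET


def resolved_text(source_xml: str) -> str:
    """Return the text of a <source> element with entity references resolved."""
    return "".join(ET.fromstring(source_xml).itertext())


text = resolved_text("<source>Fish &amp; chips &lt;hot&gt;</source>")
print(text)  # Fish & chips <hot>
# Each of &amp; &lt; &gt; now contributes exactly one character.
print(len(resolved_text("<source>&amp;</source>")))  # 1
```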

    2.18. User Defined XML Entity References

    The XLIFF canonical form will be used for the resolution of any user defined entities. Any user defined entities will have to be resolved in full within the canonical form.

    2.19. Auto Text

    Within a document it is possible to identify text segments that can be handled automatically. These include items such as numeric values (e.g. 10 or 10,000.00), measurement units (e.g. 10.5 mm), standard phrases or acronyms (e.g. WYSIWYG) and trade names (e.g. "Weapons of Mass Destruction™").

    It is possible to maintain word and character counts for such treatable text. These can be used to identify and charge for these categories at a different price. The categories will closely follow those defined in Section 2.14. Qualitative Text Unit Categorization, although they will apply ONLY to 'unqualified' text units which do not have fuzzy matching.

    Unqualified in this sense relates to text units that are not already covered by a qualitative category. This count category may be referred to as auto text for short.

    2.20. Repetition Counts

    Within a document the same unmatched text units may occur multiple times. This fact can be exploited by translation workbench software to automatically populate subsequent repeating text units once the first one has been translated. Subsequent occurrences can therefore be automatically qualified as 'repeat' text in the same manner as leveraged matched text. Auto text metrics can therefore only be applied to the first occurrence of the text unit and not those qualified as 'repeat' text units.
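
The first-occurrence rule can be sketched in Python as follows. The sketch assumes that two text units repeat when their canonical text is identical; the specification text above does not define the comparison, and the name `mark_repetitions` is hypothetical.

```python
def mark_repetitions(text_units):
    """Return one flag per text unit: True if it repeats an earlier unit."""
    seen = set()
    flags = []
    for unit in text_units:
        flags.append(unit in seen)  # only later occurrences are 'repeat' text
        seen.add(unit)
    return flags


units = ["Click OK.", "Cancel", "Click OK."]
print(mark_repetitions(units))  # [False, False, True]
```

Auto text metrics would then be gathered only for the units flagged `False`.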

    2.21. Current Commercial Practice

    Current commercial practice varies from product to product. There is no unified method of providing an industry wide set of metrics. The GMX-V specification, which is based on accepted standards such as Unicode TR 29-9, SRX and XLIFF, provides a level of detail sufficient to reconcile GMX-V metrics with those produced by current commercial practice.

    The actual makeup of the count is up to the supplier and customer.

    3. Counts

    The following count concepts are fundamental to the GMX-V standard:

    3.1. Word Count Categories

    The following word counts will be provided by symbolic name:

    TotalWordCount
    Total word count - an accumulation of the word counts, both translatable and non-translatable, from the individual text units that make up the document.
    Count categories for non-translatable words.
    This is an accumulation of the word counts from the non-translatable categories of text units within the document. The following possible categories are proposed:
    ProtectedWordCount
    An accumulation of the word count for text that has been marked as 'protected', or otherwise not translatable (XLIFF text enclosed in <mrk mtype="protected">).
    ExactMatchedWordCount
    An accumulation of the word count for text units that have been matched unambiguously with a prior translation and thus require no translator input.
    LeveragedMatchedWordCount
    An accumulation of the word count for text units that have been matched against a leveraged translation memory database.
    RepetitionMatchedWordCount
    An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.
    FuzzyMatchedWordCount
    An accumulation of the word count for text units that have been fuzzy matched against a leveraged translation memory database.
    AlphanumericOnlyTextUnitWordCount
    An accumulation of the word count for text units that have been identified as containing only alphanumeric words.
    NumericOnlyTextUnitWordCount
    An accumulation of the word count for text units that have been identified as containing only numeric words.
    PunctuationOnlyTextUnitWordCount
    An accumulation of the word count for text units that have been identified as containing only punctuation.
    MeasurementOnlyTextUnitWordCount
    An accumulation of the word count from measurement-only text units.
    W-OtherNonTranslatableTextUnitWordCount
    An accumulation of the word count for text units that have been identified as containing only other user-defined non-translatable words. The actual definition of the content and naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'W-'. This is an extension mechanism.
    TW-TranslatableTextUnitWordCount
    An accumulation of the word count for text units that have been identified as containing other user-defined translatable words. The actual definition of the content and naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'TW-'. This is an extension mechanism.

    The actual translatable text count can be obtained by subtracting all of the above categories from the TotalWordCount, with the exception of LeveragedMatchedWordCount, FuzzyMatchedWordCount and RepetitionMatchedWordCount. These last three categories can be used to qualify the translation count itself.
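
The subtraction described above can be sketched in Python. The category names are those listed in this section; the function name, the sample figures and the extension name `W-PartNumberWordCount` are hypothetical. The sketch deducts user-defined names beginning with 'W-' and leaves names beginning with 'TW-' in place, which is an assumption based on their 'translatable' naming.

```python
# Categories deducted from TotalWordCount. LeveragedMatchedWordCount,
# FuzzyMatchedWordCount and RepetitionMatchedWordCount are deliberately absent.
NON_TRANSLATABLE = (
    "ProtectedWordCount",
    "ExactMatchedWordCount",
    "AlphanumericOnlyTextUnitWordCount",
    "NumericOnlyTextUnitWordCount",
    "PunctuationOnlyTextUnitWordCount",
    "MeasurementOnlyTextUnitWordCount",
)


def translatable_word_count(counts: dict) -> int:
    """Derive the translatable word count from a dictionary of named counts."""
    deducted = sum(counts.get(name, 0) for name in NON_TRANSLATABLE)
    # User-defined non-translatable extension categories start with 'W-'.
    deducted += sum(v for k, v in counts.items() if k.startswith("W-"))
    return counts["TotalWordCount"] - deducted


sample = {
    "TotalWordCount": 1000,
    "ProtectedWordCount": 50,
    "ExactMatchedWordCount": 200,
    "FuzzyMatchedWordCount": 100,      # qualifies the count, not deducted
    "NumericOnlyTextUnitWordCount": 30,
    "W-PartNumberWordCount": 20,       # hypothetical extension category
}
print(translatable_word_count(sample))  # 700
```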

    3.2. Auto Text Word Count Categories

    The following auto text categories are applicable to text from unqualified text units with the exception of fuzzy matched text units.

    The following word counts will be provided by symbolic name:

    SimpleNumericAutoTextWordCount
    An accumulation of the word count for simple numeric values, e.g. 10.
    ComplexNumericAutoTextWordCount
    An accumulation of the word count for complex numeric values which include decimal and/or thousands separators, e.g. 10,000.00.
    MeasurementAutoTextWordCount
    An accumulation of the word count for identifiable measurement values, e.g. 10.50 mm. Measurement values take precedence over the above numeric categories. No double counting of these categories is allowed.
    AlphaNumericAutoTextWordCount
    An accumulation of the word count for identifiable alphanumeric words, e.g. AEG321.
    DateAutoTextWordCount
    An accumulation of the word count for identifiable dates, e.g. 25 June 1992.
    TMAutoTextWordCount
    An accumulation of the word count for identifiable trade marks, e.g. "Weapons of Mass Destruction™".
    AW-OtherAutoTextWordCount
    Other auto text word counts. The actual naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'AW-'. This is an extension mechanism.

    3.3. Character Count Categories

    The following character counts will be provided by symbolic name:

    TotalCharacterCount
    An accumulation of the character counts, both translatable and non-translatable, from the individual text units that make up the document. This count includes all non white space characters in the document (please refer to Section 2.7. White Space Characters for details of what constitutes white space characters), excluding inline markup.
    PunctuationCharacterCount
    The total of all punctuation characters in the document that do not form part of the word count as per Section 2.10. Punctuation Characters, excluding inline markup.
    Count categories for non-translatable characters.
    This is an accumulation of the character counts from the non-translatable categories of text units within the document. The following possible categories are proposed:
    ProtectedCharacterCount
    An accumulation of the character count for text that has been marked as 'protected', or otherwise not translatable (XLIFF text enclosed in <mrk mtype="protected">).
    ExactMatchedCharacterCount
    An accumulation of the character count for text units that have been matched unambiguously with a prior translation and require no translator input.
    LeveragedMatchedCharacterCount
    An accumulation of the character count for text units that have been matched against a leveraged translation memory database.
    RepetitionMatchedCharacterCount
    An accumulation of the character count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.
    FuzzyMatchedCharacterCount
    An accumulation of the character count for text units that have a fuzzy match against a leveraged translation memory database.
    AlphanumericOnlyTextUnitCharacterCount
    An accumulation of the character count for text units that have been identified as containing only alphanumeric words.
    NumericOnlyTextUnitCharacterCount
    An accumulation of the character count for text units that have been identified as containing only numeric words.
    PunctuationOnlyTextUnitCharacterCount
    An accumulation of the character count for text units that have been identified as containing only punctuation.
    MeasurementOnlyTextUnitCharacterCount
    An accumulation of the character count from measurement-only text units.
    C-OtherNonTranslatableTextUnitCharacterCount
    An accumulation of the character count for text units that have been identified as containing only other user-defined non-translatable words. The actual definition of the content and naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'C-'. This is an extension mechanism.
    TC-TranslatableTextUnitCharacterCount
    An accumulation of the character count for text units that have been identified as containing only other user-defined non-translatable words. The actual definition of the content and naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'TC-'. This is an extension mechanism.

    The actual translatable text count can be obtained by subtracting all of the above categories from the TotalCharacterCount, with the exception of LeveragedMatchedCharacterCount, FuzzyMatchedCharacterCount and RepetitionMatchedCharacterCount. These last three categories can be used to qualify the translation count itself.
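
    As an illustration of the subtraction described above, the following sketch shows a verifiable count-group carrying a few of the character count categories. All values are invented for the example, and the metrics: prefix is assumed to be bound to the GMX-V namespace by an enclosing element.

    ```xml
    <!-- Hypothetical values, for illustration only. -->
    <metrics:count-group name="verifiable">
        <metrics:count type="TotalCharacterCount" value="12000"/>
        <!-- Subtracted from the total -->
        <metrics:count type="ProtectedCharacterCount" value="300"/>
        <metrics:count type="ExactMatchedCharacterCount" value="2500"/>
        <metrics:count type="NumericOnlyTextUnitCharacterCount" value="120"/>
        <metrics:count type="PunctuationOnlyTextUnitCharacterCount" value="30"/>
        <!-- Not subtracted: these qualify the translatable count -->
        <metrics:count type="FuzzyMatchedCharacterCount" value="1800"/>
        <metrics:count type="RepetitionMatchedCharacterCount" value="600"/>
    </metrics:count-group>
    ```

    Under these assumed values the translatable character count is 12000 - (300 + 2500 + 120 + 30) = 9050, of which 1800 characters have a fuzzy match and 600 are repetitions.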

    3.4. Auto Text Character Count Categories

    The following auto text categories are applicable to text from unqualified (see Section 2.15 Unqualified Text Units) text units.

    The following character counts will be provided by symbolic name:

    SimpleNumericAutoTextCharacterCount
    An accumulation of the character count for simple numeric values, e.g. 10.
    ComplexNumericAutoTextCharacterCount
    An accumulation of the character count for complex numeric values which include decimal and/or thousands separators, e.g. 10,000.00.
    MeasurementAutoTextCharacterCount
    An accumulation of the character count for identifiable measurement values, e.g. 10.50 mm. Measurement values take precedence over the above numeric categories. No double counting of these categories is allowed.
    AlphaNumericAutoTextCharacterCount
    An accumulation of the character count for identifiable alphanumeric words, e.g. AEG321.
    DateAutoTextCharacterCount
    An accumulation of the character count for identifiable dates, e.g. 25 June 1992.
    TMAutoTextCharacterCount
    An accumulation of the character count for identifiable trade marks, e.g. "Weapons of Mass Destruction™".
    AC-OtherAutoTextCharacterCount
    Other auto text character counts. The actual naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'AC-'. This is an extension mechanism.
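
    The auto text categories above might be reported as in the following sketch. The values and the AC-PartNumberAutoTextCharacterCount name are hypothetical, and placing these counts in the verifiable group is an assumption rather than something this section states.

    ```xml
    <metrics:count-group name="verifiable">
        <metrics:count type="SimpleNumericAutoTextCharacterCount" value="42"/>
        <metrics:count type="MeasurementAutoTextCharacterCount" value="64"/>
        <metrics:count type="DateAutoTextCharacterCount" value="130"/>
        <!-- Hypothetical extension category; only the 'AC-' prefix is mandated. -->
        <metrics:count type="AC-PartNumberAutoTextCharacterCount" value="88"/>
    </metrics:count-group>
    ```

    Because double counting is not allowed, a value such as 10.50 mm would contribute to MeasurementAutoTextCharacterCount only, and not to ComplexNumericAutoTextCharacterCount as well.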

    3.5. Inline Element Count Categories

    The following counts will be maintained for non-linking inline elements by symbolic name:

    Translatable inline element count:
    TranslatableInlineCount
    The actual non-linking inline element count for unqualified (see Section 2.15 Unqualified Text Units) text units.

    Please refer to Section 2.11. Inline Element Counts for a detailed explanation and examples for this category.

    3.6. Linking Inline Element Count Categories

    The following count will be maintained for linking inline elements by symbolic name:

    Translatable linking inline element count:
    TranslatableLinkingInlineCount
    The actual linking inline element count for unqualified (see Section 2.15 Unqualified Text Units) text units.

    Please refer to Section 2.12. Linking Inline Elements for a detailed explanation and examples for this category.

    3.7. Text Unit Counts

    GMX-V is predicated on the text unit level of granularity (see 2.1. Text Unit). The text unit count encompasses the total number of identifiable text units in the document being counted. Any segmentation should be detailed in an SRX (Segmentation Rules eXchange format) compliant document.

    TextUnitCount
    The total number of text units.

    3.8. Other Count Categories

    The following other counts can be provided by symbolic name:

    TextUnitCount
    A count of the total number of text units.
    FileCount
    The total number of files.
    PageCount
    The total number of pages.
    ScreenCount
    A count of the total number of screens.
    OC-OtherCountCategories
    Other count categories. The actual naming of the attribute is up to the supplier and customer with the one requirement that they begin with the sequence 'OC-'. This is an extension mechanism.

    Additional count categories can be identified according to the requirements of the particular application. These will often be required for non-verifiable metrics such as screen shots or physical pages. By their very nature it is not possible to standardize these count categories. Each specific implementation shall define its own custom categories. User-defined other count categories will always begin with the sequence 'OC-'.

    3.9. Conformance

    A minimum conformance level will encompass the provision of the following categories of GMX-V for translation related tasks:

    1. TotalWordCount
    2. TotalCharacterCount

    Over and above the minimum level of conformance for translation related tasks it is up to the tool supplier to provide the level of detail that is required from their product.

    It is recommended that the full levels of detail are provided for both word and character counts, although it is acknowledged that this may depend on the capabilities of individual tools. For instance, a given tool may not support auto text (see Section 2.19. Auto Text) and so would not be able to support auto text count categories (see 3.4. Auto Text Character Count Categories).

    For non translation related Global Information Management tasks there is no minimum level of conformance apart from the need to provide at least one count metric.
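
    A minimal instance meeting the conformance level for translation related tasks might look like the following sketch. The tool name, dates and count values are hypothetical. The element and attribute names are those defined in Sections 4 and 5 of this document.

    ```xml
    <metrics:metrics xmlns:metrics="urn:lisa-metrics-tags" version="1.0"
                     date="2006-02-21T09:00:00Z" source-language="en-GB"
                     tool-name="Example Counter" tool-version="0.1">
        <metrics:stage phase="initial" date="2006-02-21T09:00:00Z">
            <metrics:count-group name="verifiable">
                <!-- The two categories required for minimum conformance -->
                <metrics:count type="TotalWordCount" value="1520"/>
                <metrics:count type="TotalCharacterCount" value="9310"/>
            </metrics:count-group>
        </metrics:stage>
    </metrics:metrics>
    ```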

    3.10. Validation

    Any measurement standard must have a reference implementation as well as an authoritative body that tests and validates the measuring instruments. In the USA, this is provided by the National Institute of Standards and Technology. In order to be successful, GMX-V must provide for a certification authority that will (1) maintain reference documents with known metrics and (2) provide an online facility to test given XLIFF documents. In this way, both customers and suppliers can be safe in the knowledge that GMX-V provides an unambiguous and reliable way of quantifying a Global Information Management task.

    4. General Structure

    The GMX-V document structure is designed to exist as a namespace so that it can be embedded into any document.

    GMX-V comprises the following elements:

    metrics
    This is the top level element for GMX-V.
    stage
    GMX-V counts can be maintained for each stage in the workflow.
    count-group
    This is the main count group identifier. There are separate count-group elements for verifiable and non-verifiable counts.
    count
    The individual categorization, units of measure and values are declared in count elements.

    The following is an example of a GMX-V instance:

    <metrics:metrics xmlns:metrics="urn:lisa-metrics-tags" version="1.0" date="2004-12-18T13:06:52Z" source-language="en-GB" tool-name="XYZ Tool" tool-version="1.23">
        <metrics:stage phase="initial" date="2004-12-18T13:06:52Z">
            <metrics:notes from="auser@company.com">
                Initial count based on source document.
            </metrics:notes>
            <metrics:count-group name="non-verifiable">
                <metrics:count type="OC-TestingFiles" value="99"/>
                <metrics:count type="OC-DTPFiles" value="99"/>
                <metrics:count type="ScreenCount" value="99"/>
            </metrics:count-group>
            <metrics:count-group name="verifiable">
                <metrics:count type="TotalWordCount" value="99"/>
                <metrics:count type="TotalCharacterCount" value="99"/>
                <metrics:count type="TranslatableLinkingInlineCount" value="99"/>
            </metrics:count-group>
        </metrics:stage>
    </metrics:metrics>
     

    4.1. Metrics Element

    The <metrics> element is the top level of the hierarchy. It signals the start of the metrics namespace DOM tree. Its direct children are one or more <stage> elements. Metrics can be maintained for one or more stages of the localization workflow.

    4.2. Stage Element

    The <stage> element is used to hold the <count-group> elements for a specific count stage, as well as one or more optional <notes> elements.

    4.3. Notes Element

    The <notes> element is used to hold optional comments about the metrics stage.

    4.4. Count Group Element

    The <count-group> element is used to contain verifiable or non-verifiable <count> elements.

    4.5. Count Element

    The individual <count> elements hold the values of the count and identify the type of count.

    5. Detailed Specification

    5.1. GMX-V Namespace Declaration

    The GMX-V document structure is designed to exist primarily as a namespace so that it can be embedded into any document. It is envisaged that its primary use will be within XLIFF documents and Translation Web Services as a separate namespace.

    The GMX-V namespace declaration has the following form:

      xmlns:metrics="urn:lisa-metrics-tags"
      

    All GMX-V elements are normally prefixed with the GMX-V namespace identifier metrics:.
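
    The following sketch shows one way the namespace might be declared and used inside an XLIFF 1.1 document. Placing the GMX-V block in the XLIFF <header> is an assumption made for illustration, since this document does not prescribe a location. The file name, tool name, dates and count values are hypothetical.

    ```xml
    <xliff version="1.1" xmlns="urn:oasis:names:tc:xliff:document:1.1"
           xmlns:metrics="urn:lisa-metrics-tags">
        <file original="manual.xml" source-language="en-GB" datatype="xml">
            <header>
                <!-- GMX-V block carried as a foreign-namespace extension -->
                <metrics:metrics version="1.0" date="2006-02-21T09:00:00Z"
                                 source-language="en-GB"
                                 tool-name="Example Counter" tool-version="0.1">
                    <metrics:stage phase="initial" date="2006-02-21T09:00:00Z">
                        <metrics:count-group name="verifiable">
                            <metrics:count type="TotalWordCount" value="1520"/>
                            <metrics:count type="TotalCharacterCount" value="9310"/>
                        </metrics:count-group>
                    </metrics:stage>
                </metrics:metrics>
            </header>
            <body>
                <!-- trans-unit elements omitted -->
            </body>
        </file>
    </xliff>
    ```

    Declaring the prefix once on the root element keeps every GMX-V element unambiguous without repeating the namespace URI.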

    5.2. Elements

    Elements <metrics>, <stage>, <notes>, <count-group>, <count>

    5.2.1. Metrics

    The topmost GMX-V element has the following format:

    <metrics>

    The <metrics> element encloses all the other GMX-V elements of the document.

    Required attributes:

    version - the fixed GMX-V current version identifier, currently "1.0".

    date - the date that the GMX-V namespace was created for the document.

    source-language - the language in which the document is authored.

    tool-name - The name of the tool that generated the metrics.

    tool-version - The version identifier of the tool that generated the metrics.

    Optional attributes:

    target-language - the target language for the document. Only relevant if any translation memory matching has been done for a particular target language.

    Contents:

    One or more <stage> elements.

    5.2.2. Stage

    The Stage element has the following format:

    <stage>

    Required attributes:

    phase - The identifier for the stage.

    date - the date that the stage count was created.

    Optional attributes:

    NONE

    Contents:

    Optional <notes> elements and one or more <count-group> elements.

    5.2.3. Notes

    The Notes element has the following format:

    <notes>

    Required attributes:

    NONE

    Optional attributes:

    from - The email address or other identifier of the creator.

    date - the date that the notes element was created.

    Contents:

    Comments text, no elements.

    5.2.4. Count Group

    The Count Group element has the following format:

    <count-group>

    Required attributes:

    name - The count group name. This will have two possible values: verifiable or non-verifiable.

    Optional attributes:

    state - The count group state. This can be used to create count groups for different states during translation.

    Contents:

    One or more <count> elements.

    5.2.5. Count

    The Count element has the following format:

    <count>

    Required attributes:

    type - The count type.

    value - The quantity value.

    Optional attributes:

    category - The fuzzy match count category, e.g. "93-95".

    Contents:

    EMPTY
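
    The optional category attribute might be used as in the following sketch, which splits a fuzzy match count into percentage bands. The band boundaries and values are hypothetical, and repeating the same type once per band is an assumption about usage rather than a rule stated here.

    ```xml
    <metrics:count-group name="verifiable">
        <!-- One count element per assumed fuzzy match band -->
        <metrics:count type="FuzzyMatchedCharacterCount" category="93-95" value="410"/>
        <metrics:count type="FuzzyMatchedCharacterCount" category="96-99" value="275"/>
    </metrics:count-group>
    ```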

    5.3. Attributes

    This section lists the attributes used in the metrics elements. An attribute is never specified more than once for each element. Along with some of the attributes are the "Recommended Attribute Values". Values for these attributes are case sensitive. These lists are purely informative; the goal is to specify a preferred syntax so tools can have some level of compatibility.

    attributes category, date, from, name, phase, source-language, state, target-language, tool-name, tool-version, type, value, version

    5.3.1. GMX-V Attributes

    date

    The date attribute indicates when a given element was created or modified.

    Value description:

    Date in [ISO 8601] Format. The recommended pattern to use is: CCYY-MM-DDThh:mm:ssZ
    Where: CCYY is the year (4 digits), MM is the month (2 digits), DD is the day (2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the seconds (2 digits), and Z indicates the time is UTC time. For example:

    date="2002-01-25T21:06:00Z"
    is January 25, 2002 at 9:06pm GMT
    is January 25, 2002 at 2:06pm US Mountain Time
    is January 26, 2002 at 6:06am Japan time

    Default value:

    Undefined.

    Used in:

    <metrics>, <stage>, <notes>.

    source-language

    The language for the main <metrics> element.

    Value description:

    A language code as described in the [RFC 3066]. For more information see the section on xml:lang in the XML specification, and the erratum E11 (which replaces RFC 1766 by RFC 3066).

    Default value:

    Undefined.

    Used in:

    <metrics>.

    target-language

    The target language for the main <metrics> element.

    Value description:

    A language code as described in the [RFC 3066]. For more information see the section on xml:lang in the XML specification, and the erratum E11 (which replaces RFC 1766 by RFC 3066).

    Default value:

    Undefined.

    Used in:

    <metrics>.

    version

    The current GMX-V version number.

    Value description:

    The version number of this metrics document.

    Fixed value:

    1.0

    Used in:

    <metrics>.

    from

    The email address or other identifier of the creator of a given notes element.

    Value description:

    The identifier of the creator of this notes element.

    Default value:

    Undefined

    Used in:

    <notes>.

    name

    The name of the count-group.

    Value description:

    Must have the value verifiable or non-verifiable.

    Default value:

    Undefined

    Used in:

    <count-group>.

    phase

    The phase name of the stage.

    Value description:

    Can have the value initial, final or user defined.

    Default value:

    Undefined

    Used in:

    <stage>.

    tool-name

    The identifier of the tool used to create the metrics.

    Value description:

    The name of the tool used to perform the metrics count.

    Default value:

    Undefined

    Used in:

    <metrics>.

    tool-version

    The version identifier of the tool used to perform the metrics count.

    Value description:

    The version identifier of the GMX-V count tool.

    Default value:

    Undefined

    Used in:

    <metrics>.

    category

    The category of fuzzy match. This is the percentage category within which the match falls, e.g. "99-95".

    Value description:

    The fuzzy match category value.

    Default value:

    Undefined

    Used in:

    <count>.

    state

    The optional count-group state qualifier. Separate count-group elements can be maintained for the different states of the target elements that correspond to the counted source element content in an XLIFF file.

    Value description:

    The pre-defined values are based on the state attribute values from the XLIFF specification document.

    Value                       Description
    final                       The count-group for XLIFF trans-units whose target elements have a state attribute of 'final'.
    needs-adaptation            The count-group for XLIFF trans-units whose target elements have a state attribute of 'needs-adaptation'.
    needs-l10n                  The count-group for XLIFF trans-units whose target elements have a state attribute of 'needs-l10n'.
    needs-review-adaptation     The count-group for XLIFF trans-units whose target elements have a state attribute of 'needs-review-adaptation'.
    needs-review-l10n           The count-group for XLIFF trans-units whose target elements have a state attribute of 'needs-review-l10n'.
    needs-review-translation    The count-group for XLIFF trans-units whose target elements have a state attribute of 'needs-review-translation'.
    needs-translation           The count-group for XLIFF trans-units whose target elements have a state attribute of 'needs-translation'.
    new                         The count-group for XLIFF trans-units whose target elements have a state attribute of 'new'.
    signed-off                  The count-group for XLIFF trans-units whose target elements have a state attribute of 'signed-off'.
    translated                  The count-group for XLIFF trans-units whose target elements have a state attribute of 'translated'.

    In addition, XLIFF user-defined values can be used with this attribute. A user-defined value must start with an "x-" prefix.

    Default value:

    Undefined.

    Used in:

    <count-group>.
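
    As a sketch of how the state qualifier might be used, the following fragment keeps separate verifiable count-groups for two XLIFF target states within one stage. The phase name "review", the date and the count values are hypothetical.

    ```xml
    <metrics:stage phase="review" date="2006-02-21T09:00:00Z">
        <!-- Words in trans-units whose target state is 'translated' -->
        <metrics:count-group name="verifiable" state="translated">
            <metrics:count type="TotalWordCount" value="900"/>
        </metrics:count-group>
        <!-- Words in trans-units whose target state is 'needs-review-translation' -->
        <metrics:count-group name="verifiable" state="needs-review-translation">
            <metrics:count type="TotalWordCount" value="620"/>
        </metrics:count-group>
    </metrics:stage>
    ```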

    type

    The count type.

    Value description:

    Can have any of the count category names defined in Section 3 (for example TotalWordCount, TotalCharacterCount or TextUnitCount), or a user-defined type.

    Default value:

    Undefined.

    Used in:

    <count>.

    value

    The numeric value of the count record.

    Value description:

    The value of this count.

    Default value:

    0

    Used in:

    <count>.

     

    A. GMX-V Document Structure

    The following figure shows the possible structure as a tree. Each element is followed by notation indicating its possible occurrence according to the corresponding legend.

    (legend: 1 = one
             + = one or more
             ? = zero or one
             * = zero, one or more)
    
    <metrics>1
    |
    +--- <stage>+
         |
         +--- <notes>?
         |
         +--- <count-group>+
              |
              +--- <count>+
    
    
    
    

    B. Document Type Definition and Schema

    C. References

    Normative

    [Unicode 3.2]
    Unicode 3.2.
    [Unicode Standard Annex #29-9]
    Unicode Standard Annex #29 Version 4.1.0 Text Boundaries.
    [Unicode Standard Annex #15]
    Unicode Standard Annex #15 Version 4.1.0 Unicode Normalization Forms.
    [IANA Charsets]
    IANA Names for Character Sets. IANA (Internet Assigned Numbers Authority), Aug 2001
    [ISO 639]
    Codes for the Representation of Names of Languages. ISO (International Organization for Standardization), Nov 2001.
    [ISO 3166]
    Codes for the representation of names of countries and their subdivisions. ISO (International Organization for Standardization), Jun 2000.
    [ISO 8601]
    Representation of dates and times. ISO (International Organization for Standardization), Dec 2000.
    [RFC 3066]
    RFC 3066 Tags for the Identification of Languages. IETF (Internet Engineering Task Force), Jan 2001.
    [SRX 1.0]
    SRX 1.0 Specification. LISA (Localization Industry Standards Association), 24 July 2003.
    [XLIFF 1.1]
    XLIFF 1.1 Specification. OASIS XLIFF Committee Specification, 31 Oct 2003.
    [Translation Web Services]
    Translation Web Services Technical Committee
    [XML 1.0]
    Extensible Markup Language (XML) 1.0 Second Edition. W3C (World Wide Web Consortium), Oct 2000.
    [XML Names]
    Namespaces in XML. W3C (World Wide Web Consortium), Jan 1999.

    Non-Normative

    [ISO]
    International Organization for Standardization Web site.
    [LISA]
    Localization Industry Standards Association Web site.
    [OSCAR]
    OSCAR (Open Standards for Container/Content Allowing Re-use) Web site.
    [OASIS]
    Organization for the Advancement of Structured Information Standards Web site.
    [Unicode]
    Unicode Consortium Web site.
    [W3C]
    World Wide Web Consortium Web site.

    D. Glossary

    DTD
    An XML document can have an associated Document Type Definition (DTD) that specifies the rules for the structure of the document. Several industries have standardized on various DTDs for the different types of documents that they share.
    GIM
    Global Information Management
    GMX-V
    Global Information Management Metrics eXchange Volume.
    OSCAR
    LISA special interest group (Open Standards for Container/Content Allowing Re-use).
    Unicode
    Unicode is the official way to implement ISO/IEC 10646, the universal character encoding standard.
    UTC
    UTC stands for Coordinated Universal Time.
    XLIFF
    OASIS Standard for XML Localization Interchange File Format.
    XML
    eXtensible Markup Language.
    DOM
    W3C Document Object Model.