Extended Reference Concrete Syntax (ERCS)

From: http://www.allette.com.au/sgml/ercs/text/ercs.txt

Extended Reference Concrete Syntax (ERCS)
& ISO/IEC 10646-1 (Unicode 1.1) Public Character Entities
Nov 30, 1995
-------------------------------------------------------------------------------
The major technical challenge for SGML at the current time is how to support SGML documents in languages that require more than just ISO 646 (ASCII): East Asian and CJK (China/Japan/Korea) documents in particular. The proposed Extended Reference Concrete Syntaxes (ERCS) address the issues of native-language tagging, tagging for interchange between different character sets, and the representation of extended characters.
-------------------------------------------------------------------------------
Contents

   Which? (Issues)
   Why? (Goals)
   What? (Overview)
   Example of SGML Declaration using ERCS
   Relationship to ISO/IEC 10646-1:1993 (Unicode 1.1)
   ISO/IEC 10646-1 Public Character Entities
   How? (Rules)
   CJK Issues
   European Application
   Restricted Zone "compatibility" characters
   Vendor's Strategy
   Key Terms
   Subset Repertoires and Draft Declarations
   Status
-------------------------------------------------------------------------------
Request for Comments

Comments are sought on the ERCS. Email Rick Jelliffe at ricko@allette.com.au.

   Who? (Acknowledgments)
   Where?
   Change Log
-------------------------------------------------------------------------------
NOTE: none of the example SGML declarations are current: I'll fix them in the next week or so.

Which Issues Does ERCS Address?
November 30, 1995
-------------------------------------------------------------------------------
The existing concrete syntaxes of ISO 8879, including the default Reference Concrete Syntax (RCS), only allow certain ISO 646 (ASCII) characters to be used for markup (tag names, short references, etc.).
This is satisfactory for the English language and for many international predefined DTDs such as DocBook and HTML, but may not be enough for users and documents of non-Latin-based character sets.
-------------------------------------------------------------------------------
Native-Language Markup

Much of the value of using SGML markup, especially for structure-based searches in hypertext, is that the tag names and other markup can have meaning to the user rather than being cryptic mnemonics. This is especially true for SGML documents that contain fielded data. So the provision of native-language tagging is a key facility that SGML will need to supply to be successful.

So the best concrete syntax for a given character set is one that does not artificially or gratuitously restrict what characters are available for use as markup. In the absence of other factors, if a character appears in words in the native language, it should be available for use in NAMEs. Similarly, if a symbol character is readily available from the keyboard, it should be available for use in short references.
-------------------------------------------------------------------------------
Document Interchange

In the CJK nations in particular, there are dozens of character sets. As Ken Lunde's book Understanding Japanese Information Processing (O'Reilly 1993) shows, though, there is some order and subset correspondence between them. However, even the largest sets are not enough for many uses; furthermore, character sets are often augmented by user-defined characters.

When SGML documents are interchanged between systems with different character sets, only the common characters may be used as markup and attribute values. The excess characters must be transported in data content (PCDATA and RCDATA data content and CDATA attributes) through the use of entity references. This complicates the natural approach for supporting native-language tagging: to merely make all characters available for markup.
There are two approaches:

 * "lowest-common-denominator" just uses the ISO 8879 RCS: this is the only standardized approach at the moment;
 * "greatest-common-denominator" uses the characters available from the intersection set repertoire of all the character sets in use.

ERCS allows a standardized greatest-common-denominator approach.
-------------------------------------------------------------------------------
Character Representation

Users of smaller CJK character sets may often need to represent other characters, especially Han ideographs (Kanji). ERCS proposes a three-tier system: the document character set characters, plus an ISO 10646 public character entity set to cover most needs, with any other characters being named with respect to national character dictionaries.
-------------------------------------------------------------------------------
ISO 8879 SGML Declaration Improvement

The SGML Standard ISO 8879:1986 does not support CJK needs conveniently: in fact, it is a hindrance. ERCS gives more experience in what reforms would be useful.

I presented the syntax changes suggested by ERCS at the WG8 meeting at Broomfield in October 1995, as Australian delegate, and with the support of Japan. This is note N 1815 and is available as an RTF file.
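The "greatest-common-denominator" approach described under Document Interchange can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's built-in codec tables are used as stand-ins for the registered character sets, and the function names `repertoire` and `common_repertoire` are hypothetical, not part of ERCS.

```python
def repertoire(encoding, limit=0x10000):
    """Return the set of BMP characters that the named codec can encode.

    Assumption: the Python codec approximates the character set's repertoire.
    """
    chars = set()
    for code in range(limit):
        if 0xD800 <= code <= 0xDFFF:
            continue  # surrogate code points are not characters
        ch = chr(code)
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # not in this character set
        chars.add(ch)
    return chars


def common_repertoire(encodings):
    """Intersection of the repertoires: the characters safe to use in markup."""
    return set.intersection(*(repertoire(e) for e in encodings))


if __name__ == "__main__":
    shared = common_repertoire(["shift_jis", "euc_jp"])
    print(len(shared), "characters are common to both encodings")
```

Characters outside the returned set would have to travel as entity references in data content, as described above.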
-------------------------------------------------------------------------------
Goals of ERCS
March 18, 1995
-------------------------------------------------------------------------------
ERCS has the following simple goals:

 * East Asians should be able to mark up documents using native-language characters and conventions;
 * East Asian languages should be supported just as simply as English currently is;
 * East Asians should be able to store and maintain documents in national character sets;
 * only solutions consistent with existing and proposed standards should be used;
 * ERCS must be readily comprehensible and implementable by vendors, and meet basic user needs: in particular ISO 8879 should not be made more complicated to understand or use;
 * a solution should be derived "backwards" from the desirable qualities of SGML documents and markup rather than "forwards" from a consideration of character encoding issues. The proper subject is not "how to handle East Asian character sets" but "how to handle East Asian documents";
 * ERCS should be locale-free and character-set independent, and be flexible enough to be useful for national character sets of emerging nations.

ERCS is primarily concerned with the needs of East Asian and CJK SGML documents. However, ERCS should also be useful for most non-Latin-based languages: indeed perhaps for most major non-English languages.
-------------------------------------------------------------------------------
Overview of ERCS
November 30, 1995
-------------------------------------------------------------------------------
The various parts of ERCS are:

 * a large catalog of characters, giving their SGML character class and roles;
 * concrete syntaxes based on subsets of the catalog character repertoire;
 * a large public entity set, based on the catalog character repertoire;
 * guidelines for vendors for selecting the most useful ERCS syntaxes.
ERCS is designed to provide a simple and standard basis for vendors, and to be convenient and powerful for users. In essence, the ERCS says this: "if your document character set has character X, and if you want character X to be used in markup, then this should be its class and role."
-------------------------------------------------------------------------------
SGML Declaration using ERCS
March 30, 1995
-------------------------------------------------------------------------------
To demonstrate how much ERCS could simplify the SGML declaration, here is an example, using a fictitious system character set that contains the same repertoire as ISO/IEC 10646-1. As can be seen, it is just as simple as an English-language DTD.

   %unilatsup; %unilatext2; %uniipa; %unisml; %unidcm; %unibasbrk;
   %unigrkcop; %unicyr; %uniarm; %uniheba; %uniheb; %unihebb;
   %uniarab; %uniarabext; %unidev; %uniben; %unigur; %uniguja;
   %unioriya; %unitamil; %unitelu; %unikann; %unimal; %unithai;
   %unilao; %unigeo; %unigeoext; %unijamo; %unilatextadd; %unigrkext;
   %unipunct; %unisupsub; %unibuck; %unicdmfs; %uniquasi; %uninum;
   %uniarrow; %unimath; %unitech; %unicont; %uniocr; %unienc;
   %unibox; %uniblock; %unishapes; %unimisc; %dingbats; %unicjksym;
   %unihira; %unikata; %unibopo; %unijamo-c; %unicjkmisc; %unicjkenc;
   %unicjk-c; %unihangul; %unihangula; %unihagulb; %unihan; %uniprivate;
   %unihan-c; %unialpha-r; %uniarab-r; %unihalf-r; %unicjk-r; %unismall-r;
   %uniarabb-r; %unihalffull-r; %unispecial;

There have been some other proposals following on from these, for new character reference mechanisms in SGML: in particular, an external numeric character reference and a CHAR markup declaration. Both of these would provide a better solution than the above mechanism.
-------------------------------------------------------------------------------
ISO/IEC 10646-1 Public Entity Set
June 15, 1995
-------------------------------------------------------------------------------
This is a public entity set to make all ISO 10646-1 1993 BMP (with implementation level 3) characters available for use in any document of any character set. The names are based on the UCS-2 encoding, which is also available as Version 1.1 of the Unicode Character Standard. Hex numbers are used; names are not memorable after the first few hundred, and many Han ideographs do not have descriptive names.

The Unicode 1.1 character set has about 34 168 characters: 20 192 of them are named by their code number. The Unicode Consortium publishes lists of characters with representative glyphs that are more convenient for users.

The convention for referring to Unicode 1.1 characters is U+xxxx. The ERCS ISO/IEC 10646-1:1993 Public Entity Set version of this character may be invoked with &Uxxxx;, for example &U4000;.

The public identifier for this entity set is currently:

   "-//SPREAD ERCS//ENTITIES ISO/IEC 10646-1:1993 BMP UCS-2//EN"

Because this entity set is so large, the intention is that vendors should build it into their products rather than read in the set from a file. They could use the number in the name as an index for example, rather than using the name as a key.

Only CJK needs this full set. A subset may be more suitable for the rest of East Asia and other 8-bit countries.

If an explicit entity set is required, it can be constructed by repeating the following line 60 000 times (actually, far fewer are needed), incrementing xxxx:

The previous form follows that suggested in ISO/IEC SC18 WG9 CD 14755 on the canonical form for displaying characters in the ISO 10646 repertoire when the appropriate glyph is not available.
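The suggestion that a product could use the number in the entity name as an index, rather than storing each name as a key, might look like the following minimal Python sketch. The function names `resolve_ercs_entity` and `expand_references` are hypothetical, and the sketch assumes that names are exactly `U` followed by four uppercase hex digits, as in `&U4000;`.

```python
import re

# Assumption: an ERCS entity name is "U" plus exactly four uppercase hex digits.
_ERCS_NAME = re.compile(r"U([0-9A-F]{4})")


def resolve_ercs_entity(name):
    """Return the character for a name such as 'U4000', or None if not ERCS."""
    match = _ERCS_NAME.fullmatch(name)
    if match is None:
        return None
    # The hex digits in the name are used directly as the code point.
    return chr(int(match.group(1), 16))


def expand_references(text):
    """Replace &Uxxxx; references in text, leaving other references alone."""
    def substitute(match):
        ch = resolve_ercs_entity(match.group(1))
        return ch if ch is not None else match.group(0)

    return re.sub(r"&([A-Za-z0-9.-]+);", substitute, text)


if __name__ == "__main__":
    print(expand_references("Han ideograph: &U4E00; and plain &amp;"))
```

No table of 60 000 declarations is needed with this approach; the name itself carries the code point.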
An alternate form could be possible, to fit in for Unicode fans:

A longer form of content is possible, using the ISO/IEC 10646/Unicode standard name:

Most explicitly, a quasi-formal identifier form is possible:

And finally, the character U+FFFD REPLACEMENT CHARACTER, or its alternative "*", could be used.

Entity Subsets

Including the full set is overkill, especially for most non-CJK applications. So subset entity sets have been defined.
-------------------------------------------------------------------------------
ISO WG8 N 1816

The above was presented to ISO WG8. An RTF version is available.
-------------------------------------------------------------------------------
General Rules for Determining Class and Role
Dec 10, 1995
-------------------------------------------------------------------------------
The following are the general rules for classifying characters, in order.

1. All roles assigned by ISO 8879 SGMLREF are kept.

2. Other characters are classified using ftp://unicode.org/MappingTables/UnicodeData-1.1.4.txt. It assigns a category to every character. DELMCHAR implies usable as a short reference delimiter. Class Lu includes Han ideographs: U+4E00 to U+9FFF and U+F900 to U+FAFF.

   o Lu = Uppercase Letter = ERCS UCNMSTART
   o Ll = Lowercase Letter = ERCS LCNMSTART
   o Lm = Modifier Letter = ERCS UCNMCHAR
   o Lo = Other Letter = ERCS UCNMSTART
   o Mn = Non-Spacing Mark = ERCS UCNMCHAR (20D0-20E1 UCNMCHAR but deprecated)
   o Mc = Combining Mark = ERCS UCNMCHAR
   o Nd = Decimal Number = ERCS UCNMCHAR (future: DIGIT?)
   o No = Other Number = ERCS UCNMCHAR
   o Pd = Dash Punctuation = ERCS DELMCHAR
   o Ps = Open Punctuation = ERCS DELMCHAR
   o Pe = Close Punctuation = ERCS DELMCHAR
   o Po = Other Punctuation = ERCS DELMCHAR
   o Sm = Math Symbol = ERCS DELMCHAR
   o Sc = Currency Symbol = ERCS DELMCHAR
   o So = Other Symbol = ERCS DELMCHAR
   o Zs = Space Separator = ERCS SEPCHAR (+ allowable SHORTREF)
   o Zl = Line Separator = ERCS SEPCHAR (+ allowable SHORTREF, not to be confused with RE or RS?)
   o Zp = Paragraph Separator = ERCS SEPCHAR (+ allowable SHORTREF)
   o Cc = Control or Format Character = U+0000-U+009F ERCS CONTROL; U+200C-U+206F ERCS UCNMSTART (+ allowable SHORTREF)
   o Co = Other Character (e.g. private use) = ERCS SHORTREF
   o Cn = Non-Character (i.e. not part of Unicode 1.1) = ERCS NSGML

3. Unicode characters in the "restricted" compatibility zone (FB00-FFEF) present many problems. While they are, in some respects, equivalent to other characters in the normal part of ISO 10646, ERCS does not implement any equivalence. They should only be used with caution in markup, especially because of the visual misinterpretation and spelling errors they allow.

General Block Results

This is a general guide to what characters are most commonly found in each area when the ERCS rules are applied.

Start  Stop   Block Name                               General Result
0020   007E   BASIC LATIN                              SGMLREF
00A0   00FF   LATIN-1 SUPPLEMENT                       SHORTREF or NAME
0100   017F   LATIN EXTENDED-A                         NAME
0180   024F   LATIN EXTENDED-B                         NAME
0250   02AF   IPA EXTENSIONS                           NAME
02B0   02FF   SPACING MODIFIER LETTERS                 NAME
0300   036F   COMBINING DIACRITICAL MARKS              NAME
0370   03CF   BASIC GREEK                              NAME
03D0   03FF   GREEK SYMBOLS AND COPTIC                 NAME
0400   04FF   CYRILLIC                                 NAME
0530   058F   ARMENIAN                                 NAME
0590   05CF   HEBREW EXTENDED-A                        NAME
05D0   05EA   BASIC HEBREW                             NAME
05EB   05FF   HEBREW EXTENDED-B                        NAME
0600   0652   BASIC ARABIC                             NAME
0653   06FF   ARABIC EXTENDED                          NAME
0900   097F   DEVANAGARI                               NAME
0980   09FF   BENGALI                                  NAME
0A00   0A7F   GURMUKHI                                 NAME
0A80   0AFF   GUJARATI                                 NAME
0B00   0B7F   ORIYA                                    NAME
0B80   0BFF   TAMIL                                    NAME
0C00   0C7F   TELUGU                                   NAME
0C80   0CFF   KANNADA                                  NAME
0D00   0D7F   MALAYALAM                                NAME
0E00   0E7F   THAI                                     NAME
0E80   0EFF   LAO                                      NAME
10D0   10FF   BASIC GEORGIAN                           NAME
10A0   10CF   GEORGIAN EXTENDED                        NAME
1100   11FF   HANGUL JAMO                              NAME
1E00   1EFF   LATIN EXTENDED ADDITIONAL                NAME
1F00   1FFF   GREEK EXTENDED                           NAME
2000   206F   GENERAL PUNCTUATION                      SHORTREF
2070   209F   SUPERSCRIPTS AND SUBSCRIPTS              SHORTREF
20A0   20CF   CURRENCY SYMBOLS                         SHORTREF
20D0   20FF   COMBINING DIACRITICAL MARKS FOR SYMBOLS  DATACHAR
2100   214F   LETTERLIKE SYMBOLS                       SHORTREF
2150   218F   NUMBER FORMS                             SHORTREF
2190   21FF   ARROWS                                   SHORTREF
2200   22FF   MATHEMATICAL OPERATORS                   SHORTREF
2300   23FF   MISCELLANEOUS TECHNICAL                  SHORTREF
2400   243F   CONTROL PICTURES                         SHORTREF
2440   245F   OPTICAL CHARACTER RECOGNITION            SHORTREF
2460   24FF   ENCLOSED ALPHANUMERICS                   SHORTREF
2500   257F   BOX DRAWING                              SHORTREF
2580   259F   BLOCK ELEMENTS                           SHORTREF
25A0   25FF   GEOMETRIC SHAPES                         SHORTREF
2600   26FF   MISCELLANEOUS SYMBOLS                    SHORTREF
2700   27BF   DINGBATS                                 SHORTREF
3000   303F   CJK SYMBOLS AND PUNCTUATION              SHORTREF or NAME
3040   309F   HIRAGANA                                 NAME
30A0   30FF   KATAKANA                                 NAME
3100   312F   BOPOMOFO                                 NAME
3130   318F   HANGUL COMPATIBILITY JAMO                NAME
3190   319F   CJK MISCELLANEOUS                        SHORTREF
3200   32FF   ENCLOSED CJK LETTERS AND MONTHS          SHORTREF
3300   33FF   CJK COMPATIBILITY                        SHORTREF
3400   3D2D   HANGUL                                   NAME
3D2E   44B7   HANGUL SUPPLEMENTARY-A                   NAME
44B8   4DFF   HANGUL SUPPLEMENTARY-B                   NAME
4E00   9FFF   CJK UNIFIED IDEOGRAPHS                   NAME
E000   F8FF   PRIVATE USE AREA                         SHORTREF
F900   FAFF   CJK COMPATIBILITY IDEOGRAPHS             NAME (deprecated)
FB00   FB4F   ALPHABETIC PRESENTATION FORMS            NAME (deprecated)
FB50   FDFF   ARABIC PRESENTATION FORMS-A              NAME (deprecated)
FE20   FE2F   COMBINING HALF MARKS                     NAME (deprecated)
FE30   FE4F   CJK COMPATIBILITY FORMS                  NAME (deprecated)
FE50   FE6F   SMALL FORM VARIANTS                      NAME (deprecated)
FE70   FEFE   ARABIC PRESENTATION FORMS-B              NAME (deprecated)
FF00   FFEF   HALFWIDTH AND FULLWIDTH FORMS            NAME (deprecated)
FFF0   FFFD   SPECIALS                                 SEPCHAR
-------------------------------------------------------------------------------
On Reconciling Gaiji by Short References to a Character Catalogue
November 8, 1995

Abstract: Japanese corporations frequently extend the standard character sets. The compatibility problems that this causes can be removed for SGML documents by treating these characters as short references to character entities.
-------------------------------------------------------------------------------
Introduction

Computers can share text data only if they agree on the characters being used.
The usual method for this is for the computers to use a common character set: national standards bodies promulgate such sets.

By default, SGML uses the ISO 646 character set. Outside ISO 646-using countries, this default is not so useful. SGML allows the character set of a document to be explicitly declared in an SGML declaration.

If characters are needed in a document that are not found in the document's character set, SDATA character entity references can be used. This allows the document to use names for the characters. These names must be resolved, perhaps by human intervention, into the forms known by each specific system, when the character is required.

Gaiji

Japanese users sometimes need to add extra characters to the standard character sets. This type of extra character is called a "gaiji" in Japanese and a "user-defined character" in English. Japanese corporations frequently also extend the standard character sets.

Gaiji can prevent text sharing between computers of different companies; the computers do not agree on characters, except for the national standard subset. (Gaiji create superset character sets; I am not using the term in any general sense such as "other characters needed but not found", I mean gaiji as actual extensions to registered character sets.)

If all the user-defined characters are declared in the SGML declaration as short reference delimiters, and then if short reference maps are defined for them in the DTD, then when the document is parsed, the gaiji will be effectively removed from the document character data! An application will only see standard characters; the gaiji have been replaced by SDATA character entity references.

Furthermore, if the character entity references are themselves references to a larger standard character set, in particular to ISO 10646, then the SGML system can resolve the references automatically.
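The effect of this scheme can be simulated with a short Python sketch. It is not how an SGML parser implements short references (that is done through SHORTREF and USEMAP declarations); it only shows the result an application would see. The mapping table, the use of a private-use code point to stand in for a vendor gaiji, and the function name `replace_gaiji` are all hypothetical.

```python
# Hypothetical table: each gaiji (here a private-use code point standing in
# for a vendor-defined character) maps to the name of a catalogue entity.
GAIJI_TO_ENTITY = {
    "\uE000": "U4E00",  # illustrative pairing only, not a real vendor mapping
}


def replace_gaiji(text, gaiji_map=GAIJI_TO_ENTITY):
    """Replace every gaiji in text with a reference to its catalogue entity.

    Characters not listed in gaiji_map pass through unchanged, so the
    result contains only standard characters plus entity references.
    """
    return "".join(
        "&" + gaiji_map[ch] + ";" if ch in gaiji_map else ch
        for ch in text
    )


if __name__ == "__main__":
    print(replace_gaiji("A\uE000B"))
```

Because the entity names point into ISO 10646, a receiving system can resolve them without knowing anything about the sender's private extensions.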
In other words, by using a character catalogue, such as ISO 10646, that can be known by all the computers, the situation is reached again where all the computers (i.e. the SGML applications) agree on the characters being used. Data sharing is possible, just as if all computers were using the same character set.

Impact

The importance of this is that there is no need to deprecate the use of gaiji in SGML document character sets: by treating them as short references to character entity references to characters in a large character catalogue, text sharing is possible entirely from within the current SGML model.

So rather than trying to suppress the use of gaiji, a better strategy is to promote the use of ISO 10646 as a character catalogue on all Japanese SGML systems. Of course, gaiji not found in the catalogue cannot be resolved in such a system-independent way; but they will still be SGML SDATA character entity references.
-------------------------------------------------------------------------------
Restricted Zone "Compatibility" Characters
Zenkaku & Hankaku
December 8, 1995
-------------------------------------------------------------------------------
What are they?

In ISO 10646-1 BMP UCS-2, there are a few hundred repeated characters, put into the "Restricted Zone" from U+FE30 to U+FFE6. Examples are the halfwidth katakana and fullwidth Latin characters. In Japanese, Shift-JIS and EUC-J encodings also have these repeated characters. Korean character sets have the same issue.
-------------------------------------------------------------------------------
When are they used?

UNIX users in Japan often do not use the half-width katakana form. However, PC and Macintosh users do.

Fullwidth and halfwidth characters tend to be used in different places. For example, an English phrase will be typed in half-width. But a Roman letter that is part of a Japanese word (especially contractions) may use full-width.
Software may use this "implicit markup" to key which typesetting rules and glyphs to use. The SGML model does not recognise this kind of markup.

There is a partial method to get this second usage in SGML: make all the compatibility zone characters short reference delimiters. Then, for example, HZZH (where H is half-width "H" and Z is fullwidth "Z") can be marked up to mean HZZH, which might be useful sometimes (in particular, to translate from EUC or Shift JIS to plain JIS, or to a set that doesn't have the duplication).
-------------------------------------------------------------------------------
What Could ERCS Do?

There are several ways the compatibility zone characters could be handled:

1. make them illegal;
2. treat them as DATACHAR;
3. treat them as short reference delimiters, DELMCHAR (see above);
4. make them NAME and delimiter characters, with no overlap;
5. treat the LETTERS (LATIN + KATAKANA) as lower-case equivalents of the main upper-case characters (e.g. half-width katakana "KA" is equal to fullwidth katakana "KA"; and fullwidth Latin "a" is equal to halfwidth Latin "A");
6. alter SGML to allow the compatibility zone characters to be equal in significance to their 'proper' versions, including converting the composed character sequences to their single character equivalent.
-------------------------------------------------------------------------------
What is the best solution?

1. Good, simple, encourages explicit markup, but ignores issues 1 & 2 above;
2. Good, but characters have no significance: easiest;
3. Good, but only solves usage 2 above; also, I don't want any LETTERS or KANA to be delimiters: it is confusing;
4. Bad: encourages the use of these characters;
5. Doesn't really solve anything: all compatibility zone characters should be treated the same way;
6. Best, but only solves usage 1 above. Also, this requires a big change to ISO 8879 and to parsers.
However, the character differences are preserved, so typesetting software can still make use of the "implicit markup". The ERCS earlier drafts proposed & DOCP provisionally endorsed #6. However, later discussion and discussion at WG8 in Broomfield, October 1995, decided that the third option was most practical. Though, should further experience (and requests from other national bodies) demand it, the meeting was not antagonistic to increasing the SGML parser's notion of equivalence. ------------------------------------------------------------------------------- SGML declaration for EUC March 18, 1995 ------------------------------------------------------------------------------- Here is an SGML declaration for the system. -- 0xA1C4 43 0xA1C4 0xA2A1 94 0xA2A1 -- 0xA3A1 94 0x00A1 map zenkaku Roman not allowed pending thought-- 0xA4A1 94 0xA4A1 0xA5A1 94 0xA5A1 0xA6A1 94 0xA6A1 0xA7A1 94 0xA7A1 0xA8A1 94 0xA8A1 0xA9A1 94 0xA9A1 0xA0A1 94 0xA9A1 0xAAA1 94 0xAAA1 0xABA1 94 0xABA1 0xACA1 94 0xACA1 0xADA1 94 0xADA1 0xAEA1 94 0xAEA1 0xAFA1 94 0xAFA1 0xB1A1 94 0xB1A1 ... 0xFEA1 94 0xFEA1 -- BASESET "ISO Registration Number 13//CHARSET JIS X 0201-1986//EN" DESCSET 161 63 0xA5A1 -- -- hankaku katakana not allowed pending resolution of problem -- FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9 FULLSPACE SEPCHAR 0xA1A1 NAMING LCNMSTRT -- not yet finished -- 161 ... 223 -- substitute: never used -- 0xA521 ... 0xA2FE ... 0xFE21 ... 0xFEFE UCNMSTRT 161 ... 223 -- substitute: never used -- 0xA521 ... 0xA2FE ... 0xFE21 ... 0xFEFE LCNMCHAR ".-" UCNMCHAR ".-" NAMECASE GENERAL YES ENTITY NO DELIM GENERAL SGMLREF SHORTREF SGMLREF 0xA1A2 ... 0xA1FE 0xA221 ... 0xA2FE 0xA8A1 ... 0xA8CF NAMES SGMLREF QUANTITY SGMLREF ATTSPLEN 1920 -- ?? -- LITLEN 240 -- ?? -- NAMELEN 64 -- ?? -- PILEN 1920 -- ?? -- TAGLEN 1920 -- ?? 
--
-------------------------------------------------------------------------------
European Application
Nov 30, 1995
-------------------------------------------------------------------------------
ERCS has been prompted by East Asian needs in general and CJK document needs in particular. Nevertheless, it should be applicable to European national and Economic Community needs as well.

For a country that uses ISO 8859-2 as its national character set, for example, native-language tagging dictates that more than ISO 8879:1986 RCS be used: so "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-2 repertoire)//EN" would be useful.

For multilingual, multiple-character-set documents, each could be created with the ERCS of its character set's repertoire. In other words, one document of an ISO 8859-2 nation could be created using "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-2 repertoire)//EN" while another document of an ISO 8859-3 nation could be created using "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-3 repertoire)//EN". But both could then be incorporated into the same document (as SUBDOC documents, for example) merely by giving the full "-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN" as the document syntax (the same syntax has to apply to subdocuments as well as documents). The Entity Manager would have to convert the incoming documents into the appropriate composite or unified character set of course, for example ISO/IEC 10646-1 (Unicode).

This seems to present a workable alternative to the current "lowest-common-denominator" approach of using ISO 646 as the syntax reference character set for RCS. While using RCS is certainly useful for many such things, it seems strange that, for a fictitious example, in a document from Bulgaria that might have both Cyrillic and Greek subdocuments, both would have to be marked up in ISO 646: in other words, because not all multinational documents involve Latin characters, it seems otiose to invoke them merely for markup.
A better example, perhaps, might be a Middle East document containing both Hebrew and Arabic subdocuments. Using ERCS, the markup need not be in ISO 646, but could use words in the same language as the documents: words presumably more familiar to the users of the documents.
-------------------------------------------------------------------------------
Suggested Strategy for Implementing ERCS in Products
Dec 8, 1995
-------------------------------------------------------------------------------
 * Vendors should support ellipses in the SGML declaration syntax declaration, for NAMING and SHORTREFs. (Proposed for ISO 8879)
 * Vendors should support NMCHAR and NMSTRT character classes in the NAMING declaration. (Proposed for ISO 8879)
 * Vendors should support invocation of syntax declaration through public identifiers.
 * Vendors should specify (in the system declaration) whether their system allows 16-bit markup characters, and whether there are restrictions on the number of shortrefs or SDATA entities. Any 8-bit dependencies should be noted.
 * Vendors should make sure that their products support long LITLEN.
 * Include a copy of "-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN"
 * Include copies of "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-1 repertoire)//EN" to "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-9 repertoire)//EN" as they are developed with distributions of the SGML application.
 * Implement the algorithmic processing of ISO/IEC 10646-1:1993 public entity characters (i.e. those beginning with U), and build them in so they don't need to be read in from a file. (If the proposed CHAR declarations are not accepted into ISO 8879.)
 * Consider building or bundling tools to help clean up incoming data; in error handling of name misspelling, consider the half-width/full-width equivalence and the small and normal katakana equivalence for Japanese; for Indic and Middle Eastern languages consider canonical ordering of characters; for European and accented languages consider the equivalence of combined characters and of base+accent sequences. This will help robustness, without sacrificing validatability.
-------------------------------------------------------------------------------
Subset Repertoires
November 30, 1995
-------------------------------------------------------------------------------
Where a document is to be parsed on a system with a system character set that is a subset of ISO/IEC 10646-1, or where a document will be stored or transmitted using such a character subset, it is prudent to restrict the repertoire of characters that can be used for markup and CDATA data content.

Because this is likely to be the normal case, several subsets have been defined: if a user on a Unicode 1.1 system is making SGML documents for a Western European country, the ISO 8859-1 subset is appropriate.
Or a Japanese user creating a document in EUC, but with a target of a Shift-JIS computer, would be wise to adopt the discipline of only using JIS X 0208:1990 characters for markup and CDATA data content (perhaps using the ISO/IEC 10646-1 Public Character Entities to refer to other characters):

 * "-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN"
 * "-//SPREAD ERCS//SYNTAX Extended (ISO 646 repertoire)//EN"
 * "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-1 repertoire)//EN"
 * "-//SPREAD ERCS//SYNTAX Extended (PC 1252 repertoire)//EN"
 * "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-n repertoire)//EN"
 * "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 repertoire)//EN"
 * "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 + JIS X 0212:1990 repertoire)//EN"

Note: for the sake of illustration, the formal public identifiers "-//SPREAD ERCS//SYNTAX Extended (xxx repertoire)//EN" are being used. (The "xxx" should be filled in later.) The public identifier and many details may change.
-------------------------------------------------------------------------------
"-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN"
March 18, 1995
-------------------------------------------------------------------------------
This is the syntax for use with Version 1.1 of the Unicode Character Standard and ISO/IEC 10646-1 BMP character sets. See Rules for Determining Syntax for information on how this has been generated. See also Syntax Extensions needed for External Syntax Entities for information on the non-standard syntax used below.
SHUNCHAR CONTROLS BASESET "ISO Registration Number 176//CHARSET ISO/IEC 10646-1:1993 UCS-2 with implementation level 3//ESC 2/5 2/15 4/5" DESCSET 0 65536 0 -- 16 bit -- FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9 "NO-BREAK-SPACE" SEPCHAR 160 "EN-QUAD" SEPCHAR 8192 "EM-QUAD" SEPCHAR 8193 "EN-SPACE" SEPCHAR 8194 "EM-SPACE" SEPCHAR 8195 "THREE-PER-EM-SPACE" SEPCHAR 8196 "FOUR-PER-EM-SPACE" SEPCHAR 8197 "SIX-PER-EM-SPACE" SEPCHAR 8198 "FIGURE-SPACE" SEPCHAR 8199 "PUNCTUATION-SPACE" SEPCHAR 8200 "THIN-SPACE" SEPCHAR 8201 "HAIR-SPACE" SEPCHAR 8202 "ZERO-WIDTH-SPACE" SEPCHAR 8203 "IDEOGRAPHIC-SPACE" SEPCHAR 12288 "ZERO-WIDTH-NO-BREAK-SPACE" SEPCHAR 65279 NAMING LCNMSTRT 12353 -- hiragana small vowels A, I, U, E, O -- 12355 12357 12359 12361 12387 -- hiragana small TU, YA, YU, YO, WA -- 12419 12421 12423 12430 12449 -- katakana small vowels -- 12451 12453 12455 12457 12483 -- katakana small TU, YA, YU, YO, WA -- 12515 12517 12519 12526 224 ... 246 248 ... 255 257 259 261 263 265 267 269 271 273 275 277 279 281 283 285 287 289 291 293 295 297 299 301 303 307 309 311 314 316 318 320 322 324 326 328 331 333 335 337 339 341 343 345 347 349 351 353 355 357 359 361 363 365 367 369 371 373 375 378 380 382 387 389 392 396 402 409 417 419 421 424 429 432 436 438 441 445 453 454 456 457 459 460 462 464 466 468 470 472 474 476 479 481 483 485 487 489 491 493 495 498 499 501 507 509 511 513 515 517 519 521 523 525 527 529 531 533 535 595 596 599 ... 601 603 608 611 616 617 623 626 643 648 650 651 658 940 ... 943 945 ... 961 963 ... 974 976 977 981 982 995 997 999 1001 1003 1005 1007 ... 1009 1072 ... 1103 1105 ... 1116 1118 1119 1121 1123 1125 1127 1129 1131 1133 1135 1137 1139 1141 1143 1145 1147 1149 1151 1153 1169 1171 1173 1175 1177 1179 1181 1183 1185 1187 1189 1191 1193 1195 1197 1199 1201 1203 1205 1207 1209 1211 1213 1215 1218 1220 1224 1228 1233 1235 1237 1239 1241 1243 1245 1247 1249 1251 1253 1255 1257 1259 1263 1265 1267 1269 1273 1377 ... 
1414 7681 7683 7685 7687 7689 7691 7693 7695 7697 7699 7701 7703 7705 7707 7709 7711 7713 7715 7717 7719 7721 7723 7725 7727 7729 7731 7733 7735 7737 7739 7741 7743 7745 7747 7749 7751 7753 7755 7757 7759 7761 7763 7765 7767 7769 7771 7773 7775 7777 7779 7781 7783 7785 7787 7789 7791 7793 7795 7797 7799 7801 7803 7805 7807 7809 7811 7813 7815 7817 7819 7821 7823 7825 7827 7829 7841 7843 7845 7847 7849 7851 7853 7855 7857 7859 7861 7863 7865 7867 7869 7871 7873 7875 7877 7879 7881 7883 7885 7887 7889 7891 7893 7895 7897 7899 7901 7903 7905 7907 7909 7911 7913 7915 7917 7919 7921 7923 7925 7927 7929 7936 ... 7943 7952 ... 7957 7968 ... 7975 7984 ... 7991 8000 ... 8005 8017 8019 8021 8023 8032 ... 8039 8048 ... 8061 8064 ... 8071 8080 ... 8087 8096 ... 8103 8112 8113 8115 8131 8144 8145 8160 8161 8165 8179 8560 ... 8575 9424 ... 9449 UCNMSTRT 12354 -- hiragana normal vowels AIUEO -- 12356 12358 12360 12362 12388 -- hiragana normal TU, YA, YU, YO, WA -- 12420 12422 12424 12430 12450 -- katakana normal vowels -- 12452 12454 12456 12458 12484 -- katakana normal TU, YA, YU, YO, WA -- 12516 12518 12520 12527 192 ... 214 216 ... 222 376 256 258 260 262 264 266 268 270 272 274 276 278 280 282 284 286 288 290 292 294 296 298 300 302 306 308 310 313 315 317 319 321 323 325 327 330 332 334 336 338 340 342 344 346 348 350 352 354 356 358 360 362 364 366 368 370 372 374 377 379 381 386 388 391 395 401 408 416 418 420 423 428 431 435 437 440 444 452 452 455 455 458 458 461 463 465 467 469 471 473 475 478 480 482 484 486 488 490 492 494 497 497 500 506 508 510 512 514 516 518 520 522 524 526 528 530 532 534 385 390 394 398 ... 400 403 404 407 406 412 413 425 430 433 434 439 902 904 ... 906 913 ... 929 931 ... 939 908 910 911 914 920 934 928 994 996 998 1000 1002 1004 1006 922 929 1040 ... 1071 1025 ... 
1036 1038 1039 1120 1122 1124 1126 1128 1130 1132 1134 1136 1138 1140 1142 1144 1146 1148 1150 1152 1168 1170 1172 1174 1176 1178 1180 1182 1184 1186 1188 1190 1192 1194 1196 1198 1200 1202 1204 1206 1208 1210 1212 1214 1217 1219 1223 1227 1232 1234 1236 1238 1240 1242 1244 1246 1248 1250 1252 1254 1256 1258 1262 1264 1266 1268 1272 1329 ... 1366 7680 7682 7684 7686 7688 7690 7692 7694 7696 7698 7700 7702 7704 7706 7708 7710 7712 7714 7716 7718 7720 7722 7724 7726 7728 7730 7732 7734 7736 7738 7740 7742 7744 7746 7748 7750 7752 7754 7756 7758 7760 7762 7764 7766 7768 7770 7772 7774 7776 7778 7780 7782 7784 7786 7788 7790 7792 7794 7796 7798 7800 7802 7804 7806 7808 7810 7812 7814 7816 7818 7820 7822 7824 7826 7828 7840 7842 7844 7846 7848 7850 7852 7854 7856 7858 7860 7862 7864 7866 7868 7870 7872 7874 7876 7878 7880 7882 7884 7886 7888 7890 7892 7894 7896 7898 7900 7902 7904 7906 7908 7910 7912 7914 7916 7918 7920 7922 7924 7926 7928 7944 ... 7951 7960 ... 7965 7976 ... 7983 7992 ... 7999 8008 ... 8013 8025 8027 8029 8031 8040 ... 8047 8122 8123 8136 ... 8139 8154 8155 8184 8185 8170 8171 8186 8187 8072 ... 8079 8088 ... 8095 8104 ... 8111 8120 8121 8124 8140 8152 8153 8168 8169 8172 8188 8544 ... 8559 9398 ... 9423 223 304 305 312 329 383 384 393 397 405 410 411 414 415 422 426 427 442 443 446 ... 451 477 496 592 ... 594 597 598 602 604 ... 607 609 610 612 ... 615 618 ... 622 624 625 627 ... 642 644 ... 647 649 652 ... 657 659 ... 680 912 944 962 978 ... 980 986 988 990 992 1010 1011 1216 1415 1488 ... 1514 1520 ... 1522 1569 ... 1594 1600 ... 1610 1649 ... 1719 1722 ... 1726 1728 ... 1742 1744 ... 1747 1749 2309 ... 2361 2392 ... 2401 2437 ... 2444 2447 2448 2451 ... 2472 2474 ... 2480 2482 2486 ... 2489 2524 2525 2527 ... 2529 2544 2545 2565 ... 2570 2575 2576 2579 ... 2600 2602 ... 2608 2610 2611 2613 2614 2616 2617 2649 ... 2652 2654 2693 ... 2699 2701 2703 ... 2705 2707 ... 2728 2730 ... 2736 2738 2739 2741 ... 2745 2784 2821 ... 2828 2831 2832 2835 ... 
2856 2858 ... 2864 2866 2867 2870 ... 2873 2908 2909 2911 ... 2913 2949 ... 2954 2958 ... 2960 2962 ... 2965 2969 2970 2972 2974 2975 2979 2980 2984 ... 2986 2990 ... 2997 2999 ... 3001 3056 ... 3058 3077 ... 3084 3086 ... 3088 3090 ... 3112 3114 ... 3123 3125 ... 3129 3168 3169 3205 ... 3212 3214 ... 3216 3218 ... 3240 3242 ... 3251 3253 ... 3257 3294 3296 3297 3333 ... 3340 3342 ... 3344 3346 ... 3368 3370 ... 3385 3424 3425 3585 ... 3632 3634 3635 3648 ... 3654 3663 3674 3675 3713 3714 3716 3719 3720 3722 3725 3732 ... 3735 3737 ... 3743 3745 ... 3747 3749 3751 3754 3755 3757 3758 3760 3762 3763 3773 3776 ... 3780 4256 ... 4293 4304 ... 4342 4352 ... 4441 4447 ... 4514 4520 ... 4601 7830 ... 7834 8016 8018 8020 8022 8114 8116 8118 8119 8130 8132 8134 8135 8146 8147 8150 8151 8162 ... 8164 8166 8167 8178 8180 8182 8183 8204 ... 8207 8234 ... 8238 8298 ... 8303 12295 12321 ... 12329 -- hangzhou numerals -- 12363 ... 12386 -- hiragana (smalls & equivs earlier) -- 12389 ... 12418 12425 ... 12429 12431 ... 12538 12459 ... 12482 -- katakana (not smalls & UC equivs) -- 12484 ... 12514 12521 ... 12525 12528 ... 12538 12549 ... 12588 -- bopomofo -- 12593 ... 12686 -- hangul elements -- 13312 ... 40869 -- Han ideographs -- 63744 ... 64046 -- Han ideographs compatibility -- LCNMCHAR 45 46 UCNMCHAR 45 46 168 175 180 184 688 ... 734 736 ... 745 768 ... 837 864 865 890 900 901 1155 ... 1158 1369 1456 ... 1465 1467 ... 1469 1471 1473 1474 1611 ... 1618 1632 ... 1641 1648 1750 ... 1768 1770 ... 1773 1776 ... 1785 2305 ... 2307 2364 2366 ... 2381 2385 ... 2388 2402 2403 2406 ... 2415 2433 ... 2435 2492 2494 ... 2500 2503 2504 2507 ... 2509 2519 2530 2531 2534 ... 2543 2562 2620 2622 ... 2626 2631 2632 2635 ... 2637 2662 ... 2673 2689 ... 2691 2748 2750 ... 2757 2759 ... 2761 2763 ... 2765 2790 ... 2799 2817 ... 2819 2876 2878 ... 2883 2887 2888 2891 ... 2893 2902 2903 2918 ... 2927 2946 2947 3006 ... 3010 3014 ... 3016 3018 ... 3021 3031 3047 ... 3055 3073 ... 3075 3134 ... 
3140 3142 ... 3144 3146 ... 3149 3157 3158 3174 ... 3183 3202 3203 3262 ... 3268 3270 ... 3272 3274 ... 3277 3285 3286 3302 ... 3311 3330 3331 3390 ... 3395 3398 ... 3400 3402 ... 3405 3415 3430 ... 3439 3633 3636 ... 3642 3655 ... 3662 3664 ... 3673 3761 3764 ... 3769 3771 3772 3784 ... 3789 3792 ... 3801 8125 ... 8129 8141 ... 8143 8157 ... 8159 8173 ... 8175 8189 8190 8400 ... 8417 9332 ... 9340 9352 ... 9360 9450 12330 ... 12335 12441 ... 12446 -- hiragana marks -- 12540 ... 12542 -- katakana marks -- NAMECASE GENERAL YES ENTITY NO DELIM GENERAL SGMLREF SHORTREF SGMLREF " " ... "§" "©" ... "®" "°" ... "³" "µ" ... "·" "¹" ... "¿" "×" "÷" "ʹ" "͵" ";" "·" "҂" "՚" ... "՟" "։" "־" "׀" "׃" "׳" "״" "،" "؛" "؟" "٪" ... "٭" "۔" "۩" "ऽ" "ॐ" "।" "॥" "॰" "৲" ... "৺" "ੲ" ... "ੴ" "ઽ" "ૐ" "ଽ" "୰" "฿" "ຯ" "ໆ" "ໜ" "ໝ" "჻" " " ... "‮" "‰" ... "⁆" "" ... "⁰" "⁴" ... "₎" "₠" ... "₪" "℀" ... "ℸ" "⅓" ... "ↂ" "←" ... "⇪" "∀" ... "⋱" "⌀" "⌂" ... "⍺" "␀" ... "␤" "⑀" ... "⑊" "①" ... "⑳" "⑽" ... "⒇" "⒑" ... "ⓩ" "─" ... "▕" "■" ... "◯" "☀" ... "☓" "☚" ... "♯" "✁" ... "✄" "✆" ... "✉" "✌" ... "✧" "✩" ... "❋" "❍" "❏" ... "❒" "❖" "❘" ... "❞" "❡" ... "❧" "❶" ... "➔" "➘" ... "➯" "➱" ... "➾" " " ... "〆" "〈" ... "〠" "〰" ... "〷" "〿" "・" "㆐" ... "㆟" "㈀" ... "㈜" "㈠" ... "㉃" "㉠" ... "㉻" "㉿" ... "㊰" "㋀" ... "㋋" "㋐" ... "㋾" "㌀" ... "㍶" "㍻" ... "㏝" "㏠" ... "㏾" "﬩" "﴾" "﴿" "︰" ... "﹄" "﹉" ... "﹒" "﹔" ... "﹦" "﹨" ... "﹫" "" "!" ... "/" ":" ... "@" "[" ... "`" "{" ... "~" "。" ... "・" "ー" "゙" "゚" "¢" ... "₩" "│" ... "○" "�" NAMES SGMLREF QUANTITY SGMLREF -- To be determined -- ATTSPLEN 1920 -- ?? -- LITLEN 240 -- ?? -- NAMELEN 240 -- ?? -- PILEN 1920 -- ?? -- TAGLEN 1920 -- ?? -- ------------------------------------------------------------------------------- Status of ERCS November 30, 1995 ------------------------------------------------------------------------------- The ERCS was endorsed by the China/Japan/Korea Document Processing ad hoc committee (liason to ISO WG8). 
I am using SPREAD ERCS in the current public identifier. The ERCS is being developed for future SGML Open consideration, and has not been adopted, recommended or authorized by SGML Open. Members of SGML Open have been involved, and SGML Open kindly hosted earlier drafts of ERCS on its website.
-------------------------------------------------------------------------------
TITLE  : Liaison to Mr. R. Jelliffe from CJK DOCP
SOURCE : CJK DOCP
DATE   : March 3, 1995

With regard to +//SGML Open:1995//SYNTAX Extended//EN, East Asian Document Issues: A proposal for an extended reference concrete syntax by Rick Jelliffe, January 7, 1995.

It is the consensus of the participants at the 7th CJK DOCP meeting (Kanazawa Japan, 2-3 March 1995) that Mr. Jelliffe's proposal appears to solve many of the problems with using ISO/IEC 10646 with SGML for East Asian document processing. The CJK DOCP working project is interested in working on the development of this proposal with the goal of its eventual incorporation into a revision of ISO 8879 standard.
-------------------------------------------------------------------------------
In my capacity as Australian delegate to ISO WG8, I presented the suggestions for changes to the ISO 8879 syntax and system declaration to the Broomfield, USA, meeting of October 1995. The proposals were accepted in principle, and had the support of Japan in particular. These are available in RTF files as notes WG8 N 1815 and N 1816.
-------------------------------------------------------------------------------
CJK Issues
November 30, 1995
-------------------------------------------------------------------------------
Han Ideographic Characters

All Han ideographic characters are NAME characters.
-------------------------------------------------------------------------------
Spaces

All spaces are SEPCHARs.
-------------------------------------------------------------------------------
Gaiji and User-defined Characters

All user-defined characters should be SHORTREFs.
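The three rules above (Han ideographs, spaces, user-defined characters) can be sketched as a small classifier. This is only an illustration, not part of the ERCS: the SEPCHAR and Han code point ranges are copied from the ISO/IEC 10646-1 declaration in this document, while the name `classify`, the catch-all "OTHER" class, and the reading of "user-defined" as the BMP private use zone (U+E000 to U+F8FF) are assumptions made for the example.

```python
# Code points declared as SEPCHAR in the ISO/IEC 10646-1 declaration above.
SEPCHARS = {9, 160, 12288, 65279} | set(range(8192, 8204))

# Han ideograph ranges as commented in that declaration.
HAN_RANGES = [(13312, 40869), (63744, 64046)]

# Assumption: user-defined characters (gaiji) live in the private use zone.
PRIVATE_USE = (0xE000, 0xF8FF)


def classify(code_point: int) -> str:
    """Return a rough ERCS class for a code point (hypothetical helper)."""
    if code_point in SEPCHARS:
        return "SEPCHAR"
    if any(low <= code_point <= high for low, high in HAN_RANGES):
        return "NAME"
    if PRIVATE_USE[0] <= code_point <= PRIVATE_USE[1]:
        return "SHORTREF"
    return "OTHER"  # everything this sketch does not cover


if __name__ == "__main__":
    for cp in (0x6F22, 0x3000, 0xE000, 0x0061):
        print(f"U+{cp:04X} -> {classify(cp)}")
```

A real implementation would read the classes from the SGML declaration itself rather than hard-coding them.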
-------------------------------------------------------------------------------
Numbers

All native numbers and digits except for the DIGITs (0..9) should be NMSTRT characters. (This does not apply to superscripts, subscripts, numbers inside circles, and other characters that don't represent simple numbers: these are just SHORTREFs.)
-------------------------------------------------------------------------------
Repeat Characters

The KANA REPEAT character is an NMCHAR character, but is best avoided.
-------------------------------------------------------------------------------
Half-width and Full-width Characters (Hankaku/Zenkaku) and Character Substitution

For a detailed discussion on compatibility zone characters, see Compatibility Zone Characters.

The CJK DOCP group recommends that the half-width and full-width characters should not be treated as equivalent for purposes of markup. However, use of half-width katakana and full-width roman letters and numbers in markup is deprecated.

The ISO 8879:1986 model could be extended, while maintaining backwards compatibility, by generalizing the case-mapping mechanism into a string->character mapping mechanism. This would resolve the half-width/full-width problem properly, and also meet similar needs in several other languages.

A related problem that some languages have in some character sets (Thai, Hebrew, Arabic, and perhaps Vietnamese) is that the same glyph-character can be spelled in several different ways. This problem is particularly present in ISO/IEC 10646-1:1993 (Unicode 1.1). This may be regarded as a normalisation problem for the entity manager (or, for example, FIND rules in an OmniMark context-translation) to handle. Vendors of tools would do well to add some proprietary spelling correctors for this specific problem.
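As a rough illustration of the "spelling corrector" idea, the sketch below folds alternative spellings of the same character into one form before names are compared. It uses Unicode Normalization Form C from Python's standard `unicodedata` module as a stand-in; normalization forms were defined after Unicode 1.1 and are not part of the ERCS, and the function name `normalise_spelling` is hypothetical. NFC is chosen because it unifies only canonically equivalent spellings, so half-width and full-width forms stay distinct, in line with the CJK DOCP recommendation above.

```python
import unicodedata


def normalise_spelling(text: str) -> str:
    """Fold canonically equivalent spellings into one form (NFC)."""
    return unicodedata.normalize("NFC", text)


# Vietnamese e with circumflex and acute: one precomposed character
# versus a base letter followed by two combining marks.
precomposed = "\u1ebf"
decomposed = "e\u0302\u0301"

# The raw strings differ, but they compare equal once normalised.
print(precomposed == decomposed)
print(normalise_spelling(precomposed) == normalise_spelling(decomposed))

# Half-width KA (U+FF76) and full-width KA (U+30AB) are not merged by NFC.
print(normalise_spelling("\uff76") == normalise_spelling("\u30ab"))
```

Folding half-width and full-width forms together would need a compatibility mapping instead (NFKC in today's terms), which is closer to the generalized string->character mapping discussed above.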
-------------------------------------------------------------------------------
Japanese Examples

* Example of a simple SGML declaration
* Example of an SGML declaration for EUC
* "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 + JIS X 0212:1990 repertoire)//EN"
-------------------------------------------------------------------------------
Change Log and Significant Queries
Dec 12, 1995
-------------------------------------------------------------------------------
Dec 12, 1995
(Tony Graham) Combining Diacritical Marks for Symbols can't be SHORTREF, if we want to keep only single-character SHORTREFs. Changed.

Nov 30, 1995
(Rick Jelliffe) simplify & homologate. Use "SPREAD ERCS" instead of "SGML Open:TR95xx" for convenience.

Sept 4, 1995
(Rick Jelliffe) user-defined are SHORTREF. Page on Gaiji added.
(Prof Eiji Matsuoka, SGML Asia/Pacific '95) Use Heiwa Kanten etc as character entity catalogue for characters that can't be found in ISO 10646 (+locale).
(following WG8 Broomfield) Page on ERCS system identifiers removed: superseded by new HyTime FSI, which is essentially the same. External numeric character references & CHAR declaration mooted. Add notes presented at WG8.
(CJK DOCP Alaska meeting) don't fold half- & full-width, and small and normal katakana. NMCHAR and NMSTRT classes endorsed. No need for NUMBER to include native digits.

June 15, 1995
(Rick Jelliffe) entity invoked as &Uxxxx; rather than &U-xxxx; to fit in with ISO WG9 CD 14755 and TERENA. Page on European application added.

March 30, 1995
(Rick Jelliffe) Ignore restricted zone compatibility characters while justifications are gathered (user's SGML decl decides if DATACHAR or NSGML). Entity subsets. Fix rules. Add more info on restricted zone. Dump PIs for SYSTEM identifier prefix.
(Glenn Adams) Standard names, terminology & phrasing improvements. Extended digits?
(James Clark) Standard names, terminology & phrasing improvements.
Extended digits? Attributes can be replaceable character data. Refer to new Unicode character classes for SGML character classes. Suggestions for syntax changes (provides custom version of SP for testing and demonstrating these): SGML production 189 extension in Handbook regarded as essential minimum. Use SYSTEM identifier instead of PI? Simplify spaces. Compatibility zone handling method requires too much change to parsers? Announcement of details of ERCS-support features for SP.
(Gavin Nicol) Character equivalence discussion. Applicability to WWW.

March 8, 1995
First public request for comment.

To Do: Complete Shift-JIS & EUC declarations
-------------------------------------------------------------------------------
(C) 1995 Rick Jelliffe. May be freely copied and translated.
Email comments to ricko@allette.com.au
Recent development version at http://www.allette.com.au/sgml/ercs/ercs-home.html