The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: July 08, 2005
XML and Unicode

The Unicode Standard is "a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. It is a fundamental component of all modern software and information technology protocols. In addition, it supports classical and historical texts of many written languages. It provides a uniform, universal architecture and encoding (with over 96,000 characters currently encoded) and is the basis for processing, storage, and seamless data interchange of text data worldwide. Unicode is required by modern standards such as XML, Java, C#, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, IDN, etc., and is the official way to implement ISO/IEC 10646."

Unicode is the basis for XML: legal XML characters "are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646, and all XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1. The Extensible Markup Language (XML) 1.0 Third Edition specification, as a W3C Recommendation, normatively references associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes) to provide all the information necessary to understand XML Version 1.0 and construct computer programs to process it."
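
As an illustration of the character rules summarized above, the following minimal Python sketch tests code points against the Char production of XML 1.0 and round-trips a sample string through the two encodings every XML processor must accept. The function names and the sample string are hypothetical, chosen only for this example; the sketch is not taken from the XML specification or from any XML library.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # BMP below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # rest of the BMP, minus U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def illegal_xml10_chars(text: str) -> list[str]:
    """List the code points in text that XML 1.0 does not allow."""
    return [f"U+{ord(c):04X}" for c in text if not is_xml10_char(c)]


# Hypothetical sample: Latin text plus one supplementary-plane character.
sample = "caf\u00e9 \U00010330"
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16")  # Python prepends a byte order mark

print(illegal_xml10_chars("a\x0bb"))   # vertical tab is not a legal XML 1.0 character
print(len(utf8_bytes), len(utf16_bytes))
```

Both byte strings decode back to the same sequence of code points, which is why a processor that accepts both encodings sees identical content either way.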

In March 2005 the Unicode Consortium announced the release of Version 4.1.0 of the Unicode Standard. This new version adds 1,273 new characters, including those necessary to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts (New Tai Lue, Buginese, Glagolitic, Coptic, Tifinagh, Syloti Nagri, Old Persian, Kharoshthi). There are additions for Biblical Hebrew and editorial marks for Biblical Text annotation. Unicode 4.1.0 adds two new Unicode Standard Annexes: UAX #31 (Identifier and Pattern Syntax) and UAX #34 (Unicode Named Character Sequences). Significant additions and changes have been made to the Unicode Character Database properties, which determine the behavior of characters in modern software. The release of Unicode 4.1 [was to] be soon followed by a new release of the Unicode Collation Algorithm, for language-sensitive sorting, searching, and matching; by Unicode Regular Expressions, setting the standard for handling Unicode characters in regular expressions; and by a new draft of Unicode Security Considerations.

Unicode version 3.0.0 was announced in September 1999. "Unicode 3.0 is the next major release of the Unicode Standard, to be published in early 2000 as The Unicode Standard, Version 3.0. Because of the significance of this release, the Unicode Technical Committee has finalized the Unicode Character Database Version 3.0 in advance of the publication of the book, allowing normative reference to that data."

XML-Related Unicode Technical Reports

  • Character Encoding Model. Unicode Technical Report #17. 2004-09-09. "This report describes a model for the structure of character encodings. The Unicode Character Encoding Model places the Unicode Standard in the context of other character encodings of all types, as well as existing models such as the character architecture promoted by the Internet Architecture Board (IAB) for use on the internet, or the Character Data Representation Architecture (CDRA) defined by IBM for organizing and cataloging its own vendor-specific array of character encodings... The mapping from a sequence of members of an abstract character repertoire to a serialized sequence of bytes is called a Character Map (CM). A simple character map thus implicitly includes a CCS, a CEF, and a CES, mapping from abstract characters to code units to bytes. A compound character map includes a compound CES, and thus includes more than one CCS and CEF. In that case, the abstract character repertoire for the character map is the union of the repertoires covered by the coded character sets involved. UTR-22 'Character Mapping Markup Language' defines an XML specification for representing the details of Character Maps."

  • Unicode in XML and Other Markup Languages. Unicode Technical Report #20. W3C Note 13 June 2003. Technical Report published jointly by the Unicode Technical Committee and by the W3C Internationalization Working Group/Interest Group in the context of the W3C Internationalization Activity. See the W3C version.

  • Character Mapping Markup Language (CharMapML). Unicode Technical Standard #22. "This document specifies an XML format for the interchange of mapping data for character encodings, and describes some of the issues connected with the use of character conversion. It provides a complete description for such mappings in terms of a defined mapping to and from Unicode, and a description of alias tables for the interchange of mapping table names..." See the local reference.

  • The Unicode Character Property Model. Unicode Technical Report #23. 2004-07-12. This report presents a conceptual model of character properties defined in the Unicode Standard.

  • Unicode Support for Mathematics. Unicode Technical Report #25. "Starting with version 3.2, Unicode includes virtually all of the standard characters used in mathematics. This set supports a variety of math applications on computers, including document presentation languages like TeX, math markup languages like W3C MathML and OpenMath, internal representations of mathematics in systems like Mathematica, Maple, and MathCAD, computer programs, and plain text. This technical report describes the Unicode mathematics character groups and gives some of their imputed default math properties..."

  • Locale Data Markup Language (LDML). Unicode Technical Standard #35. Version 1.3. 2005-06-02. This document describes an XML format (vocabulary) for the exchange of structured locale data. A locale [in this document] "is an id that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services... There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale. This document describes one of those pieces, an XML format for the communication of locale data. With it, for example, collation rules can be exchanged, allowing two implementations to exchange a specification of collation. Using the same specification, the two implementations will achieve the same results in comparing strings..." See "Unicode Consortium Hosts the Common Locale Data Repository (CLDR) Project" and "Unicode Releases Common Locale Data Repository, Version 1.3."

  • Unicode Security Considerations. By Mark Davis (IBM) and Michel Suignard (Microsoft). Unicode Technical Report #36. Revision 3 date: 2005-07-07. The TR "describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account [when using the Unicode Standard], and provides specific recommendations to reduce the risk of problems." A number of visual security issues have arisen in connection with (visual) spoofing, and this threat provides the basis for the technical report. The new Unicode Security Considerations Technical Report from the Unicode Consortium "provides an initial step towards reducing the risk of such problems while preserving the ability to have internationalized domain names for all the modern languages of the world." Security issues identified and addressed in the report include Internationalized Domain Names, Mixed-Script Spoofing, Single-Script Spoofing, Inadequate Rendering Support, Bidirectional Text Spoofing, Syntax Spoofing, and Numeric Spoofs. In many ways, according to the TR introduction, "the use of Unicode makes programs much more robust and secure. When systems used a hodge-podge of different charsets for representing characters, there were security and corruption problems that resulted from differences between those charsets, or from the way in which programs converted to and from them. But because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks." See "New Unicode Consortium Technical Report on Unicode Security Considerations."
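
The mixed-script spoofing issue named in the Security Considerations entry above can be illustrated with a rough Python sketch. It is only a heuristic: the Python standard library does not expose the Unicode Script property, so the sketch assumes that the first word of a character's formal name (for example LATIN or CYRILLIC) is an adequate stand-in for its script. The function names and the sample labels are hypothetical, and the approach is not the detection procedure recommended by UTR #36.

```python
import unicodedata


def rough_scripts(label: str) -> set[str]:
    """Collect an approximate script tag for each letter in label.

    Assumption: the first word of the character name approximates the
    script. A real implementation would consult the Script property
    from the Unicode Character Database instead.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters appear to come from more than one script."""
    return len(rough_scripts(label)) > 1


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(rough_scripts(spoofed))
print(looks_mixed_script(genuine), looks_mixed_script(spoofed))
```

A check this crude would also flag legitimately mixed text, such as Japanese written with kana and Han characters together, so it is best read as a demonstration of the problem rather than a usable filter.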

General: Articles, News, Papers

  • [July 08, 2005]   New Unicode Consortium Technical Report on Unicode Security Considerations.    Unicode Technical Report #36 on Unicode Security Considerations "describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account [when using the Unicode Standard], and provides specific recommendations to reduce the risk of problems." A number of visual security issues have arisen in connection with (visual) spoofing, and this threat provides the basis for the technical report. The new Unicode Security Considerations Technical Report from the Unicode Consortium "provides an initial step towards reducing the risk of such problems while preserving the ability to have internationalized domain names for all the modern languages of the world." Security issues identified and addressed in the report include Internationalized Domain Names, Mixed-Script Spoofing, Single-Script Spoofing, Inadequate Rendering Support, Bidirectional Text Spoofing, Syntax Spoofing, and Numeric Spoofs. In many ways, according to the TR introduction, "the use of Unicode makes programs much more robust and secure. When systems used a hodge-podge of different charsets for representing characters, there were security and corruption problems that resulted from differences between those charsets, or from the way in which programs converted to and from them. But because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks." The authors of the Unicode Security Considerations Technical Report envision that the document "should grow over time, adding additional sections as needed. Initially, it is organized into two sections: visual security issues and non-visual security issues. Each section presents background information on the kinds of problems that can occur, then lists specific recommendations for reducing the risk of such problems."

  • [June 02, 2005] "Unicode Releases Common Locale Data Repository, Version 1.3." - "The Unicode Consortium announced today the release of new versions of the Common Locale Data Repository (CLDR 1.3) and the Locale Data Markup Language specification (LDML 1.3), providing key building blocks for software to support the world's languages. CLDR is by far the largest standard repository of locale data. This new release contains data for 296 locales: 96 languages and 130 territories. For the first time in CLDR, POSIX formatted data is also available. To support users in different languages, programs must not only use translated text, but must also be adapted to local conventions. These conventions differ by language or region and include the formatting of numbers, dates, times, and currency values, as well as support for differences in measurement units or text sorting order. Most operating systems and many application programs currently maintain their own repositories of locale data to support these conventions. But such data are often incomplete, idiosyncratic, or gratuitously different from program to program. In the age of the internet, software components must work together seamlessly, without the problems caused by these discrepancies. The CLDR project provides a general XML format, LDML, for the exchange of locale information used in application and system software development, combined with a public repository for a common set of locale data in that format. In this release, there are major additions to the CLDR data, to the LDML specification, and in implementation support. For more information about the CLDR project, with details about the new features in this release and the languages and territories supported, see the CLDR web site."

  • [March 31, 2005] "Version 4.1 of the Unicode Standard Released." - "The Unicode Consortium announced today the release of the latest version of the Unicode Standard, Version 4.1.0. This version adds 1,273 new characters, including those necessary to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts. In addition, there have been a number of significant additions and changes to the Unicode Character Database properties, which determine the behavior of characters in modern software. Unicode 4.1 adds two new Unicode Standard Annexes: UAX #31: Identifier and Pattern Syntax and UAX #34: Unicode Named Character Sequences, and makes significant changes to other Unicode Standard Annexes. UAX #31 is of particular interest as a result of the broader incorporation of Unicode in protocols and programming languages. Applications from programming languages to international domain names require stable mechanisms for distinguishing both identifiers and syntax characters, even as characters for additional languages are added to the Unicode Standard. The release of Unicode 4.1 will be soon followed by a new release of the Unicode Collation Algorithm, for language-sensitive sorting, searching, and matching; by Unicode Regular Expressions, setting the standard for handling Unicode characters in regular expressions; and by a new draft of Unicode Security Considerations, for dealing with security issues posed by the large number of visually-similar characters in Unicode... The Unicode Standard is a fundamental component of all modern software and information technology protocols. It provides a uniform, universal architecture and encoding for all languages of the world -- with over 96,000 characters currently encoded -- and is the basis for processing, storage, and seamless data interchange of text data worldwide. Unicode is required by modern standards such as XML, Java, C#, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, IDN, etc., and is the official way to implement ISO/IEC 10646..."

  • [October 09, 2003]   W3C Publishes Draft Guidelines for Authoring Internationalized XHTML and HTML.    The World Wide Web Consortium has issued an initial Working Draft for Authoring Techniques for XHTML & HTML Internationalization 1.0. Produced by the Guidelines, Education & Outreach Task Force (GEO) of the Internationalization Working Group, the document describes and illustrates authoring techniques for the creation of "internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3." Most of the techniques for completed document subsections are implemented in the latest versions of popular Web browsers, so readers can inspect the source code and observe the visual behaviors, where relevant. In this initial WD, sigla are represented for implementation support in three user agents (Internet Explorer 6, Netscape Navigator 7, and Opera 7). The document is organized according to "tasks that a developer of XHTML/HTML content may want to perform." Document sections at least partially complete include: Document structure and metadata; Character sets, character encodings and entities; Fonts; Specifying the language of content; Handling bidirectional text; Handling data that varies by locale. Subsequent versions of the document will cover authoring techniques relating to: Handling vertical text; Text formatting; Lists; Tables; Links; Objects; Images; Forms; Keyboard shortcuts; Writing source text; Navigation; File management; Supplying data for localization. The Working Draft is presented in a full (detail) view, collapsible outline view, and resource view. In the resource view, bibliographic citations are hyperlinked to relevant standards from W3C, IANA, IETF, Unicode Consortium, etc. Icons in the document margins help the reader switch between the detail, outline, and resource views. The W3C GEO Task Force "encourages feedback about the content of this document as well as participation in the development of the techniques by people who have experience creating Web content that conforms to internationalization needs."

  • [August 20, 2003] "[Unicode] Identifier and Pattern Syntax." By Mark Davis. Public review draft from the Unicode Technical Committee. Reference: Proposed Draft, Unicode Technical Report #31. Date: 2003-07-18. "This document describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax. It incorporates the Identifier section of Unicode 4.0 (somewhat reorganized) and a new section on the use of Unicode in patterns. As a part of the latter, it presents recommended new properties for addition to the Unicode Character Database. Feedback is requested both on the text of the new pattern section and on the contents of the proposed properties... A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in Unicode character-based parsers, a set of specifications is provided here as a recommended default for the definition of identifier syntax. These guidelines are no more complex than current rules in the common programming languages, except that they include more characters of different types. In addition, this document provides a proposed definition of a set of properties for use in defining stable pattern syntax: syntax that is stable over future versions of the Unicode Standard. There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel or ICU number formats, and many others. These patterns have been very limited in the past, and forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax. 
For forwards and backwards compatibility, it is very advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium made regarding completely stable identifiers, and the practice that is seen in XML 1.1. In particular, the consortium committed to not allocating characters suitable for identifiers in the range 2190..2BFF, which is being used by XML 1.1. With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. By using this policy, it preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results..." Note: See also the 2003-08-20 notice from Rick McGowan (Unicode, Inc.), said to be relevant to anyone dealing with programming languages, query specifications, regular expressions, scripting languages, and similar domains: "The Proposed Draft UTR #31: Identifier and Pattern Syntax will be discussed at the UTC meeting next week. Part of that document (Section 4) is a proposal for two new immutable properties, Pattern_White_Space and Pattern_Syntax. As immutable properties, these would not ever change once they are introduced into the standard, so it is important to get feedback on their contents beforehand. The UTC will not be making a final determination on these properties at this meeting, but it is important that any feedback on them is supplied as early in the process as possible so that it can be considered thoroughly. The draft is found [online] and feedback can be submitted as described there..."

  • [June 18, 2003]   Updated Unicode Technical Report Clarifies Characters not Suitable for Use With Markup.    A revised version of Unicode in XML and other Markup Languages has been published as Unicode Technical Report #20, Revision 7 and as W3C Note 13-June-2003. This revision reflects three principal changes: (1) the base version of the Unicode Standard for this document is Unicode Version 4.0, which creates some 1,226 new Unicode character assignments; (2) greater prominence is given to material in a new Section 3, "Characters Not Suitable for Use With Markup"; (3) a new Section 6 clarifies the appropriate uses of 66 non-character code points, or Unicode noncharacters. Section 3 discusses characters "which are unsuitable in the context of markup in XML/HTML and whose use is discouraged for one or more reasons; for example, they are deprecated in the Unicode Standard, they are unsupportable without additional data, they are difficult to handle because they are stateful, they are better handled by markup, or because of conflict with equivalent markup." For the character classes in question, the Technical Report provides a short description of semantics, the reason for inclusion of the characters in Unicode, clarification of the specific problems when used with markup, related areas where problems may occur (e.g., in plain text), what kind of markup to use instead of Unicode characters, and what different classes of software should do if the problematic characters are detected in a particular context.

  • [September 10, 2002] "Unicode: The Quiet Revolution." By Jim Felici. In The Seybold Report Volume 2, Number 10 (August 19, 2002), pages 11-15. ['Revolutions are supposed to be noisy [but] systematically, quietly, thoroughly, Unicode has changed the way every major operating system works, the way nearly every new document is created. It has put multilingual intelligence into Internet search engines... Most recall that Unicode has something to do with two-byte characters. While that was once true, it isn't any longer. This article looks at Unicode's practical impacts and the direction of the ongoing revolution.'] "...The people who create Web search engines can't embrace Unicode fast enough; for them, it's a revolution that couldn't come too soon. Unicode allows them to create a single search system that will work as well in China and Delhi as in Moscow and New York (not to mention working for New Yorkers in Beijing and Russians in Delhi). A single search can be multilingual, and the same search can be made from anywhere. Database vendors are equally enthusiastic, and for the same reasons; archiving and retrieval issues in repositories are essentially the same as they are on the Web. Nonstandard encodings create information ghettoes, where data can be concealed by virtue of the way it was written. Under the new regime, legacy encodings can be decoded one last time and all data converted into the lingua franca Unicode format. But at this point, Unicode hits a wall: language. It can match numbers to characters, but it's not Unicode's job to match characters to languages... A single code point may identify a particular character, but this says nothing about the language that character was chosen to express. Nor does it say anything about how that character should look, as single Han characters may also vary substantially in form from place to place... 
The ink that's used to write most standards isn't dry before various parties begin to tack on amendments, enhancements and personalizations... Interestingly, the opposite is happening with Unicode. For example, the employment of private use areas or PUAs -- [private use area] ranges of code points set aside for characters that haven't been made a part of the standard Unicode character set -- is being discouraged except in closed environments, simply because the code-point definitions aren't standard. 'Many application developers seem to be coming to the conclusion that PUA characters are more trouble than they're worth,' according to John Hudson at Tiro Typeworks in Vancouver, BC. 'Adobe, who have been using PUA assignments for many years, recently decided to draw back from this approach and try to abandon it completely. PUA code points are simply too unreliable.' Most common Latin typographic variants such as alternate ligatures have been given Unicode code points. The myriad alternate forms for many characters in typefaces such as ITC Avant Garde as well as in Asian ideographic languages are also accommodated, with the help of smart font formats, such as OpenType. More standardization by font vendors will translate into more accurate document imaging than ever before, with fewer and fewer exceptions to the rule. 'My gut feeling is that we are still on the learning curve,' says Ken Whistler, a Unicode founding father, now working in internationalization software at Sybase and as technical director of Unicode, Inc., 'but that the worst of the annoyances will be out of the way in the next five years.' Thomas Phinney, program manager for western fonts at Adobe Systems, agrees that the worst part of the switch to Unicode is behind us. 'Five years ago,' he says, 'we were perhaps one tenth of the way up the adoption curve, and now we're something like one-third of the way. Although I fully expect there to be significant holdout applications, even in five years, we'll be over the top hump of the curve. Unfortunately, that final tailing-off period will take a long time'..."

  • [May 01, 2002]   W3C I18N Working Group Publishes Last Call Working Draft for the WWW Character Model.    The W3C Internationalization Working Group has issued a second Last Call Working Draft specification defining a Character Model for the World Wide Web 1.0. The document is an Architectural Specification designed to provide "a common reference for interoperable text manipulation on the World Wide Web. Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions, building on the Universal Character Set, defined jointly by Unicode and ISO/IEC 10646. Some introductory material on characters and character encodings is also provided." The goal of the specification is to "facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access; one basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way." The W3C I18N Working Group invites comments on the specification through the end of the review period, May 31, 2002. "Due to the architectural nature of this document, it affects a large number of W3C Working Groups, but also software developers, content developers, and writers and users of specifications outside the W3C that have to interface with W3C specifications. Because review comments play an important role in ensuring a high quality specification, the WG encourages readers to review this Last Call Working Draft carefully." [Full context]

  • [April 04, 2002]   Unicode Consortium Publishes Unicode Standard Version 3.2.    A Proposed Draft Unicode Technical Report published by the Unicode Consortium earlier in 2002 has been advanced to an approved version 3.2 of the Unicode Standard. This edition of the Standard "includes the most extensive set of characters for mathematical and technical publishing yet defined. The Unicode Technical Committee and the Scientific and Technical Information eXchange (STIX) Project of the Scientific and Technical Publishers (STIPub) Consortium worked together over the past 5 years to identify over 1,600 new mathematical symbols and alphanumeric characters, more than doubling the number of characters with mathematical usage previously available. W3C's MathML integrates with developing Web technologies, and makes essential use of the Unicode character set. With the addition of four indigenous scripts of the Philippines, the Unicode Standard moves further towards full coverage of all living writing systems; version 3.2 is now fully synchronized with International Standard ISO/IEC 10646-1:2000, with its Amendment 1, and with ISO/IEC 10646-2:2001. The Unicode Standard is a major component in the globalization of e-business, as the marketplace continues to demand technologies that enhance seamless data interchange throughout companies' extended -- and often international -- network of suppliers, customers and partners. Unicode is the default text representation in XML, an important open standard being rapidly adopted throughout e-business technology." [Full context]

  • [November 05, 2001] "XML Internationalization FAQ." From Opentag.com. "You will find here answers to some of the most frequently asked questions about XML internationalization and localization, including XSL, CSS, and other XML-related technologies..."

  • [August 31, 2001] "Language Identifiers in the Markup Context: Language Tagging in Unicode." This section is part of a collection of resources relating to language tags/codes, language classification projects, and mechanisms for using language identifiers in markup.

  • [December 04, 2001]   W3C Announces Internationalization Workshop.    The World Wide Web Consortium has issued a Call for Participation in a W3C Internationalization Workshop, to be held February 1, 2002 in Washington DC in conjunction with the 20th International Unicode Conference. W3C has decided to strengthen its internationalization work, and will prepare relevant I18N guidelines for XML formats, document creators, webmasters, tools developers, etc. The goal of this workshop is to reevaluate the I18N Activity and to prepare the rechartering of the I18N Activity and Working Groups/Interest Groups by surveying the problems and showcasing existing solutions, raising awareness of the issues, and providing a forum for discussion. The workshop is an open event, but space limitations dictate a limit of forty-five (45) participants; position papers should be submitted for review by January 10, 2002. This open-event workshop is part of the W3C's Internationalization Activity, and supports W3C's commitment "to make the Web accessible to people around the world by promoting technologies that take into account the vast differences of language, script, and culture of users on all continents." [Full context]

  • [December 10, 2001] "Internationalized Resource Identifiers (IRI)." By Larry Masinter (Adobe Systems Incorporated) and Martin Dürst (W3C/Keio University). IETF INTERNET-DRAFT. Reference: 'draft-masinter-url-i18n-08.txt'. November 20, 2001; expires May 2002. Abstract: "This document defines a new protocol element, an Internationalized Resource Identifier (IRI). An IRI is a sequence of characters from the Universal Character Set (ISO 10646). A mapping from IRIs to URIs (RFC 2396) is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources. Defining a new protocol element was preferred to extending or changing the definition of URIs to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines for the use and deployment of IRIs in various protocols, formats, and software components that now deal with URIs are provided." Change from v 07: allows 'space' and a few other characters in IRIs to be consistent with XML, XLink, XML Schema, etc. Design considerations: "IRIs are designed to work together with recent recommendations on URI syntax (RFC 2718). In order to be able to use an IRI (or IRI reference in place of an URI (or URI reference) in a given protocol context, the following conditions have to be met: (a) The protocol or format carrying the IRI has to be able to represent the non-ASCII characters in the IRI, either natively or by some protocol- or format-specific escaping mechanism (e.g., numeric character references in XML). (b) The protocol or format element used has to have been designated to carry IRIs (e.g., by designating it to be of type anyURI in XMLSchema)... Please note that some formats already IRIs, although they use different terminology. HTML 4.0 defines the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0, XLink, and XML Schema and specifications based upon them allow IRIs. 
Also, it is expected that all relevant new W3C formats and protocols will be required to handle IRIs (see the CharMod document, "Character Model for the World Wide Web 1.0")..." Martin Dürst wrote 2001-12-10 with reference to 'draft-masinter-url-i18n-08.txt': "[this ID] about the internationalization of URIs (called IRIs) has recently been updated and published. This has been around for a long time, but we plan to move ahead with it in the very near future. Please have a look at the document, and send me any comments that you have soon..." [cache]
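
The IRI-to-URI mapping mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the mapping amounts to writing each non-ASCII character as the percent-escaped bytes of its UTF-8 encoding; the function name iri_to_uri and the example address are hypothetical, and special handling of host names or of already-escaped sequences is deliberately left out.

```python
def iri_to_uri(iri: str) -> str:
    """Sketch: percent-escape the UTF-8 bytes of every non-ASCII character.

    ASCII characters (including existing %XX escapes) are copied unchanged.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 0x80:
            parts.append(ch)  # plain ASCII passes through untouched
        else:
            # one %XX triplet per byte of the character's UTF-8 encoding
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


if __name__ == "__main__":
    print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
```

Because ASCII input is returned unchanged, an existing URI passes through such a function unmodified, which is consistent with treating IRIs as a separate protocol element layered on top of URIs.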

  • [February 18, 2002] "Unicode in XML and other Markup Languages." Unicode Technical Report #20. Revised version [#6]. W3C Note 18-February-2002. Authored by Martin Dürst and Asmus Freytag. Version URLs: [Unicode] http://www.unicode.org/unicode/reports/tr20/tr20-6.html; [W3C] http://www.w3.org/TR/2002/NOTE-unicode-xml-20020218. Latest version URLs: http://www.unicode.org/unicode/reports/tr20/, http://www.w3.org/TR/unicode-xml/. "This document contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML. The Technical Report is published jointly by the Unicode Technical Committee and by the W3C Internationalization Working Group/Interest Group in the context of the W3C Internationalization Activity. The base version of the Unicode Standard for this document is Version 3.2 [see following bibliographic entry]."

  • [February 18, 2002] Proposed Draft Unicode Technical Report #28. Unicode 3.2. Unicode version 3.2.0. By Members of the Editorial Committee. Date 2002-1-21. Version URL: http://www.unicode.org/unicode/reports/tr28/tr28-2. ['This document defines Version 3.2 of the Unicode Standard. This draft is for review with the intention of it becoming a Unicode Standard Annex. The document has been made available for public review as a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.'] "Unicode 3.2 is a minor version of the Unicode Standard. It overrides certain features of Unicode 3.1, and adds a significant number of coded characters... The primary feature of Unicode 3.2 is the addition of 1016 new encoded characters. These additions consist of several Philippine scripts, a large collection of mathematical symbols, and small sets of other letters and symbols. All of the newly encoded characters in Unicode 3.2 are additions to the Basic Multilingual Plane (BMP). Complete introductions to the newly encoded scripts and symbols can be found in Article IV, Block Descriptions... Additional Features of Unicode 3.2: Unicode 3.2 also features amended contributory data files, to bring the data files up to date against the expanded repertoire of characters..."

  • [December 31, 2001] "Character Model for the World Wide Web 1.0." W3C Working Draft 20-December-2001. Interim Working Draft. Version URL: http://www.w3.org/TR/2001/WD-charmod-20011220. Latest version URL: http://www.w3.org/TR/charmod. Edited by Martin J. Dürst (W3C), François Yergeau (Alis Technologies), Richard Ishida (Xerox, GKLS), Misha Wolf (Reuters Ltd.), Asmus Freytag (ASMUS, Inc.), and Tex Texin (Progress Software Corp.). "This Architectural Specification provides authors of specifications, software developers, and content developers with a common reference for interoperable text manipulation on the World Wide Web. Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions, building on the Universal Character Set, defined jointly by Unicode and ISO/IEC 10646. Some introductory material on characters and character encodings is also provided." [cache]

  • [October 01, 2001] "Character Model for the World Wide Web 1.0." W3C Working Draft 28-September-2001. Produced by the W3C Internationalization Working Group. Edited by Martin J. Dürst, François Yergeau, Misha Wolf, Asmus Freytag, Tex Texin, and Richard Ishida. Latest version URL: http://www.w3.org/TR/charmod. "The goal of this document is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of Universal Access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way... This document defines some conformance requirements for software developers and content developers that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in other W3C specifications. The character model described in this document provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a more international Web. Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions. Some introductory material on characters and character encodings is also provided. Topics not addressed or barely touched include collation (sorting), fuzzy matching and language tagging. Some of these topics may be addressed in a future version of this specification. At the core of the model is the Universal Character Set (UCS), defined jointly by The Unicode Standard and ISO/IEC 10646. In this document, Unicode is used as a synonym for the Universal Character Set. 
The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world..." See also the mailing list archive.
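
The "early uniform normalization" and "string identity matching" topics named in both Character Model drafts can be illustrated with a short Python sketch. It assumes Unicode Normalization Form C (NFC) as the normalization form; the helper name identical_after_nfc is hypothetical.

```python
import unicodedata


def identical_after_nfc(a: str, b: str) -> bool:
    """Compare two strings code point by code point after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
    print(precomposed == decomposed)                     # raw comparison
    print(identical_after_nfc(precomposed, decomposed))  # normalized comparison
```

Normalizing text early, when it is created, means that later consumers can use the cheap raw comparison instead of normalizing at every match.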

  • [June 20, 2001] "XML Blueberry Requirements." W3C Working Draft 20-June-2001. Edited by John Cowan (Reuters). Latest version URL: http://www.w3.org/TR/xml-blueberry-req. Abstract: "This document lists the design principles and requirements for the Blueberry revision of the XML Recommendation, a limited revision of XML 1.0 being developed by the World Wide Web Consortium's XML Core Working Group solely to address character set issues." Detail: "The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2001, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies has not remained static, evolving from version 2.0 to version 3.1. Characters present in Unicode 3.1 but not in Unicode 2.0 may be used in XML character data, but are not allowed in XML names such as element type names, attribute names, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0. As a result, fully native-language XML markup is not possible in at least the following languages: Amharic, Burmese, Canadian aboriginal languages, Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian (traditional script), Oromo, Syriac, Tigre, Yi. In addition, Chinese, Japanese, Korean (Hangul script), and Vietnamese can make use of only a limited subset of their complete character repertoires. In addition, XML 1.0 attempts to adapt to the line-end conventions of various modern operating systems, but discriminates against the convention used on IBM and IBM-compatible mainframes. 
XML 1.0 documents generated on mainframes must either violate the local line-end conventions, or employ otherwise unnecessary translation phases before and after XML parsing and generation. A new XML version, rather than a set of errata to XML 1.0, is being created because the change affects the definition of well-formed documents: XML 1.0 processors must continue to reject documents that contain new characters in XML names or new line-end conventions. It is presumed that the distinction between XML 1.0 and XML Blueberry will be indicated by the XML declaration..." See the 'www-xml-blueberry-comments' mailing list.
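
The line-end issue can be sketched in Python as follows. XML 1.0 normalizes CR LF pairs and lone CR characters to LF before parsing; the sketch assumes that the mainframe line-end character in question is NEL (U+0085), which the quoted text does not name, and the function names are hypothetical.

```python
NEL = "\u0085"  # assumption: the mainframe line-end character is NEL


def normalize_xml10(text: str) -> str:
    """XML 1.0 line-end handling: CR LF and lone CR both become LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_with_nel(text: str) -> str:
    """Hypothetical extension: additionally treat a lone NEL as a line end."""
    return normalize_xml10(text).replace(NEL, "\n")


if __name__ == "__main__":
    sample = "first" + NEL + "second\r\nthird"
    print(repr(normalize_xml10(sample)))     # NEL survives as ordinary data
    print(repr(normalize_with_nel(sample)))  # NEL is folded into LF
```

Under the XML 1.0 rule the NEL character remains ordinary character data, which is why mainframe-generated documents need a translation step before and after parsing.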

  • [May 10, 2001] "Unicode Character Database (UCD) in XML Format." Prepared by Mark Davis. From the posting to 'unicode@unicode.org' 2001-05-10, 'Subject: UCD in XML': "Several people asked me over the last month about the XML version of the Unicode character database that I presented at last November's UTC meeting. I posted it at http://www.macchiato.com/utc/UCD.zip, containing two files: UCD.xml and UCD-Notes.htm. Caveats: (1) I regenerated the data with Unicode 3.1 data. However, (a) I haven't done more than spot-check the results, and (b) the format differs somewhat from what is documented in the notes; (2) I still have to comment out characters FFF9..FFFD, and all surrogates, so that people can read the file with Internet Explorer (I do wish they would use a conformant XML parser). Also, note that IE takes quite a while to load the file... Format: The Unicode blocks are provided as a list of <block .../> elements, with attributes providing the start, end, and name. Each assigned code point is a <e .../> element, with attributes supplying specific properties. The meaning of the attributes is specified below. There is one exception: large ranges of code points  for characters such as Hangul Syllables are abbreviated by indicating the start and end of the range. Because of the volume of data, the attribute names are abbreviated. A key explains the abbreviations, and relates them to the fields and values of the original UCD semicolon-delimited files. With few exceptions, the values in the XML are directly copied from data in the original UCD semicolon-delimited files. Those exceptions are described below... Numeric character references (NCRs) are used to encode the Unicode code points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see http://www.w3.org/TR/REC-xml#charsets), or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are represented by '#xX;', where X is a hex number. 
Attribute Abbreviations: To reduce the size of the document, the following attribute abbreviations are used. If an attribute is missing, that means it gets a default value. The defaults are listed in parentheses below. If there is no specific default, then a missing attribute should be read as N/A (not applicable). A default with '=' means the default is the value of another field (recursively!). Thus if the titlecase attribute is missing, then the value is the same as the uppercase. If that in turn is missing, then the value is the same as the code point itself. For a description of the source files, see UnicodeCharacterDatabase.html. That file also has links to the descriptions of the fields within the files. Since the PropList values are so long, they will probably also be abbreviated in the future." [cache]
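
The code point encoding rule described in the posting can be sketched in Python. The sketch uses the Char production of XML 1.0 to decide which code points may appear in a document at all; the function names, and the choice of unpadded uppercase hex, are assumptions rather than details of the actual UCD.xml file.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def encode_code_point(cp: int) -> str:
    """Sketch of the scheme: an NCR where possible, else the plain '#xX;' form.

    TAB, LF and CR are legal in XML but would not be visibly distinct,
    so they also get the plain form.
    """
    if is_xml10_char(cp) and cp not in (0x9, 0xA, 0xD):
        return f"&#x{cp:X};"
    return f"#x{cp:X};"


if __name__ == "__main__":
    for cp in (0x41, 0x9, 0xD800, 0xFFFE, 0x10400):
        print(f"U+{cp:04X} -> {encode_code_point(cp)}")
```

Surrogates and U+FFFE/U+FFFF fall outside the Char production, so a conformant parser rejects them even when written as numeric character references.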

  • [March 30, 2001]   Unicode Technical Committee Publishes Final Version of The Unicode Standard, Version 3.1.    Mark Davis, President of the Unicode Board of Directors, announced the 'final version' release of The Unicode Standard, Version 3.1. The primary feature of Unicode 3.1 is the addition of 44,946 new encoded characters. Together with the 49,194 already existing characters in Unicode 3.0, that comes to a grand total of 94,140 encoded characters in Unicode 3.1. The new characters cover several historic scripts, several sets of symbols, and a very large collection of additional CJK ideographs. Unicode 3.1 also features new character properties, and assignments of property values for the much expanded repertoire of characters. All errata and corrigenda to Unicode 3.0 and Unicode 3.0.1 are included in this specification, together with significant enhancements of the Unicode conformance clauses and additions to other sections of the standard. The base documentation for Unicode 3.1 can be found online at the Unicode web site. [Full context]

  • [January 17, 2001] Unicode 3.1 Published Online as Proposed Draft Unicode Technical Report. The Unicode Consortium has published Proposed Draft Unicode Technical Report #27: Unicode 3.1. Reference: Version 1.0, 'http://www.unicode.org/unicode/reports/tr27/tr27-1', 2001-01-17; edited by Mark Davis, Michael Everson, Asmus Freytag, Lisa Moore, et al. Document summary: "This document defines Version 3.1 of the Unicode Standard. It overrides certain features of Unicode 3.0.1, and adds a large number of coded characters. This draft is for review with the intention of it becoming a Unicode Standard Annex." The specification has been approved by the Unicode Technical Committee for public review; it is a 'Proposed Draft', to be taken as "a work in progress." Details: "The primary feature of Unicode 3.1 is the addition of 44,946 new encoded characters. These characters cover several historic scripts, several sets of symbols, and a very large collection of additional CJK ideographs. For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. Unicode 3.1 and 10646-2 define three new supplementary planes. Unicode 3.1 also features corrected contributory data files, to bring the data files up to date against the much expanded repertoire of characters. All errata and corrigenda to Unicode 3.0 and Unicode 3.0.1 are included in this specification. Major corrigenda and other changes having a bearing on conformance to the standard are listed in Article 3, Conformance. Other minor errata are listed in Article 5, Errata. Most notable among the corrigenda to the standard is a tightening of the definition of UTF-8, to eliminate a possible security issue with non-shortest-form UTF-8." The TR provides charts which contain the characters added in Unicode 3.1. 
They are shown together with the characters that were part of Unicode 3.0. New characters are shown on a yellow background in these code charts. They include: (1) Greek and Coptic; (2) Old Italic; (3) Gothic; (4) Deseret; (5) Byzantine Musical Symbols; (6) Musical Symbols; (7) Mathematical Alphanumeric Symbols; (8) CJK Unified Ideographs Extension B; (9) CJK Compatibility Ideographs Supplement; (10) Tag Characters. Note Section '13.7 Tag Characters', which provides clarification on the restricted use of Tag Characters U+E0000-U+E007F: "The characters in this block provide a mechanism for language tagging in Unicode plain text. The characters in this block are reserved for use with special protocols. They are not to be used in the absence of such protocols, or with protocols that provide alternate means for language tagging, such as markup. The requirement for language information embedded in plain text data is often overstated...This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based string tags using characters which can be strictly separated from ordinary text content characters in Unicode. These tag characters can be embedded by protocols into plain text. They can be identified and/or ignored by implementations with trivial algorithms because there is no overloading of usage for these tag characters--they can only express tag values and never textual content itself. In addition to these 95 characters, one language tag identification character and one cancel tag character are also encoded. The language tag identification character identifies a tag string as a language tag; the language tag itself makes use of RFC 1766 language tag strings spelled out using the tag characters from this block...Because of the extra implementation burden, language tags should be avoided in plain text unless language information is required and it is known that the receivers of the text will properly recognize and maintain the tags. 
However, where language tags must be used, implementers should consider the following implementation issues involved in supporting language information with tags and decide how to handle tags where they are not fully supported. This discussion applies to any mechanism for providing language tags in a plain text environment...Language tags should also be avoided wherever higher-level protocols, such as a rich-text format, HTML or MIME, provide language attributes. This practice prevents cases where the higher-level protocol and the language tags disagree." See Unicode in XML and other Markup Languages [Unicode Technical Report #20 == W3C Note 15-December-2000].
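
Two technical points in the Unicode 3.1 summary, the tightened UTF-8 definition and the characters at U+10000 or higher, can be illustrated with a short Python sketch. The function names are hypothetical; the sketch relies on Python's built-in strict UTF-8 decoder to reject non-shortest-form sequences, and the surrogate arithmetic is the standard UTF-16 calculation.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if the bytes decode under a strict (shortest-form only) decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # 0xC0 0xAF is a two-byte, non-shortest-form spelling of "/" (U+002F).
    print(is_strict_utf8(b"/"), is_strict_utf8(b"\xc0\xaf"))
    # U+10400 lies in the Deseret block, beyond the BMP.
    print([hex(unit) for unit in to_surrogate_pair(0x10400)])
```

The security concern is that a filter checking for the byte 0x2F would miss the two-byte spelling if a lenient decoder later accepted it as "/".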

  • [December 04, 2000] Mark Davis posted an announcement for the publication of the Unicode Character Mapping Markup Language (CharMapML) as a full Technical Report. Reference: Unicode Technical Report #22, by Mark Davis (with contributions from Kent Karlsson, Ken Borgendale, Bertrand Damiba, Mark Leisher, Tony Graham, Markus Scherer, Peter Constable, Martin Duerst, Martin Hoskin, and Ken Whistler). This Unicode technical report "specifies an XML format for the interchange of mapping data for character encodings. It provides a complete description for such mappings in terms of a defined mapping to and from Unicode, and a description of alias tables for the interchange of mapping table names." The Unicode Technical Committee "intends to continue development of this TR to also encompass complex mappings such as 2022 and glyph-based mappings." Background: "The ability to seamlessly handle multiple character encodings is crucial in today's world, where a server may need to handle many different client character encodings covering many different markets. No matter how characters are represented, servers need to be able to process them appropriately. Unicode provides a common model and representation of characters for all the languages of the world. Because of this, Unicode is being adopted by more and more systems as the internal storage processing code. Rather than trying to maintain data in literally hundreds of different encodings, a program can translate the source data into Unicode on entry, process it as required, and translate it into a target character set on request. Even where Unicode is not used as a process code, it is often used as a pivot encoding. Data can be converted first to Unicode and then into the eventual target encoding. This requires only a hundred tables, rather than ten thousand. Whether or not Unicode is used, it is ever more vital to maintain the consistency of data across conversions between different character encodings. 
Because of the fluidity of data in a networked world, it is easy for it to be converted from, say, CP930 on a Windows platform, sent to a UNIX server as UTF-8, processed, and converted back to CP930 for representation on another client machine. This requires implementations to have identical mappings for a character encoding, no matter what platform they are working on. It also requires them to use the same name for the same encoding, and different names for different encodings. This is difficult to do unless there is a standard specification for the mappings so that it can be precisely determined what the encoding actually maps to. This technical report provides such a standard specification for the interchange of mapping data for character encodings. By using this specification, implementations can be assured of providing precisely the same mappings as other implementations on different platforms." The report references several related data files, including (1) DTD file for the Character Mapping Data format [CharacterMapping.dtd]; (2) DTD file for the Character Mapping Alias format [CharacterMappingAliases.dtd]; (3) Sample mapping file [SampleMappings.xml]; (4) Sample alias file [SampleAliases.xml]; (5) Sample alias file #2 [SampleAliases2.xml].
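
The pivot-encoding argument in the background text can be sketched in Python. The table counts follow the report's own arithmetic (one table per ordered pair of encodings versus one table per encoding); the function names are hypothetical, and Python's built-in codecs stand in for the mapping tables.

```python
def table_counts(encodings: int) -> dict:
    """Tables needed for direct pairwise conversion versus a Unicode pivot."""
    return {"pairwise": encodings * encodings, "pivot": encodings}


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert via Unicode: decode from the source, encode to the target."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    print(table_counts(100))
    latin1_bytes = "caf\u00e9".encode("latin-1")
    print(transcode(latin1_bytes, "latin-1", "utf-8"))
```

Each encoding only needs one mapping to and from Unicode, so adding a new encoding adds one table instead of one per existing encoding.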

  • [October 20, 2000] "Character Mapping Tables." By Mark Davis. Version 2.1, 2000-08-31. Draft Unicode Technical Report #22. Reviewed by Kent Karlsson, Ken Borgendale, Bertrand Damiba, Mark Leisher, Tony Graham, and Ken Whistler. From the Unicode Technical Reports series. "Summary: This document specifies an XML format for the interchange of mapping data for character encodings. It provides a complete description for such mappings in terms of a defined mapping to and from Unicode." Background: "The ability to seamlessly handle multiple character encodings is crucial in today's world, where a server may need to handle many different client character encodings covering many different markets. No matter how characters are represented, servers need to be able to process them appropriately. Unicode provides a common model and representation of characters for all the languages of the world. Because of this, Unicode is being adopted by more and more systems as the internal storage processing code. Rather than trying to maintain data in literally hundreds of different encodings, a program can translate the source data into Unicode on entry, process it as required, and translate it into a target character set on request. Even where Unicode is not used as a process code, it is often used as a pivot encoding. Rather than requiring ten thousand tables to map each of a hundred character encodings to one another, data can be converted first to Unicode and then into the eventual target encoding. This requires only a hundred tables, rather than ten thousand. Whether or not Unicode is used, it is ever more vital to maintain the consistency of data across conversions between different character encodings. Because of the fluidity of data in a networked world, it is easy for it to be converted from, say, CP930 on a Windows platform, sent to a UNIX server as UTF-8, processed, and converted back to CP930 for representation on another client machine. 
This requires implementations to have identical mappings for different character encodings, no matter what platform they are working on. It also requires them to use the same name for the same encoding, and different names for different encodings. This is difficult to do unless there is a standard specification for the mappings so that it can be precisely determined what the encoding actually maps to. This technical report provides such a standard specification for the interchange of mapping data for character encodings. By using this specification, implementations can be assured of providing precisely the same mappings as other implementations on different platforms." Example of XML Formats: "A character mapping specification file starts with the following lines [...] There is a difference between the encoding of the XML file, and the encoding of the mapping data. The encoding of the file can be any valid XML encoding. Only the ASCII repertoire of characters is required in the specification of the mapping data, but comments may be in other character encodings. The example below happens to use UTF-8... The mapping names table is a separate XML file that provides an index of names for character mapping tables. For each character mapping table, they provide display names, aliases, and fallbacks... the main part of the table provides the assignments of mappings between byte sequences and Unicode characters." See the example XML DTD from the exposition. [cache]
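
Why the report insists on "different names for different encodings" can be seen in a small Python sketch. It compares how one byte is assigned under two closely related single-byte encodings, using Python's own codec names latin-1 and cp1252; the helper name is hypothetical and the encoding pair is chosen for illustration only, not taken from the report.

```python
def mapped_code_point(byte_value: int, encoding: str) -> str:
    """Return the Unicode code point a single byte maps to, as 'U+XXXX'."""
    char = bytes([byte_value]).decode(encoding)
    return f"U+{ord(char):04X}"


if __name__ == "__main__":
    for byte_value in (0x41, 0x80):
        print(
            hex(byte_value),
            mapped_code_point(byte_value, "latin-1"),
            mapped_code_point(byte_value, "cp1252"),
        )
```

The two encodings agree on 0x41 but disagree on 0x80 (a C1 control versus the euro sign), so data labeled with the wrong name is silently altered on conversion.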

  • [June 23, 2000] A fourth revised draft of Unicode in XML and other Markup Languages has been issued by the Unicode Consortium and W3C. Reference: DRAFT Unicode Technical Report #20; W3C Working Draft 23-June-2000. By Martin Dürst and Asmus Freytag. This W3C Working Draft is being developed jointly by the W3C Internationalization Working Group/Interest Group in the context of the W3C Internationalization Activity and by the Unicode Technical Committee. The revised draft document "contains guidelines on the use of the Unicode Standard Version 3.0 in conjunction with markup languages such as XML. . . it now covers all affected characters in the Unicode Standard, Version 3.0." Background: "The Unicode Standard is the universal character set. Its primary goal is to provide an unambiguous encoding of the content of plain text, ultimately covering all languages in the world. Currently in its third major version, Unicode contains a large number of characters covering most of the currently used scripts in the world. It also contains additional characters for interoperability with older character encodings, and characters with control-like functions included primarily for reasons of providing unambiguous interpretation of plain text. Unicode provides specifications for use of all of these characters. For document and data interchange, the Internet and the World Wide Web are more and more making use of marked-up text such as HTML and XML. In many instances, markup provides the same, or essentially similar features to those provided by format characters in the Unicode Standard for use in plain text. Another special character category provided by Unicode are compatibility characters. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language. Formatting characters are discussed in chapters 2 and 3, compatibility characters in chapter 4. 
The issues of using Unicode characters with marked-up text depend to some degree on the rules of the markup language in question and the set of elements it contains. In a narrow sense, this document concerns itself only with XML, and to some extent HTML. However, much of the general information presented here should be useful in a broader context, including some page layout languages. . . There are several general points to consider when looking at the interaction between character encoding and markup: (1) Linearity of text vs. hierarchy of markup structure; (2) Overlap of control codes and markup semantics; (3) Coincidence of semantic markup and functions; (4) Extensibility of markup; (5) Markup vs. Styling. . ." See the following entry.

  • [April 30, 2000] Unicode in XML and other Markup Languages. DRAFT Unicode Technical Report #20 (3). W3C Working Draft XX-xxxx-2000. Unicode Revision 3, TR20-3 [viz., 'http://www.unicode.org/unicode/reports/tr20/tr20-3.html'] Unicode date: 2000-04-27. [Edited] By Martin Dürst (mduerst@w3.org) and Asmus Freytag (asmus@unicode.org). The revision 3 changes are substantial: "Added sections 2.1-2.6 (MJD), sections 3.1-3.5, and 3.8, as well as sections 4.4-4.6 and 8 (AF). Edited text for publication as DRAFT Unicode Technical Report (AF)." Summary: "This document contains guidelines on the use of the Unicode Standard Version 3.0 in conjunction with markup languages such as XML. The Unicode Standard [Unicode] is the universal character set. Its primary goal is to provide an unambiguous encoding of the content of plain text, ultimately covering all languages in the world. Currently in its third major version, Unicode contains a large number of characters covering most of the currently used scripts in the world. It also contains additional characters for interoperability with older character encodings, and characters with control-like functions included primarily for reasons of providing unambiguous interpretation of plain text. Unicode provides specifications for use of all of these characters. For document and data interchange, the Internet and the World Wide Web are more and more making use of marked-up text. In many instances, markup provides the same, or essentially similar features to those provided by formatting characters in the Unicode Standard for use in plain text. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language. The issues of using Unicode characters with marked-up text depend to some degree on the rules of the markup language in question and the set of elements it contains. 
In a narrow sense, this document concerns itself only with XML and to some extent HTML, however, much of the general information presented here should be useful in a broader context, including some page layout languages... There are several general points to consider when looking at the interaction between character encoding and markup: (1) Linearity of text vs. hierarchy of markup structure; (2) Overlap of control codes and markup semantics; (3) Coincidence of semantic markup and functions; (4) Extensibility of markup; (5) Markup vs. Styling." [cache]
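
The overlap between Unicode formatting characters and markup can be made concrete with a small Python sketch that flags format characters in text destined for XML. Using the general category Cf as the test is an assumption made here for illustration; the Technical Report itself discusses specific characters individually, and some Cf characters remain appropriate in marked-up text. The function name is hypothetical.

```python
import unicodedata


def find_format_characters(text: str) -> list:
    """List (index, 'U+XXXX') for every character with general category Cf."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


if __name__ == "__main__":
    # U+202E RIGHT-TO-LEFT OVERRIDE duplicates what bidi markup can express.
    print(find_format_characters("abc\u202Edef"))
```

A report like this is only a starting point for review: whether a flagged character should be replaced by markup depends on the markup language in use.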

  • [September 28, 1999] Unicode Technical Committee and W3C Publish Unicode in XML and other Markup Languages. Unicode in XML and other Markup Languages is "a W3C Working Draft worked on jointly by the W3C Internationalization Working Group/Interest Group and the Unicode Technical Committee." References: Proposed DRAFT Unicode Technical Report #20, Revision 2 == W3C Working Draft 28-September-1999. Unicode URL: www.unicode.org/unicode/reports/tr20/tr20-2.html. By Martin Dürst, Mark Davis, and Asmus Freytag. The working draft "contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML. The material in this draft is still in a rather early stage. Currently the draft shows the approximate range of intended coverage (e.g., in terms of which kinds of characters will be addressed, and what kind of information that is intended to be provided for each kind), while large parts still need more work and discussion. It is not exactly clear yet what the exact proposal for each character may be, and how this document will be related to other W3C specifications." Background to the joint TR: "The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons. It also provides specifications for use of these characters. For document and data interchange, the Internet and the World Wide Web are more and more making use of marked-up text. In many instances, markup provides the same, or essentially similar features to those provided by formatting characters in the Unicode Standard for use in plain text. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language. . ." 
The document is to be understood within the framework of the standard and other technical reports.

  • XML Japanese Profile. W3C Note 14-April-2000. Latest version URL: http://www.w3.org/TR/japanese-xml. Edited by MURATA Makoto (Fuji Xerox Information Systems Co.,Ltd). With contributions by the INSTAC XML S-WG and others: KOMACHI Yushi (Panasonic), KAWAMATA Akira (Piedey), HIYAMA Masayuki (Hiyama Office), UCHIYAMA Mitsukazu (Toshiba), KAMIMURA Keisuke (GLOCOM), OKUI Yasuhiro (Unitec Corporation), IMAGO Satosi (RICOH), HANADA Takako [Translator], Rick JELLIFFE (Academia Sinica), François YERGEAU (Alis Technologies). Abstract: "XML Japanese Profile addresses the issues of using Japanese characters in XML documents. In particular, ambiguities in converting existing Japanese charsets to Unicode are clearly pointed out." XML Japanese Profile was originally published by Japanese Standards Association (JSA) in the Japanese language. It is not a standard but rather a technical report, which is intended to encourage public discussion, promote consensus among relevant parties, and eventually become a Japanese Industrial Standard (JIS), if appropriate. JIS TR X 0015 was developed by the XML special working group (XML SWG) of Information Technology Research and Standardization Center (INSTAC), JSA. This specification was created by first translating JIS TR X 0015 and then revising it on the basis of comments from some I18N experts. The original specification, JIS TR X 0015, will be accordingly revised and republished by JSA in the near future. The XML SWG intends to keep this document and JIS TR X 0015 in sync...
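
The kind of conversion ambiguity the abstract refers to can be illustrated with a short Python sketch. It decodes the same two-byte Shift_JIS sequence with two of Python's built-in codecs, shift_jis and cp932; the byte sequence is chosen here for illustration and is not quoted from the Profile, and the helper name is hypothetical.

```python
def decode_both(data: bytes) -> tuple:
    """Decode the same bytes with two Shift_JIS-family codecs."""
    return data.decode("shift_jis"), data.decode("cp932")


if __name__ == "__main__":
    jis_based, vendor_based = decode_both(b"\x81\x60")
    # The two results are different Unicode characters for the same bytes.
    print(f"U+{ord(jis_based):04X}", f"U+{ord(vendor_based):04X}")
```

A document converted to Unicode with one table and converted back with the other may fail to round-trip, which is the practical risk behind such ambiguities.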

  • [May 24, 2000] Unicode: A Primer. By Tony Graham. Web developers and programmers now have access to an authoritative and well-written guide to Unicode, thanks to the recent publication of Unicode: A Primer. Written by Tony Graham (Mulberry Technologies), Unicode: A Primer "is the first book devoted to the Unicode Standard Version 3.0 and its applications (other than the standard itself)." The endorsement of the book by Rick McGowan, a Unicode Consortium Technical Director, speaks volumes: "For developers who plan to use the Unicode Standard, this is the best companion book so far." The Unicode standard, as described by Tony Graham on his Unicode web site, "is a character encoding standard published by Unicode Consortium. Unicode is designed to include all of the major scripts of the world in a simple and consistent manner. The Unicode Standard, Version 3.0, defines 49,194 characters from over 90 scripts. It covers alphabetic, syllabic, and ideographic scripts, including Latin scripts, Greek, Cyrillic, Thai, ideographs unified from the scripts of China, Japan, and Korea, and Hangul characters used for writing Korean. The Unicode Standard also defines properties of the characters and algorithms for use in implementations of the standard. Every major operating system, many programming languages, and many applications support the Unicode Standard." The new guide to Unicode implementation is a book that needed to be written. Tony Graham is eminently qualified to be its author: he has worked intimately with Unicode and other character encoding standards since 1994, and has written several key articles on Unicode. Part I of Unicode: A Primer includes "Introducing Unicode and ISO/IEC 10646" (a first look at the Unicode Standard, ISO/IEC 10646, and the Unicode Consortium) and "Unicode Design Basis and Principles." Part II (Chapters 3-8) gets to the heart of Unicode and related materials standardized by the Unicode Consortium. 
It provides three views of the structure of the Unicode Standard (by character block, by the files in the Unicode Character Database, and by the ISO/IEC 10646 view of the Universal Character Set); also: summaries of the features of the UCS-4, UCS-2, UTF-16, UTF-7, UTF-8, UTF-EBCDIC, and UTF-32 encodings and of endianness, transcoding, and the Standard Compression Scheme for Unicode (SCSU); an overview of the properties that a single character can have; things you need to know when working with sequences of characters; descriptions of the principles that guided encoding of the CJK ideographs and Korean Hangul characters in the Unicode Standard; conformance requirements for the Unicode Standard and ISO/IEC 10646, plus details of how to submit new scripts. Part III explains the use of the Unicode standard, particularly in Internet applications. The author includes descriptions and sample programs demonstrating Unicode support in nine programming languages. The book also has four valuable appendices (tables; descriptions of each of the character blocks in Unicode 3.0; information about the Unicode Consortium, versions of the Unicode Standard, Unicode Technical Reports, and Unicode conferences; tables of ISO/IEC 10646 amendments, blocks, and subsets), glossary, index, and bibliography. The book's complete Table of Contents, together with links to Unicode resources, is published on the companion Web site. Publication details: Unicode: A Primer, Foster City, CA: [M&T Books, An imprint of] IDG Books Worldwide, 2000. ISBN: 0-7645-4625-2. lii + 476 pages.

  • [April 20, 2000] "Unicode: What is it, and how do I use it?" By Tony Graham. In Markup Languages: Theory and Practice [ISSN: 1099-6622]. Volume 1, Number 4 (Fall 1999), pages 75-102. "The rationale for Unicode and its design goals and detailed design principles are presented. The correspondence between Unicode and ISO/IEC 10646 is discussed, the scripts included or planned for inclusion in the two character set standards are listed. Some products that support Unicode and some applications that require Unicode are listed, then examples of how to specify Unicode characters in a variety of applications are given. Use of Unicode in SGML and XML applications is discussed, and the paper concludes with descriptions of the character encodings used with Unicode and ISO/IEC 10646, plus sources of further information are listed." For other articles in Markup Languages: Theory and Practice Volume 1, Issue 4 (Fall 1999), see the annotated Table of Contents document.

  • [June 23, 2000] "Globalizing e-commerce. Open standards like XML and Unicode are promoting truly global software." By Jim Melnick (President, Internet Interactive Services). From IBM DeveloperWorks. June 2000. ['See how open standards like XML and Unicode are helping to open wallets as e-tailing spreads across the planet; The combination of eXtensible Markup Language (XML), XML-enabled browsers, and Unicode fonts will soon make some forms of multilingual e-commerce possible. That prospect could bring about another Internet revolution. Open standards will play a critical role in producing software that is ready for the global economy. Jim Melnick describes the building blocks that will be used to construct multilingual e-commerce applications. These critical components of any global business strategy will come to fruition with a wider use and understanding of XML, the proliferation of XML-enabled browsers, and the use of Unicode as the universal encoding standard upon which truly global software can be built.'] "The nexus of Unicode and XML: Bringing Unicode and the properties of XML together now brings us to the nexus where multilingual applications can begin to take off. One of the most promising initial areas will probably be multilingual forms. These should be fairly easy to produce, will provide a mechanism with which to collect and synthesize real data from different language sets, and on that basis, will provide a foundation for eventually moving to true multilingual e-commerce. How will this work? We begin with an XML/Unicode-enabled browser pulling down a hypothetical multilingual Web site. Let's assume the user sees ten boxes to choose from, each described in a different language. The purpose of the site could be a business survey assessing to what degree the user is Internet-savvy. The user clicks on the box in his or her native language, and that takes the person to another page entirely in that language. 
To keep it simple, first-level types of responses are 'yes/no' or multiple-choice answers. Though the questions may be framed slightly differently to reflect cultural variances, they all ask the same thing in each language, and the range of possible answers is the same. Now XML enters the scene. The results can be tabulated across all the languages as if they were all from the same language, according to whatever XML schema has been previously devised. Then the data can be collated and manipulated across various languages according to whatever elements have been set up: <UserAge>, <NumberofComputersOwned>, <ModemSpeed>, <TypeofISP>, etc. This is a pretty powerful combination. Internet statisticians and Web advertisers are now overwhelming us (and themselves) with primarily English language-based data about our Web habits, our likes and dislikes, numbers of site impressions, and a whole host of related information. . ."

  • XML based Locale Data. [And: "Universal Locales for Linux."] From the Unicode 18 description: "We've developed 140 or more Unicode POSIX locales for Linux glibc (GNU C runtime library) using Unicode online databases, collation keys and a XML based Locale data. They have been provided to Open source community by Li18nux and IBM as IBM Public Licence. Some of them already have been packaged in glibc V2.2. The functional objective of this development is the following. (1) To generate character properties from the Unicode Character Database; (2) To generate collation data from Unicode Collation Keys database; (3) To generate other locale data from a XML based locale definition database; (4) To conform to Linux 2000 specification. Note that the XML based locale definition database above is created from the ICU (Internationalization Class for Unicode) data. ICU, Java and POSIX data in IBM will be all maintained through this XML format locale. In this paper, we describe the overall of this project and technical methodology used to develop this locale data. We also introduce tools to display this locale data for verification, and XML based locale editor which can be used to modify it." "Open Source Project for Unicode Locales for Linux using Unicode Databases, Collation Keys and XML based Locale Data," by Kentaro Noji - IBM Japan, Ltd.

  • [April 22, 1998] "10646 and All That. Unicode, ISO 10646, and the Quest for a Universal Character Set." By Tony Graham. Slides (in HTML) from a tutorial presentation on Unicode, given at the Washington SGML Users Group, Washington, D.C., April [15], 1998. Discusses (also) the use of Unicode with SGML, XML, DSSSL, and XSL. See other papers online from Mulberry Technologies Inc.

  • Unicode 3.0 and Surrogate Block in XML - Notes by Tony Graham and Tim Bray.

  • [May 03, 2000] "The skew.org XML Tutorial. A reintroduction to XML with an emphasis on character encoding." By Mike J. Brown. 2000-05-01 [or later].

  • [April 28, 2000] "Character Encodings in XML and Perl." By Michel Rodriguez. From XML.com (April 26, 2000). ['One troublesome area of XML -- that often makes people close their eyes and hope the problem will go away -- is character encodings. In his article "Character Encodings in XML and Perl," Michel Rodriguez provides an overview of encodings and XML, and gives plenty of practical examples of how you can handle and convert between character encodings using Perl. After reading his article, you'll be skipping happily from Big5 to Unicode without batting an eyelid... This article examines how to handle character encodings with XML and Perl: which encodings are handled natively, converting to and from Unicode, and what to do when your tools don't support Unicode.'] "This article examines the handling of character encodings in XML and Perl. I will look at what character encodings are and what their relationship to XML is. We will then move on to how encodings are handled in Perl, and end with some practical examples of translating between encodings... in order to encode text or data, you first need to specify an encoding for it. The most common of all encodings (at least in Western countries) is without a doubt ASCII. Other encodings you may have come across include the following: EBCDIC, which will remind some of you of the good old days when computer and IBM meant the same thing; Shift-JIS, one of the encodings used for Japanese characters; and Big 5, a Chinese encoding. What all of these encodings have in common is that they are largely incompatible. There are very good reasons for this, the first being that Western languages can live with 256 characters, encoded in 8-bits, while Eastern languages use many more, thus requiring multi-byte encodings. Recently, a new standard was created to replace all of those various encodings: Unicode, a.k.a. ISO 10646..."

  • [September 18, 1997] Superb article on Unicode, for XML/SGML developers: "Unicode and Internationalization Issues in Document Management: A Global Solution to Local Problems," by François Chahuneau, general manager of AIS/Berger-Levrault. The Gilbane Report Volume 5, Number 4 (July/August 1997) 1-25. See the bibliographic entry for other details.

  • "Towards a Truly Worldwide Web. How XML and Unicode are making it easier to publish multilingual electronic documents." By Stuart Culshaw. MultiLingual Communications and Technology. "The Web was originally designed around the ISO 8859-1 character set, which supports only Western European languages. In the early days, when the development of the Web was mainly in the US, this was not a problem, but with the growth in the use of the Internet worldwide, the number of people attempting to distribute non-English content over the Web has grown substantially. In addition, the ability to provide localized content has become an important source of competitive advantage for companies competing in the global market place. The need for more robust standards and protocols to support multilingual publishing on the Internet has become of prime importance. The recent introduction of a number of new Web technologies and standards has gone some way to improving the situation, but this is more than just a character encoding or a font display problem. The Web is a whole new medium that goes far beyond the possibilities of traditional publishing. The frontier between document content and application user interface is increasingly blurred and documents are becoming applications in themselves. These 'dynamic' documents contain a mixture of both document content and information about the content, or metadata. What is needed is a way to meet the needs of today's professional Web publishers and those of tomorrow's dynamic document application architectures. This is where the Extensible Markup Language (XML) comes in. XML looks set to make large-scale hypertext document publishing to a worldwide audience a reality at last. At the same time it will make the life of the multilingual document publisher a whole lot easier. . ." [The complete article is available in MultiLingual Communications & Technology, Volume 9, Issue 3.]

  • [August 31, 2000] "Speaking in Charsets: Building a Multilingual Web Site." By John Yunker. In WebTechniques Volume 5, Issue 9 (September 2000). ['Creating Japanese Web pages presents its own unique set of challenges. John guides you through the quagmire of character sets, encodings, glyphs, and other mysterious elements of written language.'] "There are many character sets from which to choose, including anything from Western to Cyrillic. When working with different languages, you'll need to understand the different character sets and the ways computers manipulate and display them. A character can be a tough concept to grasp if all you've ever worked with is English. Characters are not just letters of the alphabet. You have to be careful not to confuse a character with a glyph, which is the visual representation of a character. For example, the letter Z is a character, but it may be represented by a number of different glyphs. In the Times New Roman font, a Z looks much different from the way it looks in the Bookman font. . . Character sets by themselves don't mean much to computers unless they're attached to encodings that describe how to convert from bits to characters. For example, under ASCII, the number 65 represents the letter A. This mapping from number to letter is referred to as the character encoding. A computer must be told which encoding is being used, then it simply matches the number to the character..."

  • See also: ISO 8879: Character Sets and Multilingual Text, including Extended Reference Concrete Syntaxes (ERCS)
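
Several of the articles listed above (Rodriguez, Yunker, Graham) describe the same basic mechanics: a character is identified by a number, an encoding maps that number to bytes, and text in a legacy charset such as Shift-JIS is moved to Unicode by decoding and re-encoding. The following is a minimal Python sketch of those ideas, offered as an illustration only and not taken from any of the cited works. The helper names transcode and parse_title and the sample string are assumptions made for this example; the XML parsing step uses only UTF-8 and UTF-16, the two encodings that every XML processor must accept.

```python
import xml.etree.ElementTree as ET


def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from the source charset, then re-encode them in the target charset."""
    return data.decode(source).encode(target)


def parse_title(document: bytes) -> str:
    """Parse an XML document supplied as bytes and return the text of its root element."""
    return ET.fromstring(document).text


# A character is identified by a number; under ASCII and Unicode alike, 65 is "A".
letter = chr(65)

# Illustrative sample (an assumption for this sketch): a three-character Japanese word.
text = "\u65e5\u672c\u8a9e"

# The same three characters become different byte sequences in each encoding.
sjis = text.encode("shift_jis")
utf8 = transcode(sjis, "shift_jis")
utf16 = transcode(sjis, "shift_jis", "utf-16-be")

# The same document serialized in UTF-8 and in UTF-16 (Python's "utf-16" codec adds a BOM).
xml_utf8 = f'<?xml version="1.0" encoding="UTF-8"?><title>{text}</title>'.encode("utf-8")
xml_utf16 = f'<?xml version="1.0" encoding="UTF-16"?><title>{text}</title>'.encode("utf-16")

if __name__ == "__main__":
    print(letter, len(sjis), len(utf8), len(utf16))
    # Both serializations yield the same characters once parsed.
    print(parse_title(xml_utf8) == parse_title(xml_utf16) == text)
```

Going through decode and encode, rather than mapping bytes to bytes directly, mirrors the point made in the articles: Unicode acts as the common pivot between otherwise incompatible charsets.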


Document URI: http://xml.coverpages.org/unicode-xml.html
Robin Cover, Editor: robin@oasis-open.org