office-formula message

Subject: RE: [office-formula] Summary 2010-09-07 - IRI vs URI

From: "Dennis E. Hamilton" <dennis.hamilton@acm.org>
To: "'Andreas J. Guelzow'" <andreas.guelzow@concordia.ab.ca>
Date: Tue, 7 Sep 2010 21:55:05 -0700

Andreas,

Thanks for pointing out that language in [RFC3987].

Here's my extended analysis of what is involved with that.  

I believe the statement is patently incorrect if read to mean that IRIs and
URIs are co-extensive and that nothing needs to be done.  Were it literally
true, the IRI specification would be very short and would not have to pay
attention to mappings and to when and how they apply.  Nor would it require
a separate grammar for IRIs.

Whatever the case, I believe it is necessary to say exactly how we expect
IRIs to be mapped to URIs, what the admissible IRIs are in the case of
relative references to package files and subdocuments, and what the
corresponding manifest:full-path and Zip directory file names are.

GRANDFATHERED URIS VS. UNICODE IRIS

Here is the context in section 3.1, revealingly entitled "Mapping of IRIs to
URIs":

"The above mapping from IRIs to URIs produces URIs fully conforming to
   [RFC3986].  The mapping is also an identity transformation for URIs
   and is idempotent;  applying the mapping a second time will not
   change anything.  Every URI is by definition an IRI."

There is no question the mapping is idempotent as defined in [RFC3987],
because the resulting URI has no disallowed ASCII-character encodings and so
running the mapping again changes nothing.  That is to say, the mapping is
an identity transformation for IRIs that are already well-formed URIs.
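
As a rough illustration of that behaviour, here is a minimal Python sketch.
It is my own simplification, not text from [RFC3987]: the name iri_to_uri is
hypothetical, every non-ASCII character is treated as needing encoding, and
the normalization and host-name (IDNA) steps of section 3.1 are left out.

```python
def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (assumption: no IDNA, no normalization).

    ASCII characters pass through unchanged; every other character is
    replaced by the %-encoding of its UTF-8 bytes.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # already a URI character, or already %-encoded
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


uri = iri_to_uri("Pictures/日本.png")
print(uri)                     # Pictures/%E6%97%A5%E6%9C%AC.png
print(iri_to_uri(uri) == uri)  # True: a second pass changes nothing
```

The second call shows the idempotence: once only ASCII remains, there is
nothing left for the mapping to touch.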

It is in that sense that I say IRIs are subsets of URIs or, put better, the
image of admissible IRIs that are not already syntactically well-formed URIs
is a subset of the URIs.  Also, there is the usual problem of mappings of
this nature in that there is no assured inversion from an IRI-mapped URI
back to an IRI that is not the URI.

This is also the sense in which I believe the statement "Every URI is by
definition an IRI" is at best misleading and at worst simply incorrect,
since there are well-formed URIs that can never be produced from IRIs that
are entirely in Unicode and that only %-encode Basic Latin characters that
are not permitted as single-character <pchar>s.  It is further misleading if
taken to mean that a syntactical IRI can be used where URIs are required.
There are many places where well-formed URIs are required (e.g., the XML
Schema anyURI datatype).

WHY FUSS ABOUT THIS?

To ensure that the mapping can be accurately inverted, it is necessary to
restrict what %-encoded bytes are allowed in URIs and which are to be
employed in reconstructing a non-URI IRI, if any, that is presumably the
inverse mapping of the URI in hand.  This is a strong constraint, because it
signifies to me that only Unicode is carried by IRIs (where URIs do not
necessarily have that limitation on how %-encoded bytes with values greater
than %7f are to be understood).  The considerations for assuring an inverse
mapping are reflected in section 3.2 and perhaps elsewhere in [RFC3987].  
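
To make the inversion problem concrete, here is a hedged Python sketch of a
conservative URI-to-IRI conversion.  The name uri_to_iri is hypothetical and
the rule is my simplification of section 3.2: a run of %-encoded bytes above
%7F is decoded only when the whole run is valid UTF-8, and everything else
is left exactly as found.  The further exclusions that [RFC3987] makes for
particular characters are not modelled.

```python
import re

# Runs of %-encoded bytes whose values are all 0x80 or greater.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Decode %-encoded runs only when they form valid UTF-8 (a sketch)."""

    def decode_run(match):
        run = match.group(0)
        raw = bytes.fromhex(run.replace("%", ""))
        try:
            return raw.decode("utf-8")  # strict: rejects malformed sequences
        except UnicodeDecodeError:
            return run  # not Unicode carried as UTF-8; leave it encoded

    return _HIGH_RUN.sub(decode_run, uri)


print(uri_to_iri("Pictures/%E6%97%A5%E6%9C%AC.png"))  # Pictures/日本.png
print(uri_to_iri("caf%E9.txt"))                       # caf%E9.txt
```

The second URI is well-formed, but %E9 alone is not UTF-8 (it looks like a
Latin-1 byte), so no IRI written entirely in Unicode maps onto it.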

I assume that, to satisfy the requirement that ODF support IRIs at all, one
needs to ensure that an IRI using characters not allowed in URIs is
recoverable, so as to satisfy whatever use cases are in the minds of those
who require that IRIs be supported in the naming of package files and in URI
references generally.  Since the requirement came from JTC1 National Body
Japan, I
presume that it is desirable to see the IRIs with the actual CJK characters
whose Unicode code points are IRI-encoded in the naming of package files and
in the introduction of URIs in various ways in ODF documents.  How packages
could be extracted into file systems allowing CJK character encodings in
file and directory names is, of course, outside of our control beyond
providing interested parties a consistent way of interpreting the Zip file
names that ODF restricts itself to.

It is on behalf of that requirement that I believe it is important to
ascertain, within ODF, when an IRI must be mapped to a URI and the form of
URI to be used.  This matters, in particular, for any relative references that
involve segments that are part of ODF Package manifest:full-path values and
that need to match what is used in the Zip directory entry for a package
file and/or the ODF notion of an identified package subdocument.  My
recommendation is that the manifest:full-path always be fully IRI encoded
(even though it is neither IRI nor URI) and likewise for the corresponding
Zip directory entry, when there is one.  Furthermore, the only %-encodings
should be what is required for this purpose and the only unencoded
characters should be a limited subset of what is used in URIs.  My
recommendation is to allow only non-empty segment names having only <pchar>s
without ":", perhaps without "@", and with "/" as the segment separator.
There should be no "." and ".." segments, since these only exist in URI
references and are inappropriate in Zip directory file names.
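
A check along those lines could be sketched as follows.  This is only an
illustration of the recommendation, not anything ODF specifies: the function
name and the allow_at switch are hypothetical, the character set is <pchar>
from [RFC3986] minus ":" (and minus "@" unless allowed), and the sketch does
not verify that each %-encoding is one the IRI mapping actually requires.
Because empty segments are rejected, a path with a trailing "/" also fails.

```python
import re

# unreserved / sub-delims / pct-encoded, i.e. <pchar> without ":" and "@"
_SEGMENT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})+")
# the same, but also permitting "@"
_SEGMENT_AT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=@]|%[0-9A-Fa-f]{2})+")


def is_admissible_full_path(path: str, allow_at: bool = False) -> bool:
    """Check a candidate manifest:full-path / Zip entry name (a sketch)."""
    pattern = _SEGMENT_AT if allow_at else _SEGMENT
    for segment in path.split("/"):
        if segment in ("", ".", ".."):
            return False  # empty and dot segments are not allowed
        if pattern.fullmatch(segment) is None:
            return False  # unencoded non-ASCII, ":", stray "%", and so on
    return True


print(is_admissible_full_path("Pictures/%E6%97%A5%E6%9C%AC.png"))  # True
print(is_admissible_full_path("Pictures/../secret.xml"))           # False
print(is_admissible_full_path("Pictures/日本.png"))                 # False
```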


 - Dennis

-----Original Message-----
From: Andreas J. Guelzow [mailto:andreas.guelzow@concordia.ab.ca] 
Sent: Tuesday, September 07, 2010 17:07
To: dennis.hamilton@acm.org
Cc: dwheeler@dwheeler.com; office-formula@lists.oasis-open.org
Subject: RE: [office-formula] Summary 2010-09-07 - IRI vs URI

[ ... ]

Of course more importantly:
http://www.ietf.org/rfc/rfc3987.txt states in the third last paragraph
on page 10:
"Every URI is by definition an IRI."

And I think we should consider RFC3987 as authoritative on this matter.

Andreas 




