saml-dev message

Subject: Re: [saml-dev] Encoding of URI in "Alternative SAML Artifact Format"
From: Yuji Sakata <sakatayu@nttdata.co.jp>
To: "Kremp, Juergen" <juergen.kremp@sap.com>, saml-dev@lists.oasis-open.org
Date: Mon, 02 Dec 2002 15:54:21 +0900

Hi all,

> chapter 9 of the Bindings document introduces an alternative format for the
>  Assertion Artifact:
> 
> TypeCode          := 0x0002
> RemainingArtifact := AssertionHandle SourceLocation
> AssertionHandle   := 20-byte_sequence
> SourceLocation    := URI
> 
> To create the artifact, Base64 is to be applied to the concatenation of 
> TypeCode and RemainingArtifact.
> Base64 uses Bytes as input.
> 
> The specification does not specify how to convert the character-like URI 
> into bytes. 

The following resource may be helpful:
http://www.ietf.org/rfc/rfc2396.txt
Section 2.1, "URI and non-ASCII characters" (quoted below)

I also agree that, for the SourceLocation URI, the specification should
require a single charset, define a default charset, or provide a way to
indicate the charset used.
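
To make the gap concrete, here is a minimal Python sketch of building and
parsing such an artifact. It is not taken from the specification: the
function names, the two-octet encoding of TypeCode, and the US-ASCII
default for the URI are all assumptions made for illustration. The
"charset" parameter marks exactly the step the specification leaves open.

```python
import base64

# TypeCode 0x0002 written as two octets (an assumption of this sketch).
TYPE_CODE = b"\x00\x02"
HANDLE_LENGTH = 20  # AssertionHandle := 20-byte_sequence


def build_artifact(assertion_handle, source_location, charset="us-ascii"):
    """Return Base64(TypeCode AssertionHandle SourceLocation) as text."""
    if len(assertion_handle) != HANDLE_LENGTH:
        raise ValueError("AssertionHandle must be exactly 20 bytes")
    # The step the specification leaves open: URI characters -> octets.
    # The default charset here is a hypothetical choice, not a spec rule.
    uri_octets = source_location.encode(charset)
    raw = TYPE_CODE + assertion_handle + uri_octets
    return base64.b64encode(raw).decode("ascii")


def parse_artifact(artifact, charset="us-ascii"):
    """Split an artifact back into (AssertionHandle, SourceLocation)."""
    raw = base64.b64decode(artifact)
    if raw[:2] != TYPE_CODE:
        raise ValueError("not a type 0x0002 artifact")
    handle = raw[2:2 + HANDLE_LENGTH]
    return handle, raw[2 + HANDLE_LENGTH:].decode(charset)
```

With a SourceLocation made only of US-ASCII characters, any
ASCII-compatible charset yields the same artifact. Once the URI contains a
non-ASCII character, UTF-8 and ISO-8859-1 produce different artifacts, so
sender and receiver have to agree on the charset in advance.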


---------------------------------------------
NTT Data Corporation
Yuji Sakata
Tel: +81-3-3523-8081
E-Mail: sakatayu@nttdata.co.jp
----------------------------------------------
-------------------------------------------------------------------------------------

 RFC 2396 :: 2.1 URI and non-ASCII characters
 
   The relationship between URI and characters has been a source of
   confusion for characters that are not part of US-ASCII. To describe
   the relationship, it is useful to distinguish between a "character"
   (as a distinguishable semantic entity) and an "octet" (an 8-bit
   byte). There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

   A URI is represented as a sequence of characters, not as a sequence
   of octets. That is because URI might be "transported" by means that
   are not through a computer network, e.g., printed on paper, read over
   the radio, etc.

   A URI scheme may define a mapping from URI characters to octets;
   whether this is done depends on the scheme. Commonly, within a
   delimited component of a URI, a sequence of characters may be used to
   represent a sequence of octets. For example, the character "a"
   represents the octet 97 (decimal), while the character sequence "%",
   "0", "a" represents the octet 10 (decimal).

   There is a second translation for some resources: the sequence of
   octets defined by a component of the URI is subsequently used to
   represent a sequence of characters. A 'charset' defines this mapping.
   There are many charsets in use in Internet protocols. For example,
   UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
   of characters in the repertoire of ISO 10646.
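
[Note, not part of the RFC text] The two mappings described above can be
sketched in Python as follows. The helper names are hypothetical;
urllib.parse.unquote_to_bytes is the standard-library call that resolves
"%" escapes into octets.

```python
from urllib.parse import unquote_to_bytes


def uri_chars_to_octets(component):
    """First mapping: URI characters -> octets (resolves "%" escapes)."""
    return unquote_to_bytes(component)


def octets_to_chars(octets, charset):
    """Second mapping: octets -> original characters, via a charset."""
    return octets.decode(charset)


# The RFC's own example: "a" is octet 97, and "%", "0", "a" is octet 10.
octets = uri_chars_to_octets("a%0a")
print(list(octets))  # [97, 10]

# Two escaped octets that UTF-8 maps to the single character U+00E9.
text = octets_to_chars(uri_chars_to_octets("%C3%A9"), "utf-8")
```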

   In the simplest case, the original character sequence contains only
   characters that are defined in US-ASCII, and the two levels of
   mapping are simple and easily invertible: each 'original character'
   is represented as the octet for the US-ASCII code for it, which is,
   in turn, represented as either the US-ASCII character, or else the
   "%" escape sequence for that octet.

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used.
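
[Note, not part of the RFC text] A small Python sketch of why the charset
must be known. The same original character escapes to different URI text
depending on the charset, and a receiver that assumes the wrong one either
fails or silently recovers different characters. The helper name "recover"
is hypothetical.

```python
from urllib.parse import quote, unquote_to_bytes

original = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Same character, two charsets, two different escaped forms.
escaped_utf8 = quote(original, encoding="utf-8")         # "%C3%A9"
escaped_latin1 = quote(original, encoding="iso-8859-1")  # "%E9"


def recover(component, charset):
    """Resolve "%" escapes, then decode the octets with the given charset."""
    return unquote_to_bytes(component).decode(charset)


# Wrong guess, silent corruption: two characters come back instead of one.
garbled = recover(escaped_utf8, "iso-8859-1")

# Wrong guess, hard failure: the octet 0xE9 alone is not valid UTF-8.
try:
    recover(escaped_latin1, "utf-8")
except UnicodeDecodeError:
    print("cannot decode %E9 as UTF-8")
```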

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.