<i18n dev> Java encoder errors
Xueming Shen
xueming.shen at oracle.com
Tue Sep 20 10:01:18 PDT 2011
On 09/19/2011 03:26 PM, Tom Christiansen wrote:
> Mark Davis ☕<mark at macchiato.com> wrote
> on Mon, 19 Sep 2011 14:41:49 PDT:
>
>> I agree with the first part, disallowing the irregular code sequences.
> Finding that Java allowed surrogates to sneak through in their UTF-8
> streams like that was quite odd.
>
As the saying goes, "be conservative in what you send, liberal in what you
accept" :-) Consider that surrogates in UTF-8 were still labeled "irregular"
rather than "ill-formed" not long ago [1], and that its C12/D36 explicitly
suggested:

C12: Processes may transform irregular code unit sequences into the
equivalent well-formed code unit sequences.

D36: As a consequence of C12, these irregular UTF-8 sequences shall not
be generated by a conformant process.

So it does not seem that odd for an implementation to continue to be
"liberal" about these surrogates :-)
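
For illustration, a minimal sketch of the two byte forms being discussed,
using U+10400 as an arbitrary supplementary code point. The class and helper
names (IrregularVsWellFormed, threeByteForm, irregularForm) are made up for
this example and are not part of any JDK API; only String.getBytes and
Character.toChars are standard calls.

```java
import java.nio.charset.StandardCharsets;

final class IrregularVsWellFormed {
    // Hypothetical helper: write one UTF-16 code unit (U+0800..U+FFFF) in 3-byte UTF-8 shape.
    static byte[] threeByteForm(char unit) {
        return new byte[] {
            (byte) (0xE0 | (unit >> 12)),
            (byte) (0x80 | ((unit >> 6) & 0x3F)),
            (byte) (0x80 | (unit & 0x3F))
        };
    }

    // "Irregular" form: each surrogate of the pair encoded separately (6 bytes).
    // Assumes the argument is a supplementary code point (above U+FFFF).
    static byte[] irregularForm(int supplementaryCodePoint) {
        char[] pair = Character.toChars(supplementaryCodePoint); // {high, low}
        byte[] out = new byte[6];
        System.arraycopy(threeByteForm(pair[0]), 0, out, 0, 3);
        System.arraycopy(threeByteForm(pair[1]), 0, out, 3, 3);
        return out;
    }

    // Well-formed form: what the standard UTF-8 encoder produces (4 bytes).
    static byte[] wellFormedForm(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(wellFormedForm(0x10400))); // F0 90 90 80
        System.out.println(hex(irregularForm(0x10400)));  // ED A0 81 ED B0 80
    }
}
```

The transformation C12 allows is from the second form to the first.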
As acknowledged in TR#26, there is data out there that does use surrogate
pairs in "UTF-8" form. It would be a little inconvenient, if not odd, to
have to use two UTF-8 converters to get the Unicode code points in and out,
especially since I would assume most developers do not even know about
CESU-8. The only thing most people would notice is that their applications
suddenly stop working on their data after upgrading from JDK N to JDK N+1.
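
A rough sketch of what that breakage looks like in practice, assuming a JDK
whose UTF-8 decoder follows the stricter rule. The class SurrogateUtf8Demo
and the helper decodeThreeByteUnits are hypothetical names for this example;
the helper is a deliberately naive "second converter" that only handles
3-byte sequences and does no validation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SurrogateUtf8Demo {
    // U+10400 stored as a surrogate pair in "UTF-8" form (D801 and DC00, 3 bytes each).
    static final byte[] CESU8_U10400 = {
        (byte) 0xED, (byte) 0xA0, (byte) 0x81,
        (byte) 0xED, (byte) 0xB0, (byte) 0x80
    };

    // True if the JDK's UTF-8 decoder accepts the bytes when told to report errors.
    static boolean strictUtf8Accepts(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical lenient decoder: treats the input as a run of 3-byte sequences,
    // each carrying one UTF-16 code unit. No validation; illustration only.
    static String decodeThreeByteUnits(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 2 < b.length; i += 3) {
            int unit = ((b[i] & 0x0F) << 12) | ((b[i + 1] & 0x3F) << 6) | (b[i + 2] & 0x3F);
            sb.append((char) unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("strict decoder accepts: " + strictUtf8Accepts(CESU8_U10400));
        String s = decodeThreeByteUnits(CESU8_U10400);
        System.out.println("lenient result: U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```

An application that read such data through the plain UTF-8 charset would
need something like the second path, or an explicit CESU-8 converter where
one is available, once the decoder stops accepting surrogates.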

That said, a standard is a standard; where possible, it is nice to follow it.
-Sherman
[1] http://unicode.org/versions/corrigendum1.html