<i18n dev> Java encoder errors
Tom Christiansen
tchrist at perl.com
Mon Sep 19 15:26:46 PDT 2011
Mark Davis ☕ <mark at macchiato.com> wrote
on Mon, 19 Sep 2011 14:41:49 PDT:
> I agree with the first part, disallowing the irregular code sequences.
Finding that Java allowed surrogates to sneak through in their UTF-8
streams like that was quite odd.
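
For illustration only, here is a minimal sketch of how one might spot
surrogates that have leaked into a UTF-8 byte stream. It relies on the
fact that U+D800..U+DFFF, if encoded as UTF-8, would appear as the lead
byte 0xED followed by a byte in 0xA0..0xBF. The class and method names
are hypothetical, and this is not a description of what any Java
decoder actually does.

```java
class SurrogateScan {
    // Returns the index of the first UTF-8-encoded surrogate, or -1 if none.
    // Assumption: the input is otherwise meant to be UTF-8, where 0xED can
    // only ever be a lead byte (continuation bytes are 0x80..0xBF).
    static int firstEncodedSurrogate(byte[] utf8) {
        for (int i = 0; i + 1 < utf8.length; i++) {
            int lead = utf8[i] & 0xFF;
            int next = utf8[i + 1] & 0xFF;
            // ED A0..BF xx would decode to U+D800..U+DFFF, which is ill-formed.
            if (lead == 0xED && next >= 0xA0 && next <= 0xBF) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] bad = { 0x61, (byte) 0xED, (byte) 0xA0, (byte) 0x80 }; // "a" + U+D800
        byte[] ok = { (byte) 0xED, (byte) 0x95, (byte) 0x9C };        // U+D55C
        System.out.println(firstEncodedSurrogate(bad));
        System.out.println(firstEncodedSurrogate(ok));
    }
}
```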
> As to the noncharacters, it would be a horrible mistake to disallow them.
> Tom, a Java code converter is far too low a level for C9; if the
> converter can't handle them, it screws up all perfectly legitimate
> *internal* interchange. C9 is really for a very high level, e.g. don't
> put them into interchanged plain text, like a web page. I agree that
> it needs more clarification.
Mark, thanks for taking the time to unravel that. It wasn't clear from
the specs where or perhaps even whether you should or should not disallow
the 66 noncharacter code points. A bit more clarity there would help.
You bring up an interesting point. If you read a web page and want to use
some of the noncharacter code points as sentinels per their suggested use
during your internal processing, you have to be able to know that they
weren't there to start with. Yes, you can check, one at a time, till you
(hopefully!) find enough that aren't there that you can use them. But if
that were what you had to do, then you could do that with any set of code
points, not just noncharacter ones. So that doesn't seem to make sense.
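
To make the "check, one at a time" approach concrete, here is a rough
sketch. The 66 noncharacters are U+FDD0..U+FDEF plus the last two code
points of each of the 17 planes. The class and the unusedSentinels
helper are hypothetical names, and the helper only considers the
contiguous U+FDD0..U+FDEF block to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

class NoncharacterScan {
    // True for the 66 noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and
    // U+nFFFF for each plane n = 0..16.
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return false; // not a Unicode code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Hypothetical helper: lists the noncharacters in U+FDD0..U+FDEF that
    // do NOT occur in the text and so could serve as internal sentinels.
    static List<Integer> unusedSentinels(String text) {
        boolean[] seen = new boolean[0xFDEF - 0xFDD0 + 1];
        text.codePoints()
            .filter(cp -> cp >= 0xFDD0 && cp <= 0xFDEF)
            .forEach(cp -> seen[cp - 0xFDD0] = true);
        List<Integer> free = new ArrayList<>();
        for (int i = 0; i < seen.length; i++) {
            if (!seen[i]) {
                free.add(0xFDD0 + i);
            }
        }
        return free;
    }

    public static void main(String[] args) {
        // U+FDD0 is already present in this input, so it is not offered.
        for (int cp : unusedSentinels("a\uFDD0b")) {
            System.out.printf("U+%04X is free%n", cp);
        }
    }
}
```

Nothing in this scan is specific to noncharacters: the same loop would
work for any candidate set of code points, which is the objection
raised above.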
People using UTF-8 or UTF-32 implementations can always steal non-Unicode
code points from above 0x10FFFF for their own internal use *provided* they
never try to pass those along, but that won't work for UTF-16 even internally.
Is there anything that they can dependably use? It appears there is not.
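
The UTF-16 limitation can be seen in Java itself, whose strings are
UTF-16. A small sketch, with hypothetical class and method names: the
largest surrogate pair decodes to exactly U+10FFFF, and
Character.toChars refuses anything beyond it, so there is no room for
out-of-range sentinels.

```java
class Utf16Ceiling {
    // The highest code point any surrogate pair can express.
    static int highestPairValue() {
        return Character.toCodePoint('\uDBFF', '\uDFFF');
    }

    // True if cp lies in 0..0x10FFFF, the range UTF-16 can address.
    // Note: this range check also accepts lone surrogate values.
    static boolean withinUtf16Range(int cp) {
        return Character.isValidCodePoint(cp);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(highestPairValue())); // 10ffff
        int beyond = 0x110000; // first value past the Unicode range
        System.out.println(withinUtf16Range(beyond));
        try {
            Character.toChars(beyond);
        } catch (IllegalArgumentException e) {
            // No sequence of UTF-16 code units exists for this value.
            System.out.println("cannot encode 0x110000 as UTF-16");
        }
    }
}
```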
It's an interesting problem, and I see that it isn't as easily solved as
I had hoped it might be. If you can't guarantee that even the 66
noncharacter code points won't be in your data stream, I'm thinking this
isn't going to be solvable at this level. It does make me wonder what
those 66 noncharacter code points really are for, then, so it's back to
rereading the specs again for me.
thanks very much,
--tom