Which CoderResult for malformed surrogate pairs?
Ulf Zibis
Ulf.Zibis at gmx.de
Mon Sep 15 19:58:05 UTC 2008
Hi Martin,
thanks for your effort in joining this "academic" discussion. :-)
Let me try once more to interpret the javadoc of class CoderResult:
    A /malformed-input error/ is reported when a sequence of input
    units is not well-formed. Such errors are described by instances
    of this class whose |isMalformed| method returns true and whose
    |length| method returns *the length of the malformed sequence*.
    There is one unique instance of this class for all malformed-input
    errors of a given length.
A *single* HIGH_SURROGATE can't be judged malformed on its own. It can
only be judged together with the char that follows it, and *the length
of this malformed sequence is 2*.
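To make the javadoc wording concrete, here is a minimal sketch of the two properties it talks about. `CoderResult.malformedForLength`, `unmappableForLength`, `isMalformed`, `isUnmappable`, `isError` and `length` are the real API; the class `CoderResultBasics` and its `describe` helper are only names made up for this illustration.

```java
import java.nio.charset.CoderResult;

final class CoderResultBasics {
    /**
     * Hypothetical helper: renders a CoderResult in the "malformed(2)"
     * notation used in this thread.
     */
    static String describe(CoderResult r) {
        String kind = r.isMalformed() ? "malformed"
                    : r.isUnmappable() ? "unmappable"
                    : "other";
        // length() is only defined for error results, so guard the call.
        return r.isError() ? kind + "(" + r.length() + ")" : kind;
    }
}
```

With this, `CoderResultBasics.describe(CoderResult.malformedForLength(2))` gives the string `malformed(2)`, which is the result argued for above in the HIGH_SURROGATE case.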
On 13.09.2008 21:29, Martin Buchholz wrote:
>> ... in any case, the calling code could omit interpreting the
>> CoderResult.length(), and generally skip one char in case of *malformed*, as
>> there are no other causes for length > 1.
>>
>
> Only if the caller is intimately familiar with the implementation.
> If for example the input contains LOW_SURROGATE LOW_SURROGATE,
> then malformed(2) might be returned, and the caller should skip 2.
>
No, in this case malformed(1) should be returned, as a single
LOW_SURROGATE is always wrong, and the caller should skip 1 char twice.
Only in the case of HIGH_SURROGATE + !LOW_SURROGATE should malformed(2)
be returned, and the caller is in most cases not wrong to skip both
chars: if a cosmic ray turns a char into a random 16-bit value, the
probability that the result is a HIGH_SURROGATE is 1.56 % (1024 of
65536 values), but that it is a !LOW_SURROGATE is 98.44 %.
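The rule proposed above, together with a caller that honours `length()`, could be sketched roughly as follows. This is only an illustration of the proposal, not what the JDK's encoders do. The class `ProposedSurrogateRule` and its methods `check` and `dropMalformed` are hypothetical names, and the handling of a HIGH_SURROGATE at the very end of the input is an assumption, since that case is not discussed here.

```java
import java.nio.charset.CoderResult;

final class ProposedSurrogateRule {
    /**
     * Hypothetical helper: the result proposed in this mail for the char
     * at index i, or null if the input is well-formed at that position.
     */
    static CoderResult check(CharSequence s, int i) {
        char c = s.charAt(i);
        if (Character.isLowSurrogate(c)) {
            // An unpaired LOW_SURROGATE is always wrong on its own.
            return CoderResult.malformedForLength(1);
        }
        if (Character.isHighSurrogate(c)) {
            if (i + 1 == s.length()) {
                // Assumption: truncated input counts as one malformed char.
                return CoderResult.malformedForLength(1);
            }
            if (!Character.isLowSurrogate(s.charAt(i + 1))) {
                // HIGH_SURROGATE + !LOW_SURROGATE: judge both chars together.
                return CoderResult.malformedForLength(2);
            }
        }
        return null;
    }

    /** Caller side: drop malformed input by skipping result.length() chars. */
    static String dropMalformed(CharSequence s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            CoderResult r = check(s, i);
            if (r != null) {
                i += r.length();             // skip exactly what was reported
            } else if (Character.isHighSurrogate(s.charAt(i))) {
                out.append(s, i, i + 2);     // copy a valid surrogate pair
                i += 2;
            } else {
                out.append(s.charAt(i));     // copy an ordinary char
                i++;
            }
        }
        return out.toString();
    }
}
```

Under this rule LOW_SURROGATE LOW_SURROGATE is reported as malformed(1) twice, while HIGH_SURROGATE followed by an ordinary char is reported once as malformed(2). A caller that always skips `length()` chars handles both cases without knowing the implementation.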
> Any behavior change has to have a very good reason.
Yes, this is a weighty argument.
But how often would this behaviour change actually come into play? When
the erroneous text is interpreted by a human, the difference IMO
shouldn't matter.
>> 2.)
>> What's about \uFFFE and \uFFFF ?
>> Here they are denoted as "Not a character":
>> http://www.decodeunicode.org/de/u+fffe
>> So IMHO they should be valued as *malformed*, but the current JDK's encoders
>> return *unmappable*.
>>
>
> I disagree. What if the target encoding has a "character"
> with exactly the same "not a character" semantics?
>
Hm, maybe.
> E.g. if the target is itself an encoding of Unicode?
> Then the "character" would be mappable.
>
AFAIK in UTF-16 \uFFFE is only valid as the first char of a sequence.
When converting to UTF-16BE or UTF-16LE it should be skipped.
When converting from any other encoding to UTF-16, the \uFFFE should be
prefixed automatically.
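For reference, a small sketch of what the JDK's UTF-16 charsets write at the start of their output. The mark that `UTF-16` prepends is the char \uFEFF (bytes FE FF in big-endian order); \uFFFE is the value a reader sees when those two bytes are read in the wrong byte order. The class `Utf16BomDemo` and its helpers are names made up for this illustration; `String.getBytes(Charset)` and `StandardCharsets` are the real API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf16BomDemo {
    /** Renders bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text with the given charset and returns the bytes as hex. */
    static String encodedHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }
}
```

`Utf16BomDemo.encodedHex("A", StandardCharsets.UTF_16)` gives `FE FF 00 41`, whereas `UTF_16BE` gives `00 41` and `UTF_16LE` gives `41 00`, so only the plain `UTF-16` charset adds the mark.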
> Anyways, this discussion is mostly academic.
> Convincing me personally might help your cause, ...
>
Yes, that's the reason why I invested in this discussion.
-Ulf