Which CoderResult for malformed surrogate pairs?

Mon Sep 15 19:58:05 UTC 2008

Hi Martin,

thanks for your effort in joining this "academic" discussion. :-)

Let me try once more to interpret the javadoc of class CoderResult:

      A /malformed-input error/ is reported when a sequence of input
      units is not well-formed. Such errors are described by instances
      of this class whose |isMalformed| method returns true and whose
      |length| method returns *the length of the malformed sequence*.
      There is one unique instance of this class for all malformed-input
      errors of a given length.

A *single* HIGH_SURROGATE can't be judged malformed on its own. It can 
only be judged together with the char that follows it, and *the length 
of this malformed sequence is 2*.
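
To make the caller's side of this concrete, here is a minimal sketch of 
a caller that honours CoderResult.length() instead of assuming 1. The 
class and method names (SkipByLength, encodeSkippingErrors) are made up 
for illustration, and US-ASCII is chosen only because it makes 
unmappable input easy to produce. Which length a given JDK reports for 
a particular malformed sequence is exactly the open question here, so 
the sketch does not depend on it:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class SkipByLength {

    // Encodes text to US-ASCII and drops every input sequence that the
    // encoder reports as malformed or unmappable. The number of chars
    // dropped is taken from CoderResult.length(), not assumed to be 1.
    static String encodeSkippingErrors(String text) {
        // A fresh encoder reports errors by default (CodingErrorAction.REPORT).
        CharsetEncoder enc = StandardCharsets.US_ASCII.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        // US-ASCII produces at most one byte per char.
        ByteBuffer out = ByteBuffer.allocate(text.length() + 8);
        while (true) {
            CoderResult cr = enc.encode(in, out, true);
            if (cr.isUnderflow()) {
                break; // all input consumed
            }
            if (cr.isOverflow()) {
                throw new IllegalStateException("output buffer too small");
            }
            // Malformed or unmappable: skip as many chars as reported.
            in.position(in.position() + cr.length());
        }
        enc.flush(out);
        out.flip();
        return StandardCharsets.US_ASCII.decode(out).toString();
    }

    public static void main(String[] args) {
        // Lone LOW_SURROGATE between two ASCII chars.
        System.out.println(encodeSkippingErrors("a\uDC00b"));
        // Valid surrogate pair (U+1F600), unmappable in US-ASCII.
        System.out.println(encodeSkippingErrors("a\uD83D\uDE00b"));
    }
}
```

The valid pair in the second call is the case where a length of 2 shows 
up today: it is unmappable rather than malformed, but the caller still 
has to skip 2 chars.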

On 13.09.2008 21:29, Martin Buchholz wrote:
>> ... in any case, the calling code could omit interpreting the
>> CoderResult.length(), and generally skip one char in case of *malformed*, as
>> there are no other causes for length > 1.
>>     
>
> Only if the caller is intimately familiar with the implementation.
> If for example the input contains LOW_SURROGATE LOW_SURROGATE,
> then malformed(2) might be returned, and the caller should skip 2.
>   

No, in this case malformed(1) should be returned, as a single 
LOW_SURROGATE is always wrong, and the caller should skip 1 char twice.
Only in the case of HIGH_SURROGATE + !LOW_SURROGATE should malformed(2) 
be returned, and in most cases the caller isn't wrong to skip both 
chars: if a cosmic ray produces a random char value, the probability 
that it is a HIGH_SURROGATE is 1.56 %, but the probability that it is a 
!LOW_SURROGATE is 98.44 %.
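
Here is a small sketch of the rule I am proposing, together with the 
arithmetic behind the two percentages. This is *not* what the current 
JDK encoders do; the class SurrogateProposal, its methods and the 
NEED_MORE_INPUT sentinel are hypothetical names, and treating a 
trailing HIGH_SURROGATE as "need more input" is my assumption, since I 
did not cover that case above:

```java
import java.util.Locale;

final class SurrogateProposal {

    // Assumption: a HIGH_SURROGATE as the last char is not yet an error,
    // the caller has to supply more input first.
    static final int NEED_MORE_INPUT = -1;

    // Returns the proposed malformed length for the char at index i:
    // 0 = well-formed (single BMP char or start of a valid pair),
    // 1 = unpaired LOW_SURROGATE,
    // 2 = HIGH_SURROGATE followed by a char that is not a LOW_SURROGATE.
    // The caller must not pass the index of the second half of a valid pair.
    static int proposedMalformedLength(CharSequence s, int i) {
        char c = s.charAt(i);
        if (Character.isLowSurrogate(c)) {
            return 1; // a single LOW_SURROGATE is always wrong
        }
        if (!Character.isHighSurrogate(c)) {
            return 0; // ordinary char
        }
        if (i + 1 >= s.length()) {
            return NEED_MORE_INPUT;
        }
        return Character.isLowSurrogate(s.charAt(i + 1)) ? 0 : 2;
    }

    // Number of char values that are HIGH_SURROGATEs.
    static int countHighSurrogates() {
        int n = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.isHighSurrogate((char) c)) n++;
        }
        return n;
    }

    // Number of char values that are not LOW_SURROGATEs.
    static int countNonLowSurrogates() {
        int n = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            if (!Character.isLowSurrogate((char) c)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(proposedMalformedLength("\uDC00\uDC00", 0)); // LOW LOW
        System.out.println(proposedMalformedLength("\uD800A", 0));      // HIGH + !LOW
        // Share of all 65536 char values, in percent.
        System.out.println(String.format(Locale.ROOT, "%.2f %%",
                100.0 * countHighSurrogates() / 65536));
        System.out.println(String.format(Locale.ROOT, "%.2f %%",
                100.0 * countNonLowSurrogates() / 65536));
    }
}
```

The counts are 1024 HIGH_SURROGATE values and 64512 values that are not 
a LOW_SURROGATE, out of 65536, which is where 1.56 % and 98.44 % come 
from.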

> Any behavior change has to have a very good reason.

Yes, this is a weighty argument.
But how often would this behaviour change actually come into play? When 
the erroneous text is interpreted by a human, the difference IMO 
shouldn't matter.

>> 2.)
>> What's about \uFFFE and \uFFFF ?
>> Here they are denoted as "Not a character":
>> http://www.decodeunicode.org/de/u+fffe
>> So IMHO they should be valued as *malformed*, but the current JDK's encoders
>> return *unmappable*.
>>     
>
> I disagree.  What if the target encoding has a "character"
> with exactly the same "not a character" semantics?
>   
Hm, maybe.

> E.g. if the target is itself an encoding of  Unicode?
> Then the "character" would be mappable.
>   

AFAIK in UTF-16 \uFFFE is only valid as the first char of a sequence. 
When converting to UTF-16BE or UTF-16LE it should be skipped.
When converting from any other encoding to UTF-16, the \uFFFE should be 
prefixed automatically.
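
For reference, a quick sketch of what the JDK's three UTF-16 charsets 
write when encoding, as far as I understand the Charset javadoc: 
"UTF-16" prefixes a big-endian byte-order mark, while "UTF-16BE" and 
"UTF-16LE" write none. The class name Utf16BomDemo and the hex helper 
are only for illustration. Note that the mark written is U+FEFF; U+FFFE 
is what the same two bytes yield when read in the opposite byte order:

```java
import java.nio.charset.StandardCharsets;

final class Utf16BomDemo {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16": byte-order mark FE FF, then big-endian data.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));
        // "UTF-16BE" and "UTF-16LE": no byte-order mark is written.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE)));
    }
}
```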

> Anyways, this discussion is mostly academic.
> Convincing me personally might help your cause, ...
>   

Yes, that's the reason why I invested in this discussion.

-Ulf
