Which CoderResult for malformed surrogate pairs ?
Martin Buchholz
martinrb at google.com
Tue Sep 9 21:58:47 UTC 2008
I think when encountering a single high surrogate,
it is correct to return a length of either 1 or 2.
A thought experiment: a cosmic ray that mangled exactly one char
could have caused this situation if the original sequence was
of length either 1 or 2, depending on which char was mangled.
Not a Defect.
Martin
On Tue, Sep 9, 2008 at 14:38, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> Hi all,
>
> As you may have noticed, I'm working on enhancing the sun.nio.cs package:
> https://java-nio-charset-enhanced.dev.java.net/
>
> Unicode code points above \uFFFF are represented in the JVM by 2 chars,
> called surrogates. The 1st char, called the high surrogate, is in the
> range \uD800..\uDBFF, and the 2nd char, called the low surrogate, is in
> the range \uDC00..\uDFFF.
>
> 1.) If the 1st char is erroneously in the range \uDC00..\uDFFF,
> sun.nio.cs encoders return CoderResult.malformedForLength(1). OK.
> 2.) If the 1st char is correctly in the range \uD800..\uDBFF, but the 2nd
> char is erroneously NOT in the range \uDC00..\uDFFF, most sun.nio.cs
> encoders (I have not tested all of them) also return
> CoderResult.malformedForLength(1).
>
> IMO, in case 2 the encoders should return
> CoderResult.malformedForLength(2), because the malformed code point
> consists of 2 chars.
> Additionally, it would be much easier to skip the malformed code point in
> the affected java.nio.CharBuffer by simply using CoderResult.length().
>
> See also:
> http://java.sun.com/javase/6/docs/api/java/nio/charset/CoderResult.html#length()
>
> What do you think about this?
>
> I'm thinking about reporting a bug concerning this "wrong" encoder result.
>
> Thanks in advance for a brisk discussion.
>
> -Ulf
>
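
For illustration, the skipping approach discussed in the quoted message (advancing the CharBuffer by CoderResult.length() after a malformed result) might look roughly like the sketch below. It is not code from this thread: the class and method names are hypothetical, UTF-8 is an arbitrary choice of encoder, and the sketch deliberately does not assume whether the reported length is 1 or 2, since that may differ between encoders and JDK versions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the JDK.
final class SurrogateSkipSketch {

    // Encodes the input as UTF-8, skipping malformed input by
    // advancing the CharBuffer by whatever length the encoder reports.
    static byte[] encodeSkippingMalformed(String input) {
        // A new encoder reports malformed input by default (CodingErrorAction.REPORT).
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(input);
        // UTF-8 needs at most 3 bytes per char, so this cannot overflow.
        ByteBuffer out = ByteBuffer.allocate(input.length() * 3 + 8);

        while (true) {
            CoderResult cr = encoder.encode(in, out, true);
            if (cr.isUnderflow()) {
                break; // all input consumed
            }
            if (cr.isMalformed()) {
                System.out.println("malformed input at char index " + in.position()
                        + ", reported length " + cr.length());
                // Skip exactly as many chars as the encoder reported.
                in.position(in.position() + cr.length());
            } else {
                throw new IllegalStateException("unexpected result: " + cr);
            }
        }
        encoder.flush(out);
        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // 'a', an unpaired high surrogate, then 'b' (case 2 in the quoted message).
        String input = "a" + (char) 0xD800 + "b";
        byte[] bytes = encodeSkippingMalformed(input);
        System.out.println("encoded " + bytes.length + " byte(s)");
    }
}
```

With the input used in main, the choice of length is visible in the output: a reported length of 1 skips only the unpaired high surrogate and keeps 'b', whereas a reported length of 2 would skip 'b' as well.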
More information about the core-libs-dev
mailing list