Which CoderResult for malformed surrogate pairs?

Ulf Zibis Ulf.Zibis at gmx.de
Tue Sep 9 21:38:38 UTC 2008


Hi all,

as you may have noticed, I'm working on enhancing the sun.nio.cs package:
https://java-nio-charset-enhanced.dev.java.net/

Unicode code points > \uFFFF are represented in the JVM by 2 chars,
called surrogates.
The 1st char, called the high surrogate, is in the range \uD800..\uDBFF, and
the 2nd char, called the low surrogate, is in the range \uDC00..\uDFFF.
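
To make the ranges concrete, here is a minimal sketch that splits one
supplementary code point into its two chars. The class name
SurrogateBasics and the sample code point U+1D11E are only assumptions
chosen for illustration; the Character methods are from java.lang.

```java
public class SurrogateBasics {

    // Splits a code point into its UTF-16 chars (2 chars for code points > \uFFFF).
    public static char[] toSurrogates(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1D11E; // sample supplementary code point (assumption)
        char[] pair = toSurrogates(codePoint);

        // The 1st char should lie in \uD800..\uDBFF, the 2nd in \uDC00..\uDFFF.
        System.out.printf("high=%04X low=%04X%n", (int) pair[0], (int) pair[1]);
        System.out.println("high surrogate? " + Character.isHighSurrogate(pair[0]));
        System.out.println("low surrogate?  " + Character.isLowSurrogate(pair[1]));

        // Recombining both chars gives back the original code point.
        System.out.println("round trip ok?  "
                + (Character.toCodePoint(pair[0], pair[1]) == codePoint));
    }
}
```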

1.) If the 1st char is erroneously in the range \uDC00..\uDFFF,
sun.nio.cs encoders return CoderResult.malformedForLength(1). OK.
2.) If the 1st char is correctly in the range \uD800..\uDBFF, but the
2nd char is erroneously NOT in the range \uDC00..\uDFFF, most sun.nio.cs
encoders (I have not tested all of them) also return
CoderResult.malformedForLength(1).
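
Both cases can be probed with a small sketch like the following. The
class name MalformedSurrogateProbe, the sample inputs and the choice of
UTF-8 are assumptions for illustration only; other charsets or JDK
versions may report something different.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class MalformedSurrogateProbe {

    // Encodes the input with a fresh UTF-8 encoder and returns the raw CoderResult.
    // A new encoder reports malformed input by default instead of replacing it.
    public static CoderResult encode(String input) {
        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
        CharBuffer in = CharBuffer.wrap(input);
        ByteBuffer out = ByteBuffer.allocate(16);
        return encoder.encode(in, out, true);
    }

    public static void main(String[] args) {
        // Case 1: the 1st char is a low surrogate.
        CoderResult case1 = encode("\uDC00A");
        System.out.println("case 1: " + case1 + ", length=" + case1.length());

        // Case 2: a high surrogate followed by a char that is not a low surrogate.
        CoderResult case2 = encode("\uD800A");
        System.out.println("case 2: " + case2 + ", length=" + case2.length());
    }
}
```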

IMO, in the 2nd case the encoders should return
CoderResult.malformedForLength(2), because the malformed code point
consists of 2 chars.
Additionally, it would be much easier to skip the malformed code point in
the affected java.nio.CharBuffer, just by using CoderResult.length().
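
A minimal sketch of such a skip loop, assuming a hypothetical helper
class SkipMalformed (not part of sun.nio.cs): whenever the encoder
reports an error, the CharBuffer position is advanced by
CoderResult.length() and encoding resumes. How many chars get dropped
therefore depends entirely on the length the encoder reports.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class SkipMalformed {

    // Encodes the input, dropping whatever the encoder reports as erroneous.
    public static byte[] encodeSkippingMalformed(String input, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        CharBuffer in = CharBuffer.wrap(input);
        // Sized generously so that overflow should not occur for this input.
        int capacity = (int) Math.ceil(input.length() * encoder.maxBytesPerChar()) + 8;
        ByteBuffer out = ByteBuffer.allocate(capacity);

        while (true) {
            CoderResult cr = encoder.encode(in, out, true);
            if (cr.isUnderflow()) {
                break; // all input consumed
            }
            if (cr.isError()) {
                // Skip exactly as many chars as the encoder reported.
                in.position(in.position() + cr.length());
                continue;
            }
            throw new IllegalStateException("unexpected result: " + cr);
        }
        encoder.flush(out);

        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        // Sample input (assumption): a lone low surrogate between 'A' and 'B'.
        byte[] bytes = encodeSkippingMalformed("A\uDC00B", "UTF-8");
        System.out.println(new String(bytes, Charset.forName("UTF-8")));
    }
}
```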

See also: 
http://java.sun.com/javase/6/docs/api/java/nio/charset/CoderResult.html#length()

What do you think about this?

I'm thinking about reporting a bug concerning this "wrong" encoder result.

Thanks in advance for a brisk discussion.

-Ulf





