Which CoderResult for malformed surrogate pairs?

Martin Buchholz martinrb at google.com
Sat Sep 13 19:29:23 UTC 2008


On Sat, Sep 13, 2008 at 06:36, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> Hi Martin,

> ... in any case, the calling code could ignore CoderResult.length()
> and simply skip one char whenever the result is *malformed*, as
> nothing else causes length > 1.

Only if the caller is intimately familiar with the implementation.
If for example the input contains LOW_SURROGATE LOW_SURROGATE,
then malformed(2) might be returned, and the caller should skip 2.
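To illustrate the point, here is a minimal sketch of a caller that relies only on the documented CoderResult contract and skips exactly result.length() chars on error, instead of assuming 1. The class and method names (SkipMalformed, encodeSkippingBadInput) are made up for this example, and the choice of UTF-8 as the target is an assumption; it is not code from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SkipMalformed {

    // Encodes s as UTF-8, dropping any input the encoder reports as an error,
    // then decodes the bytes again so the effect is easy to inspect.
    static String encodeSkippingBadInput(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer in = CharBuffer.wrap(s);
        // Generous bound: UTF-8 needs at most 3 bytes per char.
        ByteBuffer out = ByteBuffer.allocate(s.length() * 4 + 16);

        while (true) {
            CoderResult cr = enc.encode(in, out, true);
            if (cr.isUnderflow()) {
                break; // all input consumed
            }
            if (cr.isError()) {
                // Skip as many chars as the encoder says are bad.
                // Do not hard-code 1 here: the spec allows longer lengths.
                in.position(in.position() + cr.length());
            } else {
                // Overflow cannot happen with the buffer size chosen above.
                throw new IllegalStateException("unexpected result: " + cr);
            }
        }
        enc.flush(out);
        out.flip();
        return StandardCharsets.UTF_8.decode(out).toString();
    }
}
```

With this loop, an input such as LOW_SURROGATE LOW_SURROGATE is dropped entirely whether the encoder reports it as two malformed(1) results or as a single malformed(2), so the caller does not depend on which one a given implementation chooses.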

> If I understand you correctly, returning length == 2 would not violate
> the API spec. Would it?

Right.  But that's not good enough to change things.
Any behavior change has to have a very good reason.

> 2.)
> What about \uFFFE and \uFFFF?
> Here they are described as "Not a character":
> http://www.decodeunicode.org/de/u+fffe
> So IMHO they should be treated as *malformed*, but the current JDK's encoders
> return *unmappable*.

I disagree.  What if the target encoding has a "character"
with exactly the same "not a character" semantics?
E.g. if the target is itself an encoding of Unicode?
Then the "character" would be mappable.
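A small sketch of that distinction, assuming ISO-8859-1 as an example of a target with no slot for U+FFFE and UTF-8 as an example of a target that is itself an encoding of Unicode. The class and method names (NoncharacterMapping, encodeOnce) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper, for illustration only.
final class NoncharacterMapping {

    // Runs a single encode call over s and returns the raw CoderResult.
    // A fresh encoder reports errors by default (CodingErrorAction.REPORT).
    static CoderResult encodeOnce(Charset cs, String s) {
        CharsetEncoder enc = cs.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(16); // ample for one char
        return enc.encode(CharBuffer.wrap(s), out, true);
    }
}
```

Calling encodeOnce(StandardCharsets.ISO_8859_1, "\uFFFE") yields an unmappable result of length 1, while encodeOnce(StandardCharsets.UTF_8, "\uFFFE") completes with underflow because UTF-8 has a byte sequence for that code point. The same input is therefore mappable or not depending on the target, which is why *unmappable* fits better than *malformed* here.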

Anyway, this discussion is mostly academic.
Convincing me personally might help your cause,
but it's unlikely even then that such a change would
be approved for inclusion in the JDK.
I still think the existing behavior is the better choice.

Martin