Which CoderResult for malformed surrogate pairs?
Ulf Zibis
Ulf.Zibis at gmx.de
Wed Sep 10 15:22:59 UTC 2008
Hi Martin,
thanks for the quick first answer.
You are right, either of the two chars could be the corrupt one.
IMO, returning CoderResult.malformedForLength(2) would be more
informative, and the developer could then decide for himself whether to
take CoderResult.length() into account.
Why differentiate by length at all if nobody makes use of it?
No other cause would produce a length other than 1 from a
CharsetEncoder.
So do you think it would be against the spec to return
CoderResult.malformedForLength(2) in such cases, even though returning
CoderResult.malformedForLength(1) isn't a bug?
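For illustration, here is a minimal sketch of how the reported length can be inspected. It assumes the JDK's UTF-8 encoder and an input made of a high surrogate followed by an ordinary char; the class name SurrogateProbe and the method probe are made up for this example. Whether the length comes back as 1 or 2 depends on the encoder implementation, which is exactly the question here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class SurrogateProbe {
    // Hypothetical helper: encodes s with a fresh UTF-8 encoder and
    // returns the CoderResult of the first encode() call.
    static CoderResult probe(String s) {
        // A new encoder reports malformed input instead of replacing it.
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(16);
        return enc.encode(in, out, true);
    }

    public static void main(String[] args) {
        // High surrogate \uD800 followed by 'x', which is not a low surrogate.
        CoderResult r = probe("\uD800x");
        System.out.println("malformed: " + r.isMalformed());
        if (r.isError()) {
            // length() is only defined for error results.
            System.out.println("length:    " + r.length());
        }
    }
}
```

A caller that wants to skip the bad input can advance the input buffer by r.length(), so the choice between 1 and 2 decides whether the char after the high surrogate is dropped as well.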
BTW:
The chance of erroneously receiving a high surrogate in the range
\uD800..\uDBFF is 1.56 % (1024 of 65536 char values, assuming all
values are equally likely).
The chance of receiving a char outside the range \uDC00..\uDFFF
after a correct high surrogate is 98.44 % (64512 of 65536).
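The two shares can be recomputed with a few lines of Java. This is only a sketch under the assumption that every one of the 65536 char values is equally likely; the class and method names are invented for the example.

```java
public class SurrogateOdds {
    static final int ALL_CHARS = 0x10000;                    // 65536 possible char values
    static final int HIGH_SURROGATES = 0xDBFF - 0xD800 + 1;  // 1024
    static final int LOW_SURROGATES = 0xDFFF - 0xDC00 + 1;   // 1024

    // Percentage of char values that are high surrogates.
    static double highSurrogatePercent() {
        return 100.0 * HIGH_SURROGATES / ALL_CHARS;
    }

    // Percentage of char values that are NOT low surrogates.
    static double notLowSurrogatePercent() {
        return 100.0 * (ALL_CHARS - LOW_SURROGATES) / ALL_CHARS;
    }

    public static void main(String[] args) {
        System.out.printf("high surrogate:    %.2f %%%n", highSurrogatePercent());
        System.out.printf("not low surrogate: %.2f %%%n", notLowSurrogatePercent());
    }
}
```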
-Ulf
On 09.09.2008 23:58, Martin Buchholz wrote:
> I think when encountering a single high surrogate,
> it is correct to return a length of either 1 or 2.
> A thought experiment: a cosmic ray that mangled exactly one char
> could have caused this situation if the original sequence was
> of length either 1 or 2, depending on which char was mangled.
>
> Not a Defect.
>
> Martin
>
>