Which CoderResult for malformed surrogate pairs ?
Ulf Zibis
Ulf.Zibis at gmx.de
Sat Sep 13 13:36:52 UTC 2008
Hi Martin,
1.)
If I understand you correctly, you would prefer to present as many
characters as possible to the human reader.
Regarding presenting to the human:
In my experience, text is more readable when some characters are
omitted than when it contains additional invalid characters. Abbreviations
benefit from this fact. Humans can easily interpolate missing
characters (this may be different in Asian languages).
Assume the following chars: { \uD8xx, \uDCxx, \u002C, \u0031, \u0030,
\u0030, \u0024 }.
If the 2nd char is mangled to \u0039, encoders that use
CoderResult.length() to decide how many chars to erase would produce
the following:
- if encode returns CoderResult.malformedForLength(1): ?9,100 $
- if encode returns CoderResult.malformedForLength(2): ?,100 $
I think the 2nd would be closer to the truth, but that is only an opinion.
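
The two outcomes above can be reproduced with a small simulation. This is only a sketch: eraseMalformed and reportedLength are hypothetical names, no real encoder is involved, and U+D800 stands in for the unspecified high surrogate. The char list contains no space, so the output has none before the '$'.

```java
class MalformedEraseSketch {
    // Hypothetical helper: mimics calling code that writes one '?' for each
    // unpaired high surrogate and then skips as many chars as the
    // (assumed) reported CoderResult length.
    static String eraseMalformed(char[] in, int reportedLength) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            char c = in[i];
            boolean high = Character.isHighSurrogate(c);
            boolean paired = high && i + 1 < in.length
                    && Character.isLowSurrogate(in[i + 1]);
            if (high && !paired) {
                out.append('?');
                i += Math.min(reportedLength, in.length - i);
            } else if (paired) {
                out.append(c).append(in[i + 1]); // valid pair, copy both chars
                i += 2;
            } else {
                out.append(c); // unpaired low surrogates are not handled here
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The low surrogate has been mangled to '9' (U+0039).
        char[] mangled = { '\uD800', '9', ',', '1', '0', '0', '$' };
        System.out.println(eraseMalformed(mangled, 1)); // ?9,100$
        System.out.println(eraseMalformed(mangled, 2)); // ?,100$
    }
}
```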
In any case, the calling code could ignore CoderResult.length() and
always skip one char when the result is *malformed*, as nothing else
causes a length > 1.
CoderResult.malformedForLength(2) would additionally convey that 2 chars
were taken into account for this result. This could be valuable
for the calling code.
If I understand you correctly, returning length == 2 would not violate
the API spec. Is that right?
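
To see which length a given encoder actually reports, a probe along the following lines could be used. It is a minimal sketch: the class and method names are made up for illustration, UTF-8 is chosen arbitrarily, and the reported length may depend on the charset and JDK version.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class MalformedLengthProbe {
    // Encodes the input with a fresh UTF-8 encoder and returns the
    // CoderResult of the first encode call. A new encoder reports
    // malformed input by default, so the result is handed back to the
    // caller instead of being replaced or ignored.
    static CoderResult probe(CharSequence input) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(4 * input.length() + 4);
        return encoder.encode(CharBuffer.wrap(input), out, true);
    }

    public static void main(String[] args) {
        // High surrogate U+D800 followed by '9' instead of a low surrogate.
        CoderResult result = probe("\uD800" + "9,100$");
        if (result.isMalformed()) {
            System.out.println("malformed, length = " + result.length());
        } else {
            System.out.println(result);
        }
    }
}
```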
2.)
What about \uFFFE and \uFFFF?
They are listed as "Not a character" here:
http://www.decodeunicode.org/de/u+fffe
So IMHO they should be treated as *malformed*, but the current JDK's
encoders return *unmappable*.
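
The distinction can be checked with a probe like the one below. It is a sketch under assumptions: the names are hypothetical, and ISO-8859-1 is picked only as an example of a charset that cannot represent these code points. Which encoders were examined is not stated here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class NoncharacterProbe {
    // Returns the CoderResult of encoding the input with a fresh encoder
    // for the given charset, using the default (reporting) error actions.
    static CoderResult probe(Charset charset, CharSequence input) {
        ByteBuffer out = ByteBuffer.allocate(4 * input.length() + 4);
        return charset.newEncoder().encode(CharBuffer.wrap(input), out, true);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "\uFFFE", "\uFFFF" }) {
            CoderResult r = probe(StandardCharsets.ISO_8859_1, s);
            System.out.println("malformed=" + r.isMalformed()
                    + " unmappable=" + r.isUnmappable());
        }
    }
}
```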
-Ulf
On 10.09.2008 21:55, Martin Buchholz wrote:
> There is another reason, aside from our Beloved Compatibility,
> to prefer returning length == 1. It is likely that the calling code will
> delete the malformed chars and present the rest to a human.
> The second char *might* be valid, so why hide it?
>
> Martin
>
> On Wed, Sep 10, 2008 at 08:22, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
>
>> Hi Martin,
>>
>> thanks for the quick first answer.
>>
>> You are right, both chars could be corrupt.
>> IMO, returning CoderResult.malformedForLength(2) would be more
>> informative, and the software developer could decide for themselves
>> whether to take CoderResult.length() into account.
>> Why have this differentiation by length if nobody makes use of it?
>> Nothing else would cause CharsetEncoder to report a length other
>> than 1.
>>
>> So do you think it would be against the spec to return
>> CoderResult.malformedForLength(2) in such cases, even if
>> CoderResult.malformedForLength(1) isn't a bug?
>>
>> BTW:
>> The chance of erroneously receiving a high surrogate in the range
>> \uD800..\uDBFF is 1.56 % (1024 of 65536 values, assuming all are
>> equally likely).
>> The chance of erroneously receiving a char outside the range
>> \uDC00..\uDFFF after a correct high surrogate is 98.44 %
>> (64512 of 65536 values).
>>
>> -Ulf
>>
>>
>> On 09.09.2008 23:58, Martin Buchholz wrote:
>>
>>> I think when encountering a single high surrogate,
>>> it is correct to return a length of either 1 or 2.
>>> A thought experiment: a cosmic ray that mangled exactly one char
>>> could have caused this situation if the original sequence was
>>> of length either 1 or 2, depending on which char was mangled.
>>>
>>> Not a Defect.
>>>
>>> Martin
>>>
>>>
>>>
>>
>
>
>