Which CoderResult for malformed surrogate pairs?

Martin Buchholz martinrb at google.com
Mon Sep 15 21:36:54 UTC 2008


The fundamental problem is that there are many different ways
for an input sequence to be malformed, especially if you consider
something like transposition to be a single operation.

Because almost all characters in the real world are in the BMP,
the existence of an unpaired surrogate is prima facie evidence
that that particular char is, on its own, malformed,
and has been hit by the proverbial cosmic ray.
Also, returning MALFORMED(1) is the only symmetric solution.
If you return MALFORMED(2) for an unpaired surrogate,
there is no good reason to include the following char rather than
the preceding one, except for implementor convenience.
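
To make the MALFORMED(1) policy concrete, here is a minimal sketch.
SurrogateScan, scan and countMalformed are hypothetical names for
illustration, not JDK code; only CharBuffer, CoderResult and the Character
methods are real APIs. It assumes the end of the buffer is the end of the
input, so a trailing high surrogate counts as unpaired.

```java
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;

// Hypothetical helper, not part of the JDK: it only illustrates the
// MALFORMED(1) policy for unpaired surrogates.
final class SurrogateScan {

    // Advances `in` past well-formed input. On an unpaired surrogate it
    // leaves the position AT that char and reports a malformed sequence
    // of length 1. Assumption: the end of the buffer is the end of the
    // input, so a trailing high surrogate is treated as unpaired.
    static CoderResult scan(CharBuffer in) {
        while (in.hasRemaining()) {
            int start = in.position();
            char c = in.get();
            if (Character.isHighSurrogate(c)
                    && in.hasRemaining()
                    && Character.isLowSurrogate(in.get(in.position()))) {
                in.get(); // consume the low half of a well-formed pair
            } else if (Character.isSurrogate(c)) {
                in.position(start);
                return CoderResult.malformedForLength(1);
            }
        }
        return CoderResult.UNDERFLOW;
    }

    // Generic caller: skips exactly length() chars per error, so it needs
    // no knowledge of which lengths the scanner is able to report.
    static int countMalformed(String s) {
        CharBuffer in = CharBuffer.wrap(s);
        int errors = 0;
        for (CoderResult cr = scan(in); cr.isMalformed(); cr = scan(in)) {
            errors++;
            in.position(in.position() + cr.length());
        }
        return errors;
    }
}
```

With the input HIGH HIGH LOW ("\uD800\uD800\uDC00") this sketch reports one
error for the first char and then accepts the remaining well-formed pair.
A scanner that reported MALFORMED(2) at the same position would make the
same caller skip the high half of that pair as well.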

Martin

On Mon, Sep 15, 2008 at 12:58, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> Hi Martin,
>
> thanks for your effort in joining this "academic" discussion. :-)
>
> I'll try once more to interpret the javadoc of class CoderResult:
>
> A malformed-input error is reported when a sequence of input units is not
> well-formed. Such errors are described by instances of this class whose
> isMalformed method returns true and whose length method returns the length
> of the malformed sequence. There is one unique instance of this class for
> all malformed-input errors of a given length.
>
> A single HIGH_SURROGATE can't be judged malformed on its own. It can only be
> judged together with the following char, so the length of this malformed
> sequence is 2.
>
>
> On 13.09.2008 21:29, Martin Buchholz wrote:
>
> ... in any case, the calling code could ignore CoderResult.length() and
> simply skip one char whenever the result is *malformed*, as there are no
> other causes for length > 1.
>
>
> Only if the caller is intimately familiar with the implementation.
> If for example the input contains LOW_SURROGATE LOW_SURROGATE,
> then malformed(2) might be returned, and the caller should skip 2.
>
>
> No, in this case malformed(1) should be returned, as a single LOW_SURROGATE
> is always wrong, and the caller should skip 1 char twice.
> Only in the case of HIGH_SURROGATE + !LOW_SURROGATE should malformed(2) be
> returned, and the caller is in most cases not wrong to skip the 2 chars:
> the probability that a cosmic ray produces a HIGH_SURROGATE is 1.56 %,
> but the probability that it produces a !LOW_SURROGATE is 98.44 %.
>
> Any behavior change has to have a very good reason.
>
> Yes, this is a weighty argument.
> But how often would this behaviour change actually come into play? If a
> human is reading the erroneous text, the difference IMO shouldn't
> matter.
>
> 2.)
> What about \uFFFE and \uFFFF?
> Here they are denoted as "Not a character":
> http://www.decodeunicode.org/de/u+fffe
> So IMHO they should be treated as *malformed*, but the current JDK's encoders
> return *unmappable*.
>
>
> I disagree.  What if the target encoding has a "character"
> with exactly the same "not a character" semantics?
>
>
> Hm, maybe.
>
> E.g. if the target is itself an encoding of Unicode?
> Then the "character" would be mappable.
>
>
> AFAIK in UTF-16 \uFFFE is only valid as the first char of a sequence. If
> converting to UTF-16BE or UTF-16LE it should be skipped.
> When converting from any encoding to UTF-16, the \uFFFE should be prepended
> automatically.
>
> Anyways, this discussion is mostly academic.
> Convincing me personally might help your cause, ...
>
>
> Yes, that's the reason why I invested in this discussion.
>
> -Ulf
>
>
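
The 1.56 % and 98.44 % figures quoted above can be checked with a little
arithmetic, assuming the "cosmic ray" turns a char into a uniformly random
16-bit value (that model is an assumption read into the quoted numbers, and
SurrogateOdds is a hypothetical name). A rough sketch:

```java
// Hypothetical helper for checking the quoted percentages.
final class SurrogateOdds {

    // Number of 16-bit char values that are high surrogates (U+D800..U+DBFF).
    static int countHighSurrogates() {
        int n = 0;
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (Character.isHighSurrogate((char) c)) n++;
        }
        return n;
    }

    // Number of 16-bit char values that are NOT low surrogates.
    static int countNonLowSurrogates() {
        int n = 0;
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (!Character.isLowSurrogate((char) c)) n++;
        }
        return n;
    }

    // Share of all 65536 char values, as a percentage.
    static double percentOfAllChars(int count) {
        return 100.0 * count / 65536;
    }

    public static void main(String[] args) {
        System.out.println(percentOfAllChars(countHighSurrogates()));
        System.out.println(percentOfAllChars(countNonLowSurrogates()));
    }
}
```

There are 1024 high surrogates and 64512 values that are not low surrogates,
giving 1.5625 % and 98.4375 %, which round to the figures in the quoted text.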



More information about the core-libs-dev mailing list