Empty regexp replaceall and surrogate pairs results in corrupted utf16.
Xueming Shen
xueming.shen at oracle.com
Thu Jun 7 22:46:30 UTC 2012
Personally, I don't think it is a bug. A j.l.String represents a sequence
of UTF-16 chars. While a pair of surrogates represents a supplementary
character, a single surrogate is still a "legal" independent entity
inside a String object. The length of a String is still defined as the
total number of char units, and an index value between a high surrogate
and a low surrogate is still a legal index value that can be used to
access the char at that particular position.

Using an empty String "" as the regex for replaceAll() takes advantage of
the special meaning of "": it is interpreted as matching every possible
zero-width position of the target String. It does not imply anything
about the "character" or "characters" around it, so I would not interpret
it as a zero-width character boundary. Therefore a "position" between a
pair of surrogates is still a good "found" for replacing.
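
For callers who do want code point boundaries rather than char positions, a minimal sketch follows. It assumes the goal is to insert a separator around every code point without splitting a surrogate pair. The class name and the helper insertAtCodePointBoundaries are hypothetical and not part of the JDK; only String.codePointAt, String.codePointCount, Character.charCount and StringBuilder.appendCodePoint are existing API.

```java
public class SurrogateBoundaryDemo {

    // Hypothetical helper (not a JDK method): inserts `sep` before the first
    // code point, between code points, and after the last one. It advances by
    // Character.charCount(cp), so a valid surrogate pair is never split.
    public static String insertAtCodePointBoundaries(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);          // combines a valid surrogate pair
            sb.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp);       // 1 for BMP, 2 for supplementary
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";

        // length() counts char units; codePointCount() counts characters.
        System.out.println("chars:       " + s1.length());
        System.out.println("code points: " + s1.codePointCount(0, s1.length()));

        // The empty regex matches zero-width char positions, as discussed above.
        String s2 = s1.replaceAll("", "X");
        System.out.println("replaceAll result length: " + s2.length());

        // Code-point-aware alternative: the pair stays adjacent.
        String s3 = insertAtCodePointBoundaries(s1, "X");
        System.out.println("pair kept intact: " + s3.contains("\uD840\uDC00"));
    }
}
```

The helper walks the String by code point instead of by char index, which is the distinction the paragraphs above draw between a "position" and a character boundary.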
-Sherman
On 6/7/2012 1:07 PM, Dawid Weiss wrote:
> Hi, I'm a committer to the Apache Lucene project. We have randomized
> tests and one seed hit the following (simplified) scenario:
>
> String s1 = "AB\uD840\uDC00C";
> String s2 = s1.replaceAll("", "X");
>
> The input contains a supplementary Unicode character (any surrogate pair
> will do). The pattern is an empty string (in fact, it was randomized
> as "]|" but it's the same problem so I omit the details). The problem
> is that after applying this pattern, replaceAll inserts X between
> the two surrogates of the pair, and this results in invalid UTF-16:
>
> AB𠀀C
> XAXBX?X?XCX
>
> I believe this is a bug in the regexp implementation (sorry, don't
> have a patch for it) but I'd like to confirm it's not something known.
> Pointers appreciated.
>
> Dawid