Empty regexp replaceAll and surrogate pairs results in corrupted UTF-16.

Thu Jun 7 22:46:30 UTC 2012

Personally I don't think it is a bug. A j.l.String represents a sequence
of UTF-16 chars. While a pair of surrogates represents a supplementary
character, a single surrogate is still a "legal" independent entity
inside a String object. The length of a String is still defined as the
total number of char units, and an index between a high surrogate and a
low surrogate is still a legal index value that can be used to access
the char at that particular position.

Using an empty String "" as the regex for replaceAll() takes advantage
of the special meaning of "": it matches at every possible zero-width
position of the target String. It does not imply anything regarding the
"character" or "characters" around that position, so I would not
interpret it as a zero-width character boundary. Therefore a "position"
in between a pair of surrogates is still a good "found" for replacing.
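
If the intent is to insert text only at code point boundaries, one
possible workaround is to walk the String by code point instead of
relying on the empty regex. The following is only a sketch: the class
and method names (SurrogateSafeInsert, insertBetweenCodePoints,
hasUnpairedSurrogate) are made up for illustration and are not part of
the JDK. No expected output is given for the replaceAll() line, since
it depends on how the runtime in use treats the position inside the
pair.

```java
final class SurrogateSafeInsert {

    // Hypothetical helper: puts sep before every code point and once at
    // the end, never between the two halves of a surrogate pair.
    static String insertBetweenCodePoints(String s, String sep) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);         // reads a full pair if one starts at i
            out.append(sep).appendCodePoint(cp);
            i += Character.charCount(cp);      // advance by 1 or 2 char units
        }
        return out.append(sep).toString();
    }

    // Hypothetical helper: true if s contains a high surrogate not followed
    // by a low surrogate, or a low surrogate not preceded by a high one.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";

        // length() counts char units, codePointCount() counts characters.
        System.out.println(s1.length());                        // 5
        System.out.println(s1.codePointCount(0, s1.length()));  // 4

        // The empty regex works on zero-width positions, as described above.
        String viaRegex = s1.replaceAll("", "X");
        System.out.println("regex result has unpaired surrogate: "
                + hasUnpairedSurrogate(viaRegex));

        // The code point walk keeps the pair together.
        String viaCodePoints = insertBetweenCodePoints(s1, "X");
        System.out.println("code point result has unpaired surrogate: "
                + hasUnpairedSurrogate(viaCodePoints));
    }
}
```

The helper treats a lone surrogate in the input as a one-char code
point, which matches what String.codePointAt() returns for it, so
already malformed input passes through unchanged.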

-Sherman

On 6/7/2012 1:07 PM, Dawid Weiss wrote:
> Hi, I'm a committer to the Apache Lucene project. We have randomized
> tests and one seed hit the following (simplified) scenario:
>
>     String s1 = "AB\uD840\uDC00C";
>     String s2 = s1.replaceAll("", "X");
>
> the input contains an extended unicode character (any surrogate pair
> will do). The pattern is an empty string (in fact, it was randomized
> as "]|" but it's the same problem so I omit the details). The problem
> is that after applying this pattern, replaceAll inserts X in between
> the surrogate pair characters and this results in invalid UTF-16:
>
> AB𠀀C
> XAXBX?X?XCX
>
> I believe this is a bug in the regexp implementation (sorry, don't
> have a patch for it) but I'd like to confirm it's not something known.
> Pointers appreciated.
>
> Dawid