<i18n dev> Empty regexp replaceall and surrogate pairs results in corrupted utf16

Dawid Weiss dawid.weiss at gmail.com
Sun May 27 05:28:13 PDT 2012


Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and we hit the following (simplified) scenario:

    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

The input contains a supplementary Unicode character (any surrogate pair
will do). The pattern is an empty string (in fact, it was randomized
as "]|", but it is the same problem, so I omit the details). The problem
is that replaceAll matches the empty pattern between the two halves of
the surrogate pair and inserts X there, which results in invalid UTF-16:

AB𠀀C
XAXBX?X?XCX
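
For anyone who wants to reproduce this, here is a minimal sketch of how the
result can be checked. The class and method names (SurrogateCheck,
hasUnpairedSurrogate, insertAtCodePointBoundaries) are made up for this
illustration and are not part of the JDK or Lucene. The second helper only
shows the output I would have expected, built by walking code points instead
of chars; it is not a claim about how replaceAll should be implemented.

```java
// Sketch only: all names below are hypothetical, chosen for this illustration.
final class SurrogateCheck {

    /** True if s has a high surrogate without a following low one, or a stray low surrogate. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    /** Inserts sep at every code point boundary (both ends included), never inside a pair. */
    static String insertAtCodePointBoundaries(String s, String sep) {
        StringBuilder out = new StringBuilder(sep);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";

        // The scenario from this message; the outcome depends on the JDK in use.
        String s2 = s1.replaceAll("", "X");
        System.out.println("replaceAll result well-formed: " + !hasUnpairedSurrogate(s2));

        // Code-point-aware variant, for comparison.
        String s3 = insertAtCodePointBoundaries(s1, "X");
        System.out.println("code point walk well-formed:   " + !hasUnpairedSurrogate(s3));
    }
}
```

The check on s2 reports whether the JDK being run splits the pair; the
code point walk keeps the pair intact by construction.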

Is this a bug (and if so, where should I file it), or is it an
inherent feature of the current implementation? Thanks,

Dawid

