RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Fri Jul 17 23:36:00 UTC 2020
Hi,
Based on the suggestions, I modified the fix as follows:
https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
Changes from the initial revision are:
- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after
the change:
before:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 53.764 ? 2.811 ns/op
StringCompareToIgnoreCase.supLower avgt 25 24.211 ? 1.135 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ? 1.344 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 18.859 ? 1.499 ns/op
after:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 58.354 ? 4.603 ns/op
StringCompareToIgnoreCase.supLower avgt 25 57.975 ? 5.672 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ? 0.965 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 17.744 ? 0.272 ns/op
Here, "sup" means all supplementary characters, BMP otherwise. "lower"
means each character requires upper->lower casemap. "upperLower" means
all characters are the same, except the last character which requires
casemap.
I think the result is reasonable, considering surrogates check are now
mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
it did not seem to benefit here. In fact, the performance degraded
partly because I implemented the short cut, and possibly for the
overhead of extra checks.
Naoto
On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
> Hello,
>
> Please review the fix to the following issues:
>
> https://bugs.openjdk.java.net/browse/JDK-8248655
> https://bugs.openjdk.java.net/browse/JDK-8248434
>
> The proposed changeset and its CSR are located at:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8248664
>
> A bug was filed against SimpleDateFormat (8248434) where
> case-insensitive date format/parse failed in some of the new locales in
> JDK15. The root cause was that case-insensitive String.regionMatches()
> method did not work with supplementary characters. The problem is that
> the method's spec does not expect case mappings of supplementary
> characters, possibly because it was overlooked in the first place, JSR
> 204 - "Unicode Supplementary Character support". Similar behavior is
> observed in other two case-insensitive methods, i.e.,
> compareToIgnoreCase() and equalsIgnoreCase().
>
> The fix is straightforward to compare strings by code point basis,
> instead of code unit (16bit "char") basis. Technically this change will
> introduce a backward incompatibility, but I believe it is an
> incompatibility to wrong behavior, not true to the meaning of those
> methods' expectations.
>
> Naoto
More information about the core-libs-dev
mailing list