<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Sat Jul 18 03:03:01 UTC 2020

One option is to have a fast path that uses char functions, up to the point
where you hit a high surrogate, then drop into the slower codepoint path.
That saves time for the high-runner cases.

On the other hand, if the times are good enough, you might not need the
complication.

Mark

On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com> wrote:

> Hi,
>
> Based on the suggestions, I modified the fix as follows:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>
> Changes from the initial revision are:
>
> - Shared the implementation between compareToCI() and regionMatchesCI()
> - Enabled immediate short cut if two code points match.
> - Created a simple JMH benchmark. Here is the scores before and after
> the change:
>
> before:
> Benchmark                                Mode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499  ns/op
>
> after:
> Benchmark                                Mode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272  ns/op
>
> Here, "sup" means all supplementary characters, BMP otherwise. "lower"
> means each character requires upper->lower casemap. "upperLower" means
> all characters are the same, except the last character which requires
> casemap.
>
> I think the result is reasonable, considering surrogates check are now
> mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
> it did not seem to benefit here. In fact, the performance degraded
> partly because I implemented the short cut, and possibly for the
> overhead of extra checks.
>
> Naoto
>
> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
> > Hello,
> >
> > Please review the fix to the following issues:
> >
> > https://bugs.openjdk.java.net/browse/JDK-8248655
> > https://bugs.openjdk.java.net/browse/JDK-8248434
> >
> > The proposed changeset and its CSR are located at:
> >
> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> > https://bugs.openjdk.java.net/browse/JDK-8248664
> >
> > A bug was filed against SimpleDateFormat (8248434) where
> > case-insensitive date format/parse failed in some of the new locales in
> > JDK15. The root cause was that case-insensitive String.regionMatches()
> > method did not work with supplementary characters. The problem is that
> > the method's spec does not expect case mappings of supplementary
> > characters, possibly because it was overlooked in the first place, JSR
> > 204 - "Unicode Supplementary Character support". Similar behavior is
> > observed in other two case-insensitive methods, i.e.,
> > compareToIgnoreCase() and equalsIgnoreCase().
> >
> > The fix is straightforward to compare strings by code point basis,
> > instead of code unit (16bit "char") basis. Technically this change will
> > introduce a backward incompatibility, but I believe it is an
> > incompatibility to wrong behavior, not true to the meaning of those
> > methods' expectations.
> >
> > Naoto
>