<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
Joe Wang
huizhe.wang at oracle.com
Mon Jul 20 18:20:36 UTC 2020
Hi Naoto,
StringUTF16: line 384 - 388 seem unnecessary since you'd only get there
if 389:isHighSurrogate is not true. But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead of
adding a new method.
Comparing to the base benchmark:
StringCompareToIgnoreCase.lower 8.5%
StringCompareToIgnoreCase.supLower 139%
StringCompareToIgnoreCase.supUpperLower -21.8%
StringCompareToIgnoreCase.upperLower avgt -5.9%
"lower" was 8.5% slower, if such test exists in the specJVM, it would be
considered a regression. I would suggest you run the specJVM. I agree
with you on surrogate check being a requirement, thus supLower being
139% slower is ok since it won't otherwise be correct anyways. But after
introducing additional operations supUpperLower and upperLower ran
faster? That may indicate irregularity in the tests. Maybe we should
consider running tests with short, long and very long sample strings to
see if we can reduce the noise level and also see how it fares for a
longer string. I assume the machine you're running the test on was isolated.
Regards,
Joe
On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
> Hi Mark,
>
> Thank you for your comments.
>
> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>> One option is to have a fast path that uses char functions, up to the
>> point where you hit a high surrogate, then drop into the slower
>> codepoint path. That saves time for the high-runner cases.
>>
>> On the other hand, if the times are good enough, you might not need
>> the complication.
>
> The implementation is dealing with bare byte arrays. Only methods that
> it uses from Character class are toLowerCase(int) and toUpperCase(int)
> (sans surrogate check, which is needed at each iteration anyways), and
> their "char" equivalents are merely casting (char) to the int result.
> So it might not be so beneficial to differentiate char and int paths.
>
> Having said that, I found that there was an unnecessary surrogate
> check (always checks high *and* low surrogate on each iteration), so I
> revised the fix (added line 380-383 in StringUTF16.java):
>
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>
> Naoto
>
>>
>> Mark
>> //////
>>
>>
>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com
>> <mailto:naoto.sato at oracle.com>> wrote:
>>
>> Hi,
>>
>> Based on the suggestions, I modified the fix as follows:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>
>> Changes from the initial revision are:
>>
>> - Shared the implementation between compareToCI() and
>> regionMatchesCI()
>> - Enabled immediate short cut if two code points match.
>> - Created a simple JMH benchmark. Here is the scores before and
>> after
>> the change:
>>
>> before:
>> Benchmark Mode Cnt Score Error
>> Units
>> StringCompareToIgnoreCase.lower avgt 25 53.764 ?
>> 2.811 ns/op
>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
>> 1.135 ns/op
>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
>> 1.344 ns/op
>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
>> 1.499 ns/op
>>
>> after:
>> Benchmark Mode Cnt Score Error
>> Units
>> StringCompareToIgnoreCase.lower avgt 25 58.354 ?
>> 4.603 ns/op
>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
>> 5.672 ns/op
>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
>> 0.965 ns/op
>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
>> 0.272 ns/op
>>
>> Here, "sup" means all supplementary characters, BMP otherwise.
>> "lower"
>> means each character requires upper->lower casemap. "upperLower"
>> means
>> all characters are the same, except the last character which
>> requires
>> casemap.
>>
>> I think the result is reasonable, considering surrogates check
>> are now
>> mandatory. I have tried Roger's suggestion to use
>> Arrays.mismatch() but
>> it did not seem to benefit here. In fact, the performance degraded
>> partly because I implemented the short cut, and possibly for the
>> overhead of extra checks.
>>
>> Naoto
>>
>> On 7/15/20 9:00 AM, naoto.sato at oracle.com
>> <mailto:naoto.sato at oracle.com> wrote:
>> > Hello,
>> >
>> > Please review the fix to the following issues:
>> >
>> > https://bugs.openjdk.java.net/browse/JDK-8248655
>> > https://bugs.openjdk.java.net/browse/JDK-8248434
>> >
>> > The proposed changeset and its CSR are located at:
>> >
>> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>> > https://bugs.openjdk.java.net/browse/JDK-8248664
>> >
>> > A bug was filed against SimpleDateFormat (8248434) where
>> > case-insensitive date format/parse failed in some of the new
>> locales in
>> > JDK15. The root cause was that case-insensitive
>> String.regionMatches()
>> > method did not work with supplementary characters. The problem is
>> that
>> > the method's spec does not expect case mappings of supplementary
>> > characters, possibly because it was overlooked in the first
>> place, JSR
>> > 204 - "Unicode Supplementary Character support". Similar
>> behavior is
>> > observed in other two case-insensitive methods, i.e.,
>> > compareToIgnoreCase() and equalsIgnoreCase().
>> >
>> > The fix is straightforward to compare strings by code point
>> basis,
>> > instead of code unit (16bit "char") basis. Technically this
>> change will
>> > introduce a backward incompatibility, but I believe it is an
>> > incompatibility to wrong behavior, not true to the meaning of
>> those
>> > methods' expectations.
>> >
>> > Naoto
>>
More information about the core-libs-dev
mailing list