<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Mon Jul 20 22:20:18 UTC 2020

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
> Hi Joe,
> 
> Thank you for your comments.
> 
> On 7/20/20 11:20 AM, Joe Wang wrote:
>> Hi Naoto,
>>
>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
>> there if 389:isHighSurrogate is not true. 
> 
> Good point.
> 
> But more importantly, StringUTF16
>> has existing method "codePointAt" you may want to consider instead of 
>> adding a new method.
> 
> If we call codePointAt/Before, it would call an extra getChar(). Since 
> we know one codepoint as an input, I would avoid the extra calls.
> 
>>
>> Comparing to the base benchmark:
>> StringCompareToIgnoreCase.lower          8.5%
>> StringCompareToIgnoreCase.supLower      139%
>> StringCompareToIgnoreCase.supUpperLower  -21.8%
>> StringCompareToIgnoreCase.upperLower     avgt   -5.9%
>>
>>
>> "lower" was 8.5% slower, if such test exists in the specJVM, it would 
>> be considered a regression. I would suggest you run the specJVM. I 
>> agree with you on surrogate check being a requirement, thus supLower 
>> being 139% slower is ok since it won't otherwise be correct anyways.
> 
> Yes, it would be a regression if SPECjvm produces 8+% degradation, but 
> the test suite is aimed at the entire application performance. But for 
> this one, it is a micro benchmark for relatively rarely issued methods 
> (I would think normal cases fall into Latin1 equivalents), I would 
> consider it is acceptable.
> 
>> But after introducing additional operations supUpperLower and 
>> upperLower ran faster? That may indicate irregularity in the tests. 
>> Maybe we should consider running tests with short, long and very long 
>> sample strings to see if we can reduce the noise level and also see 
>> how it fares for a longer string. I assume the machine you're running 
>> the test on was isolated.
> 
> This result pretty much depends on the data it is testing for. As I 
> wrote in the previous email, (sup)UpperLower tests use the data that are 
> almost identical, but one last character is case insensitively equal. So 
> in these cases, the new short cut works really well and not call 
> toLower/UpperCase() at all for most of the characters. Thus the new 
> results are faster. Again the test result is very dependent on the input 
> data, Unless the result showed 100% slower or something (except supLower 
> case), I think it is OK.
> 
> Anyways, here is the updated webrev addressing your first suggestion:
> 
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
> 
> Naoto
> 
>>
>> Regards,
>> Joe
>>
>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>> Hi Mark,
>>>
>>> Thank you for your comments.
>>>
>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>> One option is to have a fast path that uses char functions, up to 
>>>> the point where you hit a high surrogate, then drop into the slower 
>>>> codepoint path. That saves time for the high-runner cases.
>>>>
>>>> On the other hand, if the times are good enough, you might not need 
>>>> the complication.
>>>
>>> The implementation is dealing with bare byte arrays. Only methods 
>>> that it uses from Character class are toLowerCase(int) and 
>>> toUpperCase(int) (sans surrogate check, which is needed at each 
>>> iteration anyways), and their "char" equivalents are merely casting 
>>> (char) to the int result. So it might not be so beneficial to 
>>> differentiate char and int paths.
>>>
>>> Having said that, I found that there was an unnecessary surrogate 
>>> check (always checks high *and* low surrogate on each iteration), so 
>>> I revised the fix (added line 380-383 in StringUTF16.java):
>>>
>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>
>>> Naoto
>>>
>>>>
>>>> Mark
>>>> //////
>>>>
>>>>
>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com 
>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     Based on the suggestions, I modified the fix as follows:
>>>>
>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>
>>>>     Changes from the initial revision are:
>>>>
>>>>     - Shared the implementation between compareToCI() and 
>>>> regionMatchesCI()
>>>>     - Enabled immediate short cut if two code points match.
>>>>     - Created a simple JMH benchmark. Here is the scores before and 
>>>> after
>>>>     the change:
>>>>
>>>>     before:
>>>>     Benchmark                                Mode  Cnt   Score 
>>>>  Error     Units
>>>>     StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 
>>>> 2.811     ns/op
>>>>     StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 
>>>> 1.135     ns/op
>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 
>>>> 1.344     ns/op
>>>>     StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 
>>>> 1.499     ns/op
>>>>
>>>>     after:
>>>>     Benchmark                                Mode  Cnt   Score 
>>>>  Error     Units
>>>>     StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 
>>>> 4.603     ns/op
>>>>     StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 
>>>> 5.672     ns/op
>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 
>>>> 0.965     ns/op
>>>>     StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 
>>>> 0.272     ns/op
>>>>
>>>>     Here, "sup" means all supplementary characters, BMP otherwise. 
>>>> "lower"
>>>>     means each character requires upper->lower casemap. "upperLower" 
>>>> means
>>>>     all characters are the same, except the last character which 
>>>> requires
>>>>     casemap.
>>>>
>>>>     I think the result is reasonable, considering surrogates check 
>>>> are now
>>>>     mandatory. I have tried Roger's suggestion to use 
>>>> Arrays.mismatch() but
>>>>     it did not seem to benefit here. In fact, the performance degraded
>>>>     partly because I implemented the short cut, and possibly for the
>>>>     overhead of extra checks.
>>>>
>>>>     Naoto
>>>>
>>>>     On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>     <mailto:naoto.sato at oracle.com> wrote:
>>>>      > Hello,
>>>>      >
>>>>      > Please review the fix to the following issues:
>>>>      >
>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>      >
>>>>      > The proposed changeset and its CSR are located at:
>>>>      >
>>>>      > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>      >
>>>>      > A bug was filed against SimpleDateFormat (8248434) where
>>>>      > case-insensitive date format/parse failed in some of the new
>>>>     locales in
>>>>      > JDK15. The root cause was that case-insensitive
>>>>     String.regionMatches()
>>>>      > method did not work with supplementary characters. The 
>>>> problem is
>>>>     that
>>>>      > the method's spec does not expect case mappings of supplementary
>>>>      > characters, possibly because it was overlooked in the first
>>>>     place, JSR
>>>>      > 204 - "Unicode Supplementary Character support". Similar 
>>>> behavior is
>>>>      > observed in other two case-insensitive methods, i.e.,
>>>>      > compareToIgnoreCase() and equalsIgnoreCase().
>>>>      >
>>>>      > The fix is straightforward to compare strings by code point 
>>>> basis,
>>>>      > instead of code unit (16bit "char") basis. Technically this
>>>>     change will
>>>>      > introduce a backward incompatibility, but I believe it is an
>>>>      > incompatibility to wrong behavior, not true to the meaning of 
>>>> those
>>>>      > methods' expectations.
>>>>      >
>>>>      > Naoto
>>>>
>>