<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Tue Jul 21 14:50:38 UTC 2020

Hi Naoto,

StringUTF16.java:

343, 348: The masking and comparisons seem awkward.
I'd suggest just negating the value and testing for < 0 to flag the less 
usual case.
If you continue with the flag bit, define your own static final constant 
for that bit;
It looks odd to be using Integer.MIN_VALUE with bit operators.

Performance wise, I'm a bit cautious about all the if's and branches;

And only calling getSupplementaryCodePoint if cp was a suggorate might 
have some
benefit, but its hard to tell what will be inlined and when.

Thanks, Roger

On 7/20/20 6:20 PM, naoto.sato at oracle.com wrote:
> Small correction in the updated part:
>
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>
> Naoto
>
> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>> Hi Joe,
>>
>> Thank you for your comments.
>>
>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>> Hi Naoto,
>>>
>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
>>> there if 389:isHighSurrogate is not true. 
>>
>> Good point.
>>
>> But more importantly, StringUTF16
>>> has existing method "codePointAt" you may want to consider instead 
>>> of adding a new method.
>>
>> If we call codePointAt/Before, it would call an extra getChar(). 
>> Since we know one codepoint as an input, I would avoid the extra calls.
>>
>>>
>>> Comparing to the base benchmark:
>>> StringCompareToIgnoreCase.lower          8.5%
>>> StringCompareToIgnoreCase.supLower      139%
>>> StringCompareToIgnoreCase.supUpperLower  -21.8%
>>> StringCompareToIgnoreCase.upperLower     avgt   -5.9%
>>>
>>>
>>> "lower" was 8.5% slower, if such test exists in the specJVM, it 
>>> would be considered a regression. I would suggest you run the 
>>> specJVM. I agree with you on surrogate check being a requirement, 
>>> thus supLower being 139% slower is ok since it won't otherwise be 
>>> correct anyways.
>>
>> Yes, it would be a regression if SPECjvm produces 8+% degradation, 
>> but the test suite is aimed at the entire application performance. 
>> But for this one, it is a micro benchmark for relatively rarely 
>> issued methods (I would think normal cases fall into Latin1 
>> equivalents), I would consider it is acceptable.
>>
>>> But after introducing additional operations supUpperLower and 
>>> upperLower ran faster? That may indicate irregularity in the tests. 
>>> Maybe we should consider running tests with short, long and very 
>>> long sample strings to see if we can reduce the noise level and also 
>>> see how it fares for a longer string. I assume the machine you're 
>>> running the test on was isolated.
>>
>> This result pretty much depends on the data it is testing for. As I 
>> wrote in the previous email, (sup)UpperLower tests use the data that 
>> are almost identical, but one last character is case insensitively 
>> equal. So in these cases, the new short cut works really well and not 
>> call toLower/UpperCase() at all for most of the characters. Thus the 
>> new results are faster. Again the test result is very dependent on 
>> the input data, Unless the result showed 100% slower or something 
>> (except supLower case), I think it is OK.
>>
>> Anyways, here is the updated webrev addressing your first suggestion:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>
>> Naoto
>>
>>>
>>> Regards,
>>> Joe
>>>
>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>> Hi Mark,
>>>>
>>>> Thank you for your comments.
>>>>
>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>> One option is to have a fast path that uses char functions, up to 
>>>>> the point where you hit a high surrogate, then drop into the 
>>>>> slower codepoint path. That saves time for the high-runner cases.
>>>>>
>>>>> On the other hand, if the times are good enough, you might not 
>>>>> need the complication.
>>>>
>>>> The implementation is dealing with bare byte arrays. Only methods 
>>>> that it uses from Character class are toLowerCase(int) and 
>>>> toUpperCase(int) (sans surrogate check, which is needed at each 
>>>> iteration anyways), and their "char" equivalents are merely casting 
>>>> (char) to the int result. So it might not be so beneficial to 
>>>> differentiate char and int paths.
>>>>
>>>> Having said that, I found that there was an unnecessary surrogate 
>>>> check (always checks high *and* low surrogate on each iteration), 
>>>> so I revised the fix (added line 380-383 in StringUTF16.java):
>>>>
>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>
>>>> Naoto
>>>>
>>>>>
>>>>> Mark
>>>>> //////
>>>>>
>>>>>
>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com 
>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>
>>>>>     Hi,
>>>>>
>>>>>     Based on the suggestions, I modified the fix as follows:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>
>>>>>     Changes from the initial revision are:
>>>>>
>>>>>     - Shared the implementation between compareToCI() and 
>>>>> regionMatchesCI()
>>>>>     - Enabled immediate short cut if two code points match.
>>>>>     - Created a simple JMH benchmark. Here is the scores before 
>>>>> and after
>>>>>     the change:
>>>>>
>>>>>     before:
>>>>>     Benchmark                                Mode  Cnt  Score 
>>>>>  Error     Units
>>>>>     StringCompareToIgnoreCase.lower          avgt   25 53.764 ? 
>>>>> 2.811     ns/op
>>>>>     StringCompareToIgnoreCase.supLower       avgt   25 24.211 ? 
>>>>> 1.135     ns/op
>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25 30.595 ? 
>>>>> 1.344     ns/op
>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25 18.859 ? 
>>>>> 1.499     ns/op
>>>>>
>>>>>     after:
>>>>>     Benchmark                                Mode  Cnt  Score 
>>>>>  Error     Units
>>>>>     StringCompareToIgnoreCase.lower          avgt   25 58.354 ? 
>>>>> 4.603     ns/op
>>>>>     StringCompareToIgnoreCase.supLower       avgt   25 57.975 ? 
>>>>> 5.672     ns/op
>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25 23.912 ? 
>>>>> 0.965     ns/op
>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25 17.744 ? 
>>>>> 0.272     ns/op
>>>>>
>>>>>     Here, "sup" means all supplementary characters, BMP otherwise. 
>>>>> "lower"
>>>>>     means each character requires upper->lower casemap. 
>>>>> "upperLower" means
>>>>>     all characters are the same, except the last character which 
>>>>> requires
>>>>>     casemap.
>>>>>
>>>>>     I think the result is reasonable, considering surrogates check 
>>>>> are now
>>>>>     mandatory. I have tried Roger's suggestion to use 
>>>>> Arrays.mismatch() but
>>>>>     it did not seem to benefit here. In fact, the performance 
>>>>> degraded
>>>>>     partly because I implemented the short cut, and possibly for the
>>>>>     overhead of extra checks.
>>>>>
>>>>>     Naoto
>>>>>
>>>>>     On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>>     <mailto:naoto.sato at oracle.com> wrote:
>>>>>      > Hello,
>>>>>      >
>>>>>      > Please review the fix to the following issues:
>>>>>      >
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>      >
>>>>>      > The proposed changeset and its CSR are located at:
>>>>>      >
>>>>>      > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>      >
>>>>>      > A bug was filed against SimpleDateFormat (8248434) where
>>>>>      > case-insensitive date format/parse failed in some of the new
>>>>>     locales in
>>>>>      > JDK15. The root cause was that case-insensitive
>>>>>     String.regionMatches()
>>>>>      > method did not work with supplementary characters. The 
>>>>> problem is
>>>>>     that
>>>>>      > the method's spec does not expect case mappings of 
>>>>> supplementary
>>>>>      > characters, possibly because it was overlooked in the first
>>>>>     place, JSR
>>>>>      > 204 - "Unicode Supplementary Character support". Similar 
>>>>> behavior is
>>>>>      > observed in other two case-insensitive methods, i.e.,
>>>>>      > compareToIgnoreCase() and equalsIgnoreCase().
>>>>>      >
>>>>>      > The fix is straightforward to compare strings by code point 
>>>>> basis,
>>>>>      > instead of code unit (16bit "char") basis. Technically this
>>>>>     change will
>>>>>      > introduce a backward incompatibility, but I believe it is an
>>>>>      > incompatibility to wrong behavior, not true to the meaning 
>>>>> of those
>>>>>      > methods' expectations.
>>>>>      >
>>>>>      > Naoto
>>>>>
>>>