<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Tue Jul 21 21:56:01 UTC 2020

Hi, Naoto

I have a few comments:

src/java.base/share/classes/java/lang/StringUTF16.java

379     private static int getSupplementaryCodePoint(byte[] ba, int cp, 
int index, int start, int end)

I think it would be worth a small addition to the comment to reflect 
that non-surrogate chars are returned as-is.

--

I thought about the scenario of an unpaired low or high surrogate at the 
beginning or end of the string, respectively:

384         if (Character.isLowSurrogate((char)cp)) {
385             if (index > start) {
                 ...
391         } else if (index + 1 < end) { // cp == high surrogate
392             char c = getChar(ba, index + 1);
                 ...
397         return cp;

It looks like the cp itself would be returned from 
getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's 
converted using Character.to[Upper|Lower]Case(int), which will also 
return the cp itself.  I imagine that's the best we could do, so seems fine.

Is there a test case for unmatched surrogates at the beginning and end 
of the string ? Should there be ?

--

I see there are no changes to StringLatin1.regionMatchesCI_UTF16().  I 
presume there are no cases in which toUpperCase(toLowerCase()) of a 
supplementary character could yield a Latin-1 character, yes?

Also, thanks for adding the benchmark!

-Brent

On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
> Small correction in the updated part:
> 
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
> 
> Naoto
> 
> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>> Hi Joe,
>>
>> Thank you for your comments.
>>
>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>> Hi Naoto,
>>>
>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
>>> there if 389:isHighSurrogate is not true. 
>>
>> Good point.
>>
>> But more importantly, StringUTF16
>>> has existing method "codePointAt" you may want to consider instead of 
>>> adding a new method.
>>
>> If we call codePointAt/Before, it would call an extra getChar(). Since 
>> we know one codepoint as an input, I would avoid the extra calls.
>>
>>>
>>> Comparing to the base benchmark:
>>> StringCompareToIgnoreCase.lower          8.5%
>>> StringCompareToIgnoreCase.supLower      139%
>>> StringCompareToIgnoreCase.supUpperLower  -21.8%
>>> StringCompareToIgnoreCase.upperLower     avgt   -5.9%
>>>
>>>
>>> "lower" was 8.5% slower, if such test exists in the specJVM, it would 
>>> be considered a regression. I would suggest you run the specJVM. I 
>>> agree with you on surrogate check being a requirement, thus supLower 
>>> being 139% slower is ok since it won't otherwise be correct anyways.
>>
>> Yes, it would be a regression if SPECjvm produces 8+% degradation, but 
>> the test suite is aimed at the entire application performance. But for 
>> this one, it is a micro benchmark for relatively rarely issued methods 
>> (I would think normal cases fall into Latin1 equivalents), I would 
>> consider it is acceptable.
>>
>>> But after introducing additional operations supUpperLower and 
>>> upperLower ran faster? That may indicate irregularity in the tests. 
>>> Maybe we should consider running tests with short, long and very long 
>>> sample strings to see if we can reduce the noise level and also see 
>>> how it fares for a longer string. I assume the machine you're running 
>>> the test on was isolated.
>>
>> This result pretty much depends on the data it is testing for. As I 
>> wrote in the previous email, (sup)UpperLower tests use the data that 
>> are almost identical, but one last character is case insensitively 
>> equal. So in these cases, the new short cut works really well and not 
>> call toLower/UpperCase() at all for most of the characters. Thus the 
>> new results are faster. Again the test result is very dependent on the 
>> input data, Unless the result showed 100% slower or something (except 
>> supLower case), I think it is OK.
>>
>> Anyways, here is the updated webrev addressing your first suggestion:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>
>> Naoto
>>
>>>
>>> Regards,
>>> Joe
>>>
>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>> Hi Mark,
>>>>
>>>> Thank you for your comments.
>>>>
>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>> One option is to have a fast path that uses char functions, up to 
>>>>> the point where you hit a high surrogate, then drop into the slower 
>>>>> codepoint path. That saves time for the high-runner cases.
>>>>>
>>>>> On the other hand, if the times are good enough, you might not need 
>>>>> the complication.
>>>>
>>>> The implementation is dealing with bare byte arrays. Only methods 
>>>> that it uses from Character class are toLowerCase(int) and 
>>>> toUpperCase(int) (sans surrogate check, which is needed at each 
>>>> iteration anyways), and their "char" equivalents are merely casting 
>>>> (char) to the int result. So it might not be so beneficial to 
>>>> differentiate char and int paths.
>>>>
>>>> Having said that, I found that there was an unnecessary surrogate 
>>>> check (always checks high *and* low surrogate on each iteration), so 
>>>> I revised the fix (added line 380-383 in StringUTF16.java):
>>>>
>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>
>>>> Naoto
>>>>
>>>>>
>>>>> Mark
>>>>> //////
>>>>>
>>>>>
>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com 
>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>
>>>>>     Hi,
>>>>>
>>>>>     Based on the suggestions, I modified the fix as follows:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>
>>>>>     Changes from the initial revision are:
>>>>>
>>>>>     - Shared the implementation between compareToCI() and 
>>>>> regionMatchesCI()
>>>>>     - Enabled immediate short cut if two code points match.
>>>>>     - Created a simple JMH benchmark. Here is the scores before and 
>>>>> after
>>>>>     the change:
>>>>>
>>>>>     before:
>>>>>     Benchmark                                Mode  Cnt   Score 
>>>>>  Error     Units
>>>>>     StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 
>>>>> 2.811     ns/op
>>>>>     StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 
>>>>> 1.135     ns/op
>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 
>>>>> 1.344     ns/op
>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 
>>>>> 1.499     ns/op
>>>>>
>>>>>     after:
>>>>>     Benchmark                                Mode  Cnt   Score 
>>>>>  Error     Units
>>>>>     StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 
>>>>> 4.603     ns/op
>>>>>     StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 
>>>>> 5.672     ns/op
>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 
>>>>> 0.965     ns/op
>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 
>>>>> 0.272     ns/op
>>>>>
>>>>>     Here, "sup" means all supplementary characters, BMP otherwise. 
>>>>> "lower"
>>>>>     means each character requires upper->lower casemap. 
>>>>> "upperLower" means
>>>>>     all characters are the same, except the last character which 
>>>>> requires
>>>>>     casemap.
>>>>>
>>>>>     I think the result is reasonable, considering surrogates check 
>>>>> are now
>>>>>     mandatory. I have tried Roger's suggestion to use 
>>>>> Arrays.mismatch() but
>>>>>     it did not seem to benefit here. In fact, the performance degraded
>>>>>     partly because I implemented the short cut, and possibly for the
>>>>>     overhead of extra checks.
>>>>>
>>>>>     Naoto
>>>>>
>>>>>     On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>>     <mailto:naoto.sato at oracle.com> wrote:
>>>>>      > Hello,
>>>>>      >
>>>>>      > Please review the fix to the following issues:
>>>>>      >
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>      >
>>>>>      > The proposed changeset and its CSR are located at:
>>>>>      >
>>>>>      > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>      >
>>>>>      > A bug was filed against SimpleDateFormat (8248434) where
>>>>>      > case-insensitive date format/parse failed in some of the new
>>>>>     locales in
>>>>>      > JDK15. The root cause was that case-insensitive
>>>>>     String.regionMatches()
>>>>>      > method did not work with supplementary characters. The 
>>>>> problem is
>>>>>     that
>>>>>      > the method's spec does not expect case mappings of 
>>>>> supplementary
>>>>>      > characters, possibly because it was overlooked in the first
>>>>>     place, JSR
>>>>>      > 204 - "Unicode Supplementary Character support". Similar 
>>>>> behavior is
>>>>>      > observed in other two case-insensitive methods, i.e.,
>>>>>      > compareToIgnoreCase() and equalsIgnoreCase().
>>>>>      >
>>>>>      > The fix is straightforward to compare strings by code point 
>>>>> basis,
>>>>>      > instead of code unit (16bit "char") basis. Technically this
>>>>>     change will
>>>>>      > introduce a backward incompatibility, but I believe it is an
>>>>>      > incompatibility to wrong behavior, not true to the meaning 
>>>>> of those
>>>>>      > methods' expectations.
>>>>>      >
>>>>>      > Naoto
>>>>>
>>>