<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

naoto.sato at oracle.com naoto.sato at oracle.com
Tue Jul 21 22:26:43 UTC 2020


Hi Brent,

On 7/21/20 2:56 PM, Brent Christian wrote:
> Hi, Naoto
> 
> I have a few comments:
> 
> src/java.base/share/classes/java/lang/StringUTF16.java
> 
> 379     private static int getSupplementaryCodePoint(byte[] ba, int cp, 
> int index, int start, int end)
> 
> I think it would be worth a small addition to the comment to reflect 
> that non-surrogate chars are returned as-is.

Sure, I will add some more comments to the method.

> 
> -- 
> 
> I thought about the scenario of an unpaired low or high surrogate at the 
> beginning or end of the string, respectively:
> 
> 384         if (Character.isLowSurrogate((char)cp)) {
> 385             if (index > start) {
>                  ...
> 391         } else if (index + 1 < end) { // cp == high surrogate
> 392             char c = getChar(ba, index + 1);
>                  ...
> 397         return cp;
> 
> It looks like the cp itself would be returned from 
> getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's 
> converted using Character.to[Upper|Lower]Case(int), which will also 
> return the cp itself.  I imagine that's the best we could do, so seems 
> fine.

Yes, that is exactly what is intended.

> 
> Is there a test case for unmatched surrogates at the beginning and end 
> of the string ? Should there be ?

Interestingly, there has been a test case for supplementary characters 
before this change, where it intentionally begins from a low surrogate, 
and ends with a high surrogate, so that it would succeed in the previous 
*exact match* logic. Line 82 in RegionMatches.java tests:
"\uD801\uDC28\uD801\uDC29\uFF41a".regionMatches(true, 1, "\uDC28\uD801", 
0, 2) == true
And the proposed change is compatible with this test case.

> 
> -- 
> 
> I see there are no changes to StringLatin1.regionMatchesCI_UTF16().  I 
> presume there are no cases in which toUpperCase(toLowerCase()) of a 
> supplementary character could yield a Latin-1 character, yes?

Yes, that is correct.

Naoto

> 
> Also, thanks for adding the benchmark!
> 
> -Brent
> 
> On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
>> Small correction in the updated part:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>>
>> Naoto
>>
>> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>>> Hi Joe,
>>>
>>> Thank you for your comments.
>>>
>>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>>> Hi Naoto,
>>>>
>>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
>>>> there if 389:isHighSurrogate is not true. 
>>>
>>> Good point.
>>>
>>> But more importantly, StringUTF16
>>>> has existing method "codePointAt" you may want to consider instead 
>>>> of adding a new method.
>>>
>>> If we call codePointAt/Before, it would call an extra getChar(). 
>>> Since we know one codepoint as an input, I would avoid the extra calls.
>>>
>>>>
>>>> Comparing to the base benchmark:
>>>> StringCompareToIgnoreCase.lower          8.5%
>>>> StringCompareToIgnoreCase.supLower      139%
>>>> StringCompareToIgnoreCase.supUpperLower  -21.8%
>>>> StringCompareToIgnoreCase.upperLower     avgt   -5.9%
>>>>
>>>>
>>>> "lower" was 8.5% slower, if such test exists in the specJVM, it 
>>>> would be considered a regression. I would suggest you run the 
>>>> specJVM. I agree with you on surrogate check being a requirement, 
>>>> thus supLower being 139% slower is ok since it won't otherwise be 
>>>> correct anyways.
>>>
>>> Yes, it would be a regression if SPECjvm produces 8+% degradation, 
>>> but the test suite is aimed at the entire application performance. 
>>> But for this one, it is a micro benchmark for relatively rarely 
>>> issued methods (I would think normal cases fall into Latin1 
>>> equivalents), I would consider it is acceptable.
>>>
>>>> But after introducing additional operations supUpperLower and 
>>>> upperLower ran faster? That may indicate irregularity in the tests. 
>>>> Maybe we should consider running tests with short, long and very 
>>>> long sample strings to see if we can reduce the noise level and also 
>>>> see how it fares for a longer string. I assume the machine you're 
>>>> running the test on was isolated.
>>>
>>> This result pretty much depends on the data it is testing for. As I 
>>> wrote in the previous email, (sup)UpperLower tests use the data that 
>>> are almost identical, but one last character is case insensitively 
>>> equal. So in these cases, the new short cut works really well and not 
>>> call toLower/UpperCase() at all for most of the characters. Thus the 
>>> new results are faster. Again the test result is very dependent on 
>>> the input data, Unless the result showed 100% slower or something 
>>> (except supLower case), I think it is OK.
>>>
>>> Anyways, here is the updated webrev addressing your first suggestion:
>>>
>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>>
>>> Naoto
>>>
>>>>
>>>> Regards,
>>>> Joe
>>>>
>>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Thank you for your comments.
>>>>>
>>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>>> One option is to have a fast path that uses char functions, up to 
>>>>>> the point where you hit a high surrogate, then drop into the 
>>>>>> slower codepoint path. That saves time for the high-runner cases.
>>>>>>
>>>>>> On the other hand, if the times are good enough, you might not 
>>>>>> need the complication.
>>>>>
>>>>> The implementation is dealing with bare byte arrays. Only methods 
>>>>> that it uses from Character class are toLowerCase(int) and 
>>>>> toUpperCase(int) (sans surrogate check, which is needed at each 
>>>>> iteration anyways), and their "char" equivalents are merely casting 
>>>>> (char) to the int result. So it might not be so beneficial to 
>>>>> differentiate char and int paths.
>>>>>
>>>>> Having said that, I found that there was an unnecessary surrogate 
>>>>> check (always checks high *and* low surrogate on each iteration), 
>>>>> so I revised the fix (added line 380-383 in StringUTF16.java):
>>>>>
>>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>>
>>>>> Naoto
>>>>>
>>>>>>
>>>>>> Mark
>>>>>> //////
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com 
>>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>>
>>>>>>     Hi,
>>>>>>
>>>>>>     Based on the suggestions, I modified the fix as follows:
>>>>>>
>>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>>
>>>>>>     Changes from the initial revision are:
>>>>>>
>>>>>>     - Shared the implementation between compareToCI() and 
>>>>>> regionMatchesCI()
>>>>>>     - Enabled immediate short cut if two code points match.
>>>>>>     - Created a simple JMH benchmark. Here is the scores before 
>>>>>> and after
>>>>>>     the change:
>>>>>>
>>>>>>     before:
>>>>>>     Benchmark                                Mode  Cnt   Score 
>>>>>>  Error     Units
>>>>>>     StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 
>>>>>> 2.811     ns/op
>>>>>>     StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 
>>>>>> 1.135     ns/op
>>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 
>>>>>> 1.344     ns/op
>>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 
>>>>>> 1.499     ns/op
>>>>>>
>>>>>>     after:
>>>>>>     Benchmark                                Mode  Cnt   Score 
>>>>>>  Error     Units
>>>>>>     StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 
>>>>>> 4.603     ns/op
>>>>>>     StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 
>>>>>> 5.672     ns/op
>>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 
>>>>>> 0.965     ns/op
>>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 
>>>>>> 0.272     ns/op
>>>>>>
>>>>>>     Here, "sup" means all supplementary characters, BMP otherwise. 
>>>>>> "lower"
>>>>>>     means each character requires upper->lower casemap. 
>>>>>> "upperLower" means
>>>>>>     all characters are the same, except the last character which 
>>>>>> requires
>>>>>>     casemap.
>>>>>>
>>>>>>     I think the result is reasonable, considering surrogates check 
>>>>>> are now
>>>>>>     mandatory. I have tried Roger's suggestion to use 
>>>>>> Arrays.mismatch() but
>>>>>>     it did not seem to benefit here. In fact, the performance 
>>>>>> degraded
>>>>>>     partly because I implemented the short cut, and possibly for the
>>>>>>     overhead of extra checks.
>>>>>>
>>>>>>     Naoto
>>>>>>
>>>>>>     On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>>>     <mailto:naoto.sato at oracle.com> wrote:
>>>>>>      > Hello,
>>>>>>      >
>>>>>>      > Please review the fix to the following issues:
>>>>>>      >
>>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>>      >
>>>>>>      > The proposed changeset and its CSR are located at:
>>>>>>      >
>>>>>>      > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>>      >
>>>>>>      > A bug was filed against SimpleDateFormat (8248434) where
>>>>>>      > case-insensitive date format/parse failed in some of the new
>>>>>>     locales in
>>>>>>      > JDK15. The root cause was that case-insensitive
>>>>>>     String.regionMatches()
>>>>>>      > method did not work with supplementary characters. The 
>>>>>> problem is
>>>>>>     that
>>>>>>      > the method's spec does not expect case mappings of 
>>>>>> supplementary
>>>>>>      > characters, possibly because it was overlooked in the first
>>>>>>     place, JSR
>>>>>>      > 204 - "Unicode Supplementary Character support". Similar 
>>>>>> behavior is
>>>>>>      > observed in other two case-insensitive methods, i.e.,
>>>>>>      > compareToIgnoreCase() and equalsIgnoreCase().
>>>>>>      >
>>>>>>      > The fix is straightforward to compare strings by code point 
>>>>>> basis,
>>>>>>      > instead of code unit (16bit "char") basis. Technically this
>>>>>>     change will
>>>>>>      > introduce a backward incompatibility, but I believe it is an
>>>>>>      > incompatibility to wrong behavior, not true to the meaning 
>>>>>> of those
>>>>>>      > methods' expectations.
>>>>>>      >
>>>>>>      > Naoto
>>>>>>
>>>>


More information about the core-libs-dev mailing list