<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Tue Jul 21 22:26:43 UTC 2020
Hi Brent,
On 7/21/20 2:56 PM, Brent Christian wrote:
> Hi, Naoto
>
> I have a few comments:
>
> src/java.base/share/classes/java/lang/StringUTF16.java
>
> 379 private static int getSupplementaryCodePoint(byte[] ba, int cp,
> int index, int start, int end)
>
> I think it would be worth a small addition to the comment to reflect
> that non-surrogate chars are returned as-is.
Sure, I will add some more comments to the method.
>
> --
>
> I thought about the scenario of an unpaired low or high surrogate at the
> beginning or end of the string, respectively:
>
> 384 if (Character.isLowSurrogate((char)cp)) {
> 385 if (index > start) {
> ...
> 391 } else if (index + 1 < end) { // cp == high surrogate
> 392 char c = getChar(ba, index + 1);
> ...
> 397 return cp;
>
> It looks like the cp itself would be returned from
> getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's
> converted using Character.to[Upper|Lower]Case(int), which will also
> return the cp itself. I imagine that's the best we could do, so seems
> fine.
Yes, that is exactly what is intended.
>
> Is there a test case for unmatched surrogates at the beginning and end
> of the string ? Should there be ?
Interestingly, there has been a test case for supplementary characters
before this change, where it intentionally begins from a low surrogate,
and ends with a high surrogate, so that it would succeed in the previous
*exact match* logic. Line 82 in RegionMatches.java tests:
"\uD801\uDC28\uD801\uDC29\uFF41a".regionMatches(true, 1, "\uDC28\uD801",
0, 2) == true
And the proposed change is compatible with this test case.
>
> --
>
> I see there are no changes to StringLatin1.regionMatchesCI_UTF16(). I
> presume there are no cases in which toUpperCase(toLowerCase()) of a
> supplementary character could yield a Latin-1 character, yes?
Yes, that is correct.
Naoto
>
> Also, thanks for adding the benchmark!
>
> -Brent
>
> On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
>> Small correction in the updated part:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>>
>> Naoto
>>
>> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>>> Hi Joe,
>>>
>>> Thank you for your comments.
>>>
>>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>>> Hi Naoto,
>>>>
>>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get
>>>> there if 389:isHighSurrogate is not true.
>>>
>>> Good point.
>>>
>>> But more importantly, StringUTF16
>>>> has existing method "codePointAt" you may want to consider instead
>>>> of adding a new method.
>>>
>>> If we call codePointAt/Before, it would call an extra getChar().
>>> Since we know one codepoint as an input, I would avoid the extra calls.
>>>
>>>>
>>>> Comparing to the base benchmark:
>>>> StringCompareToIgnoreCase.lower 8.5%
>>>> StringCompareToIgnoreCase.supLower 139%
>>>> StringCompareToIgnoreCase.supUpperLower -21.8%
>>>> StringCompareToIgnoreCase.upperLower avgt -5.9%
>>>>
>>>>
>>>> "lower" was 8.5% slower, if such test exists in the specJVM, it
>>>> would be considered a regression. I would suggest you run the
>>>> specJVM. I agree with you on surrogate check being a requirement,
>>>> thus supLower being 139% slower is ok since it won't otherwise be
>>>> correct anyways.
>>>
>>> Yes, it would be a regression if SPECjvm produces 8+% degradation,
>>> but the test suite is aimed at the entire application performance.
>>> But for this one, it is a micro benchmark for relatively rarely
>>> issued methods (I would think normal cases fall into Latin1
>>> equivalents), I would consider it is acceptable.
>>>
>>>> But after introducing additional operations supUpperLower and
>>>> upperLower ran faster? That may indicate irregularity in the tests.
>>>> Maybe we should consider running tests with short, long and very
>>>> long sample strings to see if we can reduce the noise level and also
>>>> see how it fares for a longer string. I assume the machine you're
>>>> running the test on was isolated.
>>>
>>> This result pretty much depends on the data it is testing for. As I
>>> wrote in the previous email, (sup)UpperLower tests use the data that
>>> are almost identical, but one last character is case insensitively
>>> equal. So in these cases, the new short cut works really well and not
>>> call toLower/UpperCase() at all for most of the characters. Thus the
>>> new results are faster. Again the test result is very dependent on
>>> the input data, Unless the result showed 100% slower or something
>>> (except supLower case), I think it is OK.
>>>
>>> Anyways, here is the updated webrev addressing your first suggestion:
>>>
>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>>
>>> Naoto
>>>
>>>>
>>>> Regards,
>>>> Joe
>>>>
>>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Thank you for your comments.
>>>>>
>>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>>> One option is to have a fast path that uses char functions, up to
>>>>>> the point where you hit a high surrogate, then drop into the
>>>>>> slower codepoint path. That saves time for the high-runner cases.
>>>>>>
>>>>>> On the other hand, if the times are good enough, you might not
>>>>>> need the complication.
>>>>>
>>>>> The implementation is dealing with bare byte arrays. Only methods
>>>>> that it uses from Character class are toLowerCase(int) and
>>>>> toUpperCase(int) (sans surrogate check, which is needed at each
>>>>> iteration anyways), and their "char" equivalents are merely casting
>>>>> (char) to the int result. So it might not be so beneficial to
>>>>> differentiate char and int paths.
>>>>>
>>>>> Having said that, I found that there was an unnecessary surrogate
>>>>> check (always checks high *and* low surrogate on each iteration),
>>>>> so I revised the fix (added line 380-383 in StringUTF16.java):
>>>>>
>>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>>
>>>>> Naoto
>>>>>
>>>>>>
>>>>>> Mark
>>>>>> //////
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com
>>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Based on the suggestions, I modified the fix as follows:
>>>>>>
>>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>>
>>>>>> Changes from the initial revision are:
>>>>>>
>>>>>> - Shared the implementation between compareToCI() and
>>>>>> regionMatchesCI()
>>>>>> - Enabled immediate short cut if two code points match.
>>>>>> - Created a simple JMH benchmark. Here is the scores before
>>>>>> and after
>>>>>> the change:
>>>>>>
>>>>>> before:
>>>>>> Benchmark Mode Cnt Score
>>>>>> Error Units
>>>>>> StringCompareToIgnoreCase.lower avgt 25 53.764 ?
>>>>>> 2.811 ns/op
>>>>>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
>>>>>> 1.135 ns/op
>>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
>>>>>> 1.344 ns/op
>>>>>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
>>>>>> 1.499 ns/op
>>>>>>
>>>>>> after:
>>>>>> Benchmark Mode Cnt Score
>>>>>> Error Units
>>>>>> StringCompareToIgnoreCase.lower avgt 25 58.354 ?
>>>>>> 4.603 ns/op
>>>>>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
>>>>>> 5.672 ns/op
>>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
>>>>>> 0.965 ns/op
>>>>>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
>>>>>> 0.272 ns/op
>>>>>>
>>>>>> Here, "sup" means all supplementary characters, BMP otherwise.
>>>>>> "lower"
>>>>>> means each character requires upper->lower casemap.
>>>>>> "upperLower" means
>>>>>> all characters are the same, except the last character which
>>>>>> requires
>>>>>> casemap.
>>>>>>
>>>>>> I think the result is reasonable, considering surrogates check
>>>>>> are now
>>>>>> mandatory. I have tried Roger's suggestion to use
>>>>>> Arrays.mismatch() but
>>>>>> it did not seem to benefit here. In fact, the performance
>>>>>> degraded
>>>>>> partly because I implemented the short cut, and possibly for the
>>>>>> overhead of extra checks.
>>>>>>
>>>>>> Naoto
>>>>>>
>>>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > Please review the fix to the following issues:
>>>>>> >
>>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>> >
>>>>>> > The proposed changeset and its CSR are located at:
>>>>>> >
>>>>>> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>> >
>>>>>> > A bug was filed against SimpleDateFormat (8248434) where
>>>>>> > case-insensitive date format/parse failed in some of the new
>>>>>> locales in
>>>>>> > JDK15. The root cause was that case-insensitive
>>>>>> String.regionMatches()
>>>>>> > method did not work with supplementary characters. The
>>>>>> problem is
>>>>>> that
>>>>>> > the method's spec does not expect case mappings of
>>>>>> supplementary
>>>>>> > characters, possibly because it was overlooked in the first
>>>>>> place, JSR
>>>>>> > 204 - "Unicode Supplementary Character support". Similar
>>>>>> behavior is
>>>>>> > observed in other two case-insensitive methods, i.e.,
>>>>>> > compareToIgnoreCase() and equalsIgnoreCase().
>>>>>> >
>>>>>> > The fix is straightforward to compare strings by code point
>>>>>> basis,
>>>>>> > instead of code unit (16bit "char") basis. Technically this
>>>>>> change will
>>>>>> > introduce a backward incompatibility, but I believe it is an
>>>>>> > incompatibility to wrong behavior, not true to the meaning
>>>>>> of those
>>>>>> > methods' expectations.
>>>>>> >
>>>>>> > Naoto
>>>>>>
>>>>
More information about the i18n-dev
mailing list