<i18n dev> RFR: 8365675: Add String Unicode Case-Folding Support [v6]
Roger Riggs
rriggs at openjdk.org
Wed Oct 29 20:44:01 UTC 2025
On Wed, 29 Oct 2025 02:00:16 GMT, Xueming Shen <sherman at openjdk.org> wrote:
>> Experimenting with Arrays.mismatch at the beginning of the array iteration as
>>
>>
>>     int k = ArraysSupport.mismatch(value, other, lim);
>>     if (k < 0)
>>         return len - olen;
>>     for (; k < lim; k++) {
>>         ....
>>     }
>>
>>
>> The benchmark results suggest that it helps 'dramatically' when the compared strings share the same prefix, for example in the "UpperLower" test cases (which share the same upper-case text prefix). However, it is also relatively expensive, with roughly 20% overhead when the strings do not share the same text but are case-insensitively equal. I suggest we leave it out for now?
>>
>> ### With Arrays.mismatch
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> StringCompareToFoldCase.asciiLower           avgt   15  15.044 ± 0.751  ns/op
>> StringCompareToFoldCase.asciiLowerEQ         avgt   15  10.033 ± 0.366  ns/op
>> StringCompareToFoldCase.asciiLowerEQFC       avgt   15  12.094 ± 0.288  ns/op
>> StringCompareToFoldCase.asciiLowerFC         avgt   15  12.513 ± 0.290  ns/op
>> StringCompareToFoldCase.asciiUpperLower      avgt   15  11.716 ± 0.471  ns/op
>> StringCompareToFoldCase.asciiUpperLowerEQ    avgt   15  11.120 ± 0.458  ns/op
>> StringCompareToFoldCase.asciiUpperLowerEQFC  avgt   15   7.544 ± 0.103  ns/op
>> StringCompareToFoldCase.asciiUpperLowerFC    avgt   15   7.384 ± 0.167  ns/op
>> StringCompareToFoldCase.asciiWithDFFC        avgt   15  54.949 ± 1.260  ns/op
>> StringCompareToFoldCase.greekLower           avgt   15  39.492 ± 0.124  ns/op
>> StringCompareToFoldCase.greekLowerEQ         avgt   15  39.266 ± 0.071  ns/op
>> StringCompareToFoldCase.greekLowerEQFC       avgt   15  28.049 ± 0.292  ns/op
>> StringCompareToFoldCase.greekLowerFC         avgt   15  28.272 ± 0.115  ns/op
>> StringCompareToFoldCase.greekUpperLower      avgt   15   7.103 ± 0.052  ns/op
>> StringCompareToFoldCase.greekUpperLowerEQ    avgt   15   7.439 ± 0.079  ns/op
>> StringCompareToFoldCase.greekUpperLowerEQFC  avgt   15   2.716 ± 0.138  ns/op
>> StringCompareToFoldCase.greekUpperLowerFC    avgt   15   2.628 ± 0.051  ns/op
>> StringCompareToFoldCase.latin1UTF16          avgt   15  23.147 ± 0.094  ns/op
>> StringCompareToFoldCase.latin1UTF16EQ        avgt   15  22.626 ± 0.081  ns/op
>> StringCompareToFoldCase.latin1UTF16EQFC      avgt   15  38.453 ± 0.697  ns/op
>> StringCompareToFoldCase.latin1UTF16FC        avgt   15  38.464 ± 0.441  ns/op
>> StringCompareToFoldCase....
>
> ### Long: packing 1:M-count + 1-3 folding codepoints
>
> https://cr.openjdk.org/~sherman/casefolding_long/
>
> The performance is slightly better, but not as good as I had expected. Extracting the code points from the long looks a little clumsy, but the logic flows smoothly. It needs more work. Opinions?
>
>
> Benchmark                                    Mode  Cnt   Score   Error  Units
> StringCompareToFoldCase.asciiLower           avgt   15  15.487 ± 0.298  ns/op
> StringCompareToFoldCase.asciiLowerEQ         avgt   15  10.005 ± 0.368  ns/op
> StringCompareToFoldCase.asciiLowerEQFC       avgt   15  10.755 ± 0.160  ns/op
> StringCompareToFoldCase.asciiLowerFC         avgt   15  10.349 ± 0.155  ns/op
> StringCompareToFoldCase.asciiUpperLower      avgt   15  12.188 ± 0.278  ns/op
> StringCompareToFoldCase.asciiUpperLowerEQ    avgt   15  10.901 ± 0.551  ns/op
> StringCompareToFoldCase.asciiUpperLowerEQFC  avgt   15   9.218 ± 0.165  ns/op
> StringCompareToFoldCase.asciiUpperLowerFC    avgt   15   9.335 ± 0.404  ns/op
> StringCompareToFoldCase.asciiWithDFFC        avgt   15  37.010 ± 0.518  ns/op
> StringCompareToFoldCase.greekLower           avgt   15  39.572 ± 0.098  ns/op
> StringCompareToFoldCase.greekLowerEQ         avgt   15  39.317 ± 0.104  ns/op
> StringCompareToFoldCase.greekLowerEQFC       avgt   15  20.428 ± 0.243  ns/op
> StringCompareToFoldCase.greekLowerFC         avgt   15  19.623 ± 0.141  ns/op
> StringCompareToFoldCase.greekUpperLower      avgt   15   7.105 ± 0.048  ns/op
> StringCompareToFoldCase.greekUpperLowerEQ    avgt   15   7.462 ± 0.092  ns/op
> StringCompareToFoldCase.greekUpperLowerEQFC  avgt   15   6.518 ± 0.128  ns/op
> StringCompareToFoldCase.greekUpperLowerFC    avgt   15   6.593 ± 0.240  ns/op
> StringCompareToFoldCase.latin1UTF16          avgt   15  23.130 ± 0.152  ns/op
> StringCompareToFoldCase.latin1UTF16EQ        avgt   15  22.606 ± 0.089  ns/op
> StringCompareToFoldCase.latin1UTF16EQFC      avgt   15  29.574 ± 0.348  ns/op
> StringCompareToFoldCase.latin1UTF16FC        avgt   15  29.691 ± 0.445  ns/op
> StringCompareToFoldCase.supLower             avgt   15  55.027 ± 0.676  ns/op
> StringCompareToFoldCase.supLowerEQ           avgt   15  55.784 ± 0.368  ns/op
> StringCompareToFoldCase.supLowerEQFC         avgt   15  24.984 ± 0.157  ns/op
> StringCompareToFoldCase.supLowerFC           avgt   15  24.865 ± 0.139  ns/op
> StringCompareToFoldCase.supUpperLower        avgt   15  14.538 ± 0.144  ns/op
> StringCompareToFoldCase.supUpperLowerEQ      avgt   15  14.728 ± 0.206  ns/op
> StringCompareT...
> Experimenting with Arrays.mismatch at the beginning of the array iteration as
> ...
> The benchmark results suggest that it helps 'dramatically' when the compared strings share the same prefix, for example in the "UpperLower" test cases (which share the same upper-case text prefix). However, it is also relatively expensive, with roughly 20% overhead when the strings do not share the same text but are case-insensitively equal. I suggest we leave it out for now?
OK to leave it out for now. In similar contexts where System.arraycopy or Arrays.mismatch has some overhead, I've suggested a simple size check (like `size < 8`) to skip the call when the strings or byte arrays are short.
Thanks for checking.
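
For illustration only, here is a rough sketch of what such a guarded prefix skip could look like. It is not the code in the PR: the class name, the `fold` helper (ASCII-only, standing in for the real case-folding lookup) and the `MISMATCH_THRESHOLD` constant are all assumptions, and it uses the public `java.util.Arrays.mismatch` instead of the internal `ArraysSupport.mismatch`. The cutoff of 8 just mirrors the `size < 8` example and would need to be benchmarked.

```java
import java.util.Arrays;

public class PrefixSkipSketch {
    // Hypothetical cutoff mirroring the `size < 8` example; tune by benchmarking.
    static final int MISMATCH_THRESHOLD = 8;

    // Simplified ASCII-only fold, a stand-in for the real Unicode case-folding lookup.
    static int fold(byte b) {
        int c = b & 0xff;
        return (c >= 'A' && c <= 'Z') ? c + 0x20 : c;
    }

    // Case-insensitive compare of two Latin-1 byte arrays.
    static int compareFoldCase(byte[] value, byte[] other) {
        int len = value.length;
        int olen = other.length;
        int lim = Math.min(len, olen);
        int k = 0;
        // Only pay for the mismatch call when the common range is long enough.
        if (lim >= MISMATCH_THRESHOLD) {
            k = Arrays.mismatch(value, 0, lim, other, 0, lim);
            if (k < 0) {
                // The first lim bytes are identical; the shorter array sorts first.
                return len - olen;
            }
        }
        // Fold and compare from the first differing byte (or from 0 for short inputs).
        for (; k < lim; k++) {
            int c1 = fold(value[k]);
            int c2 = fold(other[k]);
            if (c1 != c2) {
                return c1 - c2;
            }
        }
        return len - olen;
    }
}
```

With this shape, short inputs go straight to the folding loop, while longer inputs with a common prefix skip it via the mismatch call.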
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2475377160