<i18n dev> RFR: 8365675: Add String Unicode Case-Folding Support [v6]
Roger Riggs
rriggs at openjdk.org
Wed Oct 29 21:09:33 UTC 2025
On Wed, 29 Oct 2025 20:41:06 GMT, Roger Riggs <rriggs at openjdk.org> wrote:
>> ### Long: packing 1:M-count + 1-3 folding codepoints
>>
>> https://cr.openjdk.org/~sherman/casefolding_long/
>>
>> The performance is slightly better, but not as good as I would have expected. The access to the codepoint from the long looks a little clumsy, but the logic looks smooth. Needs more work. Opinions?
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> StringCompareToFoldCase.asciiLower           avgt   15  15.487 ± 0.298  ns/op
>> StringCompareToFoldCase.asciiLowerEQ         avgt   15  10.005 ± 0.368  ns/op
>> StringCompareToFoldCase.asciiLowerEQFC       avgt   15  10.755 ± 0.160  ns/op
>> StringCompareToFoldCase.asciiLowerFC         avgt   15  10.349 ± 0.155  ns/op
>> StringCompareToFoldCase.asciiUpperLower      avgt   15  12.188 ± 0.278  ns/op
>> StringCompareToFoldCase.asciiUpperLowerEQ    avgt   15  10.901 ± 0.551  ns/op
>> StringCompareToFoldCase.asciiUpperLowerEQFC  avgt   15   9.218 ± 0.165  ns/op
>> StringCompareToFoldCase.asciiUpperLowerFC    avgt   15   9.335 ± 0.404  ns/op
>> StringCompareToFoldCase.asciiWithDFFC        avgt   15  37.010 ± 0.518  ns/op
>> StringCompareToFoldCase.greekLower           avgt   15  39.572 ± 0.098  ns/op
>> StringCompareToFoldCase.greekLowerEQ         avgt   15  39.317 ± 0.104  ns/op
>> StringCompareToFoldCase.greekLowerEQFC       avgt   15  20.428 ± 0.243  ns/op
>> StringCompareToFoldCase.greekLowerFC         avgt   15  19.623 ± 0.141  ns/op
>> StringCompareToFoldCase.greekUpperLower      avgt   15   7.105 ± 0.048  ns/op
>> StringCompareToFoldCase.greekUpperLowerEQ    avgt   15   7.462 ± 0.092  ns/op
>> StringCompareToFoldCase.greekUpperLowerEQFC  avgt   15   6.518 ± 0.128  ns/op
>> StringCompareToFoldCase.greekUpperLowerFC    avgt   15   6.593 ± 0.240  ns/op
>> StringCompareToFoldCase.latin1UTF16          avgt   15  23.130 ± 0.152  ns/op
>> StringCompareToFoldCase.latin1UTF16EQ        avgt   15  22.606 ± 0.089  ns/op
>> StringCompareToFoldCase.latin1UTF16EQFC      avgt   15  29.574 ± 0.348  ns/op
>> StringCompareToFoldCase.latin1UTF16FC        avgt   15  29.691 ± 0.445  ns/op
>> StringCompareToFoldCase.supLower             avgt   15  55.027 ± 0.676  ns/op
>> StringCompareToFoldCase.supLowerEQ           avgt   15  55.784 ± 0.368  ns/op
>> StringCompareToFoldCase.supLowerEQFC         avgt   15  24.984 ± 0.157  ns/op
>> StringCompareToFoldCase.supLowerFC           avgt   15  24.865 ± 0.139  ns/op
>> StringCompareToFoldCase.supUpperLower        avgt   15  14.538 ± 0.144  ns/op
>> StringCompareToFoldCas...
>
>> Experimenting with Arrays.mismatch at the beginning of the array iteration as
>> ...
>> The benchmark results suggest that it helps dramatically when the compared strings share the same prefix, for example the "UpperLower" test cases (which share the same upper-case text prefix). However, it is also relatively expensive, with roughly 20% overhead when the strings do not share the same text but are case-insensitively equal. I would suggest we leave it out for now.
>
>> ```
> Ok to leave it out for now. In similar contexts where System.arraycopy or Arrays.mismatch has some overhead I've suggested doing a simple check (like `size < 8`) to avoid the overhead when the strings/byte arrays are short.
> Thanks for checking.
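
A minimal sketch of the short-input guard described above, for illustration only. The class name `PrefixSkip`, the constant `MISMATCH_THRESHOLD`, and the ASCII-only folding helper are hypothetical stand-ins and are not taken from the PR; only `Arrays.mismatch` is a real JDK call. The idea is to skip the shared prefix with `Arrays.mismatch` only when the input is long enough for the call overhead to pay off.

```java
import java.util.Arrays;

final class PrefixSkip {
    // Hypothetical cut-off; the thread only mentions `size < 8` as an example.
    static final int MISMATCH_THRESHOLD = 8;

    /** Returns how many leading bytes can be skipped because they are identical. */
    static int commonPrefix(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        if (len < MISMATCH_THRESHOLD) {
            return 0; // short input: let the per-character loop handle everything
        }
        int k = Arrays.mismatch(a, 0, len, b, 0, len);
        return k < 0 ? len : k; // -1 means the two ranges are identical
    }

    /** ASCII-only stand-in for the real folding comparison (an assumption). */
    static boolean equalsIgnoreAsciiCase(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = commonPrefix(a, b); i < a.length; i++) {
            if (toLowerAscii(a[i]) != toLowerAscii(b[i])) {
                return false;
            }
        }
        return true;
    }

    static int toLowerAscii(byte c) {
        return (c >= 'A' && c <= 'Z') ? c + 32 : c;
    }
}
```

Whether 8 is the right threshold would have to be measured; the benchmark numbers quoted above suggest the trade-off depends heavily on how often the inputs share an exact prefix.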
> The performance is slightly better, but not as good as I would have expected. The access to the codepoint from the long looks a little clumsy, but the logic looks smooth. Needs more work. Opinions?
It does look cleaner without the array indexing in the loops.
Can the counting of characters (fcnt1, fcnt2) be eliminated by encoding three 20-bit characters into the long and then checking `f1 != 0` to indicate there are more characters? It's a bit of an odd mix of 16-bit characters vs. a single 20-bit char. Are there any 20-bit chars from or to folded replacements in the folding mappings?
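
To make the suggestion concrete, here is a rough sketch of count-free packing. It is not the code in the webrev; the class `PackedFolding` and its methods are hypothetical names. It uses 21-bit fields rather than 20 so that any code point up to U+10FFFF fits (3 × 21 = 63 bits, which still fits in a long), and it assumes no folding ever produces U+0000, since zero is what terminates the loop.

```java
import java.util.Arrays;

final class PackedFolding {
    static final int BITS = 21;                  // enough for U+10FFFF
    static final long MASK = (1L << BITS) - 1;

    /** Packs 1 to 3 non-zero code points, first code point in the low bits. */
    static long pack(int... cps) {
        if (cps.length < 1 || cps.length > 3) {
            throw new IllegalArgumentException("expected 1 to 3 code points");
        }
        long f = 0;
        for (int i = cps.length - 1; i >= 0; i--) {
            if (cps[i] <= 0 || cps[i] > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException("bad code point");
            }
            f = (f << BITS) | cps[i];
        }
        return f;
    }

    /** Unpacks by shifting until nothing is left; no separate count is stored. */
    static int[] unpack(long f) {
        int[] out = new int[3];
        int n = 0;
        while (f != 0) {                         // the `f != 0` check replaces the counter
            out[n++] = (int) (f & MASK);
            f >>>= BITS;
        }
        return Arrays.copyOf(out, n);
    }
}
```

For example, `pack(0x73, 0x73)` would represent the full folding of U+00DF to "ss", and the comparison loop would consume one code point per iteration with `f & MASK` followed by `f >>>= BITS`.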
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2475481372