RFR: 8365675: Add String Unicode Case-Folding Support [v3]
Xueming Shen
sherman at openjdk.org
Sat Oct 18 08:54:01 UTC 2025
On Tue, 7 Oct 2025 22:10:50 GMT, Roger Riggs <rriggs at openjdk.org> wrote:
>> Xueming Shen has updated the pull request incrementally with one additional commit since the last revision:
>>
>> performance update
>
> src/java.base/share/classes/java/lang/StringLatin1.java line 194:
>
>> 192: char[] folded2 = null;
>> 193: int k1 = 0, k2 = 0, fk1 = 0, fk2 = 0;
>> 194: while ((k1 < len1 || folded1 != null && fk1 < folded1.length) &&
>
> Many suggestions come to mind here on the algorithm, to optimize performance.
> For example, many strings will have identical prefixes. Using Arrays.mismatch could quickly skip over the identical prefix.
> Consider using code points (or a long, packing 4 chars) for the folded replacements, to avoid having to step through chars in char arrays. CaseFolding.foldIfDefined could return the full expansion as a long.
> It may be profitable to use Arrays.mismatch again after expanded characters are determined to be equal.
>
> Take another look at the data structure storing and doing the lookup of foldIfDefined both to increase the lookup performance.
With the following changes
- Use int[] / code point storage instead of char[] / char for folding data.
- Add a fast path for Latin-1 vs. Latin-1 comparisons.
- Speed up the UTF-16 path as well.
The only remaining “slow” path is Latin-1 vs. UTF-16 for now.
Would appreciate some eagle eyes on the new fast-path implementation. The tests suggest it’s working as expected :-)
**StringCompareToFoldCase.latin1UTF16 avgt 15 23.959 ± 2.357 ns/op
StringCompareToFoldCase.latin1UTF16EQ avgt 15 22.455 ± 0.104 ns/op
StringCompareToFoldCase.latin1UTF16EQFC avgt 15 32.782 ± 0.962 ns/op
StringCompareToFoldCase.latin1UTF16FC avgt 15 32.581 ± 0.725 ns/op**
Benchmark Mode Cnt Score Error Units
StringCompareToFoldCase.asciiLower avgt 15 17.983 ± 0.208 ns/op
StringCompareToFoldCase.asciiLowerEQ avgt 15 9.986 ± 0.254 ns/op
StringCompareToFoldCase.asciiLowerEQFC avgt 15 10.781 ± 0.161 ns/op
StringCompareToFoldCase.asciiLowerFC avgt 15 10.274 ± 0.136 ns/op
StringCompareToFoldCase.asciiUpperLower avgt 15 12.465 ± 0.409 ns/op
StringCompareToFoldCase.asciiUpperLowerEQ avgt 15 10.961 ± 0.407 ns/op
StringCompareToFoldCase.asciiUpperLowerEQFC avgt 15 9.157 ± 0.166 ns/op
StringCompareToFoldCase.asciiUpperLowerFC avgt 15 9.228 ± 0.254 ns/op
StringCompareToFoldCase.asciiWithDFFC avgt 15 57.603 ± 1.540 ns/op
StringCompareToFoldCase.greekLower avgt 15 39.593 ± 0.128 ns/op
StringCompareToFoldCase.greekLowerEQ avgt 15 39.249 ± 0.032 ns/op
StringCompareToFoldCase.greekLowerEQFC avgt 15 13.993 ± 0.346 ns/op
StringCompareToFoldCase.greekLowerFC avgt 15 14.030 ± 0.454 ns/op
StringCompareToFoldCase.greekUpperLower avgt 15 7.137 ± 0.130 ns/op
StringCompareToFoldCase.greekUpperLowerEQ avgt 15 7.496 ± 0.104 ns/op
StringCompareToFoldCase.greekUpperLowerEQFC avgt 15 6.203 ± 0.316 ns/op
StringCompareToFoldCase.greekUpperLowerFC avgt 15 6.235 ± 0.256 ns/op
StringCompareToFoldCase.latin1UTF16 avgt 15 23.959 ± 2.357 ns/op
StringCompareToFoldCase.latin1UTF16EQ avgt 15 22.455 ± 0.104 ns/op
StringCompareToFoldCase.latin1UTF16EQFC avgt 15 32.782 ± 0.962 ns/op
StringCompareToFoldCase.latin1UTF16FC avgt 15 32.581 ± 0.725 ns/op
StringCompareToFoldCase.supLower avgt 15 54.918 ± 0.809 ns/op
StringCompareToFoldCase.supLowerEQ avgt 15 55.080 ± 0.511 ns/op
StringCompareToFoldCase.supLowerEQFC avgt 15 14.085 ± 0.996 ns/op
StringCompareToFoldCase.supLowerFC avgt 15 13.907 ± 0.612 ns/op
StringCompareToFoldCase.supUpperLower avgt 15 14.479 ± 0.175 ns/op
StringCompareToFoldCase.supUpperLowerEQ avgt 15 14.645 ± 0.059 ns/op
StringCompareToFoldCase.supUpperLowerEQFC avgt 15 9.200 ± 0.310 ns/op
StringCompareToFoldCase.supUpperLowerFC avgt 15 9.041 ± 0.248 ns/op
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2441849469
More information about the build-dev
mailing list