<i18n dev> RFR: 8365675: Add String Unicode Case-Folding Support
Roger Riggs
rriggs at openjdk.org
Tue Oct 7 22:25:08 UTC 2025
On Fri, 3 Oct 2025 19:56:22 GMT, Xueming Shen <sherman at openjdk.org> wrote:
> ### Summary
>
> Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.
>
> Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:
>
> **String.equalsIgnoreCase(String)**
>
> - Unicode-aware, locale-independent.
> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
> - Limited: does not support 1:M mapping defined in Unicode case folding.
>
> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>
> - Locale-independent, single code point only.
> - No support for 1:M mappings.
>
> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>
> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
> - Intended primarily for presentation/display, not structural case-insensitive matching.
> - Requires converting the full string before comparison, which is less efficient.
>
> **1:M mapping example, U+00DF (ß)**
>
> - "ß".toUpperCase(Locale.ROOT) → "SS"
> - Case folding produces "ss", matching Unicode caseless comparison rules.
>
>
> jshell> "\u00df".equalsIgnoreCase("ss")
> $22 ==> false
>
> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
> $24 ==> true
>
>
> ### Motivation & Direction
>
> Add Unicode standard-compliant case-less comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.
>
> - Unicode-compliant **full** case folding.
> - Simpler, stable and more efficient case-less matching without workarounds.
> - Brings Java's string comparison handling in line with other programming languages/libraries.
>
> This PR proposes to introduce the following comparison methods in the `String` class:
>
> - boolean equalsFoldCase(String anotherString)
> - int compareToFoldCase(String anotherString)
> - Comparator<String> UNICODE_CASEFOLD_ORDER
>
> These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
>
> *Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
> However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.*
>
> ### The New API
>
>
> /**
> * Compares thi...
The API looks good.
Is the performance comparable to equalsIgnoreCase?
src/java.base/share/classes/java/lang/StringLatin1.java line 194:
> 192: char[] folded2 = null;
> 193: int k1 = 0, k2 = 0, fk1 = 0, fk2 = 0;
> 194: while ((k1 < len1 || folded1 != null && fk1 < folded1.length) &&
Several suggestions on the algorithm come to mind here, all aimed at performance.
For example, many strings will have identical prefixes. Using Arrays.mismatch could quickly skip over the identical prefix.
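As a rough sketch of the prefix-skipping idea (not the code in the PR), something along these lines could run before the folding loop. The class and method names are made up for illustration, and the Latin-1 byte arrays stand in for the strings' internal storage. Skipping is safe because identical input chars always fold identically.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the JDK or of the PR under review.
final class PrefixSkip {
    /**
     * Returns the number of leading bytes that are identical in both arrays.
     * Arrays.mismatch returns -1 when the arrays are equal, and the length of
     * the shorter array when one is a proper prefix of the other.
     */
    static int commonPrefixLength(byte[] a, byte[] b) {
        int i = Arrays.mismatch(a, b);
        return i < 0 ? a.length : i;
    }

    public static void main(String[] args) {
        byte[] s1 = "Stra\u00dfe".getBytes(StandardCharsets.ISO_8859_1);
        byte[] s2 = "Strasse".getBytes(StandardCharsets.ISO_8859_1);
        // The folding-aware comparison only needs to start at this index.
        System.out.println(commonPrefixLength(s1, s2));
    }
}
```

The slower per-char folding loop would then start at the returned index instead of at 0.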
Consider using code points (or a long, packing 4 chars) for the folded replacements, to avoid having to step through chars in char arrays. CaseFolding.foldIfDefined could return the full expansion as a long.
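To make the packing idea concrete, here is a minimal sketch. It assumes a folding never exceeds four UTF-16 chars and never contains U+0000, so a zero lane can mean "unused". `PackedFolding` and its methods are hypothetical names, not existing JDK API.

```java
// Hypothetical sketch: a folding of up to 4 chars packed into one long,
// first char in the lowest 16 bits, unused lanes left as zero.
final class PackedFolding {
    static long pack(char... chars) {
        if (chars.length > 4) {
            throw new IllegalArgumentException("at most 4 chars fit in a long");
        }
        long packed = 0;
        for (int i = 0; i < chars.length; i++) {
            packed |= ((long) chars[i]) << (16 * i);
        }
        return packed;
    }

    /** Number of occupied 16-bit lanes, derived from the highest set bit. */
    static int length(long packed) {
        return (64 - Long.numberOfLeadingZeros(packed) + 15) >>> 4;
    }

    /** The i-th char of the packed folding (0-based). */
    static char charAt(long packed, int i) {
        return (char) (packed >>> (16 * i));
    }

    public static void main(String[] args) {
        long ss = pack('s', 's');            // full folding of U+00DF
        System.out.println(Long.toHexString(ss));
        System.out.println(length(ss));
    }
}
```

With this layout, two expansions are equal exactly when their longs are equal, so the comparison needs no char-array stepping.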
It may be profitable to use Arrays.mismatch again after expanded characters are determined to be equal.
Take another look at the data structure behind foldIfDefined, both how it stores the foldings and how it looks them up, to increase lookup performance.
src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 230:
> 228: private static class CaseFoldingEntry {
> 229: final int cp;
> 230: final char[] folding;
Consider storing the folding as an int or long directly to avoid the overhead of small char arrays.
Arrange it so that the whole replacement can be compared directly with another code point.
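One possible shape for this, purely as a sketch: replace the per-entry objects with two parallel primitive arrays and look entries up with a binary search. The three entries are real full case foldings from CaseFolding.txt, but the table is a tiny illustrative subset. `FoldingTable`, `packedFoldingOf`, and the -1 "not found" sentinel are assumptions, not the PR's design.

```java
import java.util.Arrays;

// Hypothetical sketch of a primitive-array folding table.
final class FoldingTable {
    // Code points with a multi-char full case folding, in ascending order.
    private static final int[] KEYS = { 0x00DF, 0x0130, 0xFB00 };

    // Parallel to KEYS: folding packed into 16-bit lanes, first char lowest.
    private static final long[] FOLDS = {
        0x0073_0073L,   // U+00DF -> U+0073 U+0073 ("ss")
        0x0307_0069L,   // U+0130 -> U+0069 U+0307
        0x0066_0066L,   // U+FB00 -> U+0066 U+0066 ("ff")
    };

    /** Returns the packed folding of cp, or -1 if the table has no entry. */
    static long packedFoldingOf(int cp) {
        int i = Arrays.binarySearch(KEYS, cp);
        return i >= 0 ? FOLDS[i] : -1L;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(packedFoldingOf(0x00DF)));
        System.out.println(packedFoldingOf('a'));
    }
}
```

A one-char folding packed this way is numerically equal to the BMP code point itself, so it compares against an unfolded code point with a plain `==`.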
src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 280:
> 278: }
> 279:
> 280: private void add(CaseFoldingEntry entry) {
CDS can map whole objects/data structures into the heap; consider how to structure this data so that it can be mapped rather than recomputed at each startup.
-------------
PR Review: https://git.openjdk.org/jdk/pull/27628#pullrequestreview-3312084027
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2412043131
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2412060747
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2412062604