RFR: 8365675: Add String Unicode Case-Folding Support [v2]
Bernd
duke at openjdk.org
Wed Oct 8 16:35:05 UTC 2025
On Wed, 8 Oct 2025 00:33:20 GMT, Xueming Shen <sherman at openjdk.org> wrote:
>> ### Summary
>>
>> Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.
>>
>> Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers now rely on methods such as:
>>
>> **String.equalsIgnoreCase(String)**
>>
>> - Unicode-aware, locale-independent.
>> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
>> - Limited: does not support 1:M mapping defined in Unicode case folding.
>>
>> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>>
>> - Locale-independent, single code point only.
>> - No support for 1:M mappings.
>>
>> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>>
>> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
>> - Intended primarily for presentation/display, not structural case-insensitive matching.
>> - Requires full string conversion before comparison, which is less efficient and not intended for structural matching.
>>
>> **1:M mapping example, U+00DF (ß)**
>>
>> - "ß".toUpperCase(Locale.ROOT) → "SS"
>> - Case folding produces "ss", matching Unicode caseless comparison rules.
>>
>>
>> jshell> "\u00df".equalsIgnoreCase("ss")
>> $22 ==> false
>>
>> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
>> $24 ==> true
>>
>>
>> ### Motivation & Direction
>>
>> Add Unicode standard-compliant case-less comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.
>>
>> - Unicode-compliant **full** case folding.
>> - Simpler, stable and more efficient case-less matching without workarounds.
>> - Brings Java's string comparison handling in line with other programming languages/libraries.
>>
>> This PR proposes to introduce the following comparison methods in the `String` class:
>>
>> - boolean equalsFoldCase(String anotherString)
>> - int compareToFoldCase(String anotherString)
>> - Comparator<String> UNICODE_CASEFOLD_ORDER
>>
>> These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
>>
>> *Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
>> However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then pass...
>
> Xueming Shen has updated the pull request incrementally with one additional commit since the last revision:
>
> minor api doc updates
Great progress, thanks. Did you also consider case-folding variants of startsWith/contains? I already miss case-ignoring variants of those today. Or maybe provide an API for implementing them on top of the case-folded intermediate buffers? If the API footprint on String gets too big, maybe as a CaseFoldString.contains() helper?
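
To make the question concrete, here is a minimal sketch of how such helpers can be approximated with existing APIs. It assumes the toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT) round trip from the quoted description as a stand-in for real case folding. The class CaseFoldSketch and the methods approxFold, startsWithFoldCase and containsFoldCase are hypothetical names for illustration only; they are not part of the PR.

```java
import java.util.Locale;

public class CaseFoldSketch {

    // Approximation only: upper-case then lower-case with Locale.ROOT.
    // This is NOT the Unicode CaseFolding.txt mapping the PR implements,
    // and it allocates two intermediate strings per call.
    public static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: prefix test on the approximately folded forms.
    public static boolean startsWithFoldCase(String text, String prefix) {
        return approxFold(text).startsWith(approxFold(prefix));
    }

    // Hypothetical helper: substring test on the approximately folded forms.
    public static boolean containsFoldCase(String text, String needle) {
        return approxFold(text).contains(approxFold(needle));
    }

    public static void main(String[] args) {
        // U+00DF (sharp s) expands to "SS" on upper-casing, so both sides end up as "strasse".
        System.out.println(containsFoldCase("Die STRASSE", "stra\u00dfe"));   // true
        System.out.println(startsWithFoldCase("Stra\u00dfe 12", "STRASSE"));  // true
        // Plain lower-casing keeps U+00DF as a single character, so this does not match.
        System.out.println("Stra\u00dfe".toLowerCase(Locale.ROOT).contains("strasse")); // false
    }
}
```

The sketch converts both strings in full on every call, which is the inefficiency the quoted description mentions. Matching on folded text can also begin or end in the middle of a 1:M expansion (for example, "s" is a prefix of the folded form of "\u00df"), so a real API would have to decide how to treat such boundaries.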
-------------
PR Comment: https://git.openjdk.org/jdk/pull/27628#issuecomment-3382351349