From sherman at openjdk.org Fri Oct 3 19:10:20 2025 From: sherman at openjdk.org (Xueming Shen) Date: Fri, 3 Oct 2025 19:10:20 GMT Subject: RFR: 8365675: Add String Unicode Case-Folding Support Message-ID: ### Summary Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions. Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers now rely on methods such as: **String.equalsIgnoreCase(String)** - Unicode-aware, locale-independent. - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point. - Limited: does not support 1:M mapping defined in Unicode case folding. **Character.toLowerCase(int) / Character.toUpperCase(int)** - Locale-independent, single code point only. - No support for 1:M mappings. **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)** - Based on Unicode SpecialCasing.txt, supports 1:M mappings. - Intended primarily for presentation/display, not structural case-insensitive matching. - Requires full string conversion before comparison, which is less efficient and not intended for structural matching. **1:M mapping example, U+00DF (?)** - String.toUpperCase(Locale.ROOT, "?") ? "SS" - Case folding produces "ss", matching Unicode caseless comparison rules. jshell> "\u00df".equalsIgnoreCase("ss") $22 ==> false jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss") $24 ==> true ### Motivation & Direction Add Unicode standard-compliant case-less comparison methods to the String class, enabling & improving reliable and efficient Unicode-aware/compliant case-insensitive matching. - Unicode-compliant **full** case folding. - Simpler, stable and more efficient case-less matching without workarounds. - Brings Java's string comparison handling in line with other programming languages/libraries. This PR proposes to introduce the following comparison methods in `String` class - boolean equalsFoldCase(String anotherString) - int compareToFoldCase(String anotherString) - Comparator UNICODE_CASEFOLD_ORDER These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required. *Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string. However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate. ### The New API /** * Compares this {@code String} to another {@code String} for equality, * using Unicode case folding. Two strings are considered equal * by this method if their case-folded forms are identical. *

* Case folding is defined by the Unicode Standard in * CaseFolding.txt, * including 1:M mappings. For example, {@code "Ma?e".equalsFoldCase("MASSE")} * returns {@code true}, since the character {@code U+00DF} (sharp s) folds * to {@code "ss"}. *

* Case folding is locale-independent and language-neutral, unlike * locale-sensitive transformations such as {@link #toLowerCase()} or * {@link #toUpperCase()}. It is intended for caseless matching, * searching, and indexing. * * @apiNote * This method is the Unicode-compliant alternative to * {@link #equalsIgnoreCase(String)}. It implements full case folding as * defined by the Unicode Standard, which may differ from the simpler * per-character mapping performed by {@code equalsIgnoreCase}. * For example: *

{@snippet lang=java :
     * String a = "Ma?e";
     * String b = "MASSE";
     * boolean equalsFoldCase = a.equalsFoldCase(b);       // returns true
     * boolean equalsIgnoreCase = a.equalsIgnoreCase(b);   // returns false
     * }

* * @param anotherString * The {@code String} to compare this {@code String} against * * @return {@code true} if the given object is not {@code null} and represents * the same sequence of characters as this string under Unicode case * folding; {@code false} otherwise. * * @see #compareToFoldCase(String) * @see #equalsIgnoreCase(String) * @since 26 */ public boolean equalsFoldCase(String anotherString) /** * Compares two strings lexicographically using Unicode case folding. * This method returns an integer whose sign is that of calling {@code compareTo} * on the Unicode case folded version of the strings. Unicode Case folding * eliminates differences in case according to the Unicode Standard, using the * mappings defined in * CaseFolding.txt, * including 1:M mappings, such as {@code"?"} ? {@code }"ss"}. *

* Case folding is a locale-independent, language-neutral form of case mapping, * primarily intended for caseless matching. Unlike {@link #compareToIgnoreCase(String)}, * which applies a simpler locale-insensitive uppercase mapping. This method * follows the Unicode full case folding, providing stable and * consistent results across all environments. *

* Note that this method does not take locale into account, and may * produce results that differ from locale-sensitive ordering. Use * {@link java.text.Collator} for locale-sensitive comparison. * * @apiNote * This method is the Unicode-compliant alternative to * {@link #compareToIgnoreCase(String)}. It implements the full case folding * as defined by the Unicode Standard, which may differ from the simpler * per-character mapping performed by {@code compareToIgnoreCase}. * For example: *

{@snippet lang=java :
     * String a = "Ma?e";
     * String b = "MASSE";
     * int cmpFoldCase = a.compareToFoldCase(b);     // returns 0
     * int cmpIgnoreCase = a.compareToIgnoreCase(b); // returns > 0
     * }

* * @param str the {@code String} to be compared. * @return a negative integer, zero, or a positive integer as the specified * String is greater than, equal to, or less than this String, * ignoring case considerations by case folding. * @see #equalsFoldCase(String) * @see #compareToIgnoreCase(String) * @see java.text.Collator * @since 26 */ public int compareToFoldCase(String str) /** * A Comparator that orders {@code String} objects as by * {@link #compareToFoldCase(String) compareToFoldCase()}. * * @see #compareToFoldCase(String) * @since 26 */ public static final Comparator UNICODE_CASEFOLD_ORDER; ### Usage Examples Sharp s (U+00DF) case-folds to "ss" "stra?e".equalsIgnoreCase("strasse"); // false "stra?e".compareToIgnoreCase("strasse"); // != 0 "stra?e".equalsFoldCase("strasse"); // true ### Performance The JMH microbenchmark StringCompareToIgnoreCase has been updated to compare performance of compareToFoldCase with the existing compareToIgnoreCase(). Benchmark Mode Cnt Score Error Units StringCompareToIgnoreCase.asciiGreekLower avgt 15 20.195 ? 0.300 ns/op StringCompareToIgnoreCase.asciiGreekLowerCF avgt 15 11.051 ? 0.254 ns/op StringCompareToIgnoreCase.asciiGreekUpperLower avgt 15 6.035 ? 0.047 ns/op StringCompareToIgnoreCase.asciiGreekUpperLowerCF avgt 15 14.786 ? 0.382 ns/op StringCompareToIgnoreCase.asciiLower avgt 15 17.688 ? 1.396 ns/op StringCompareToIgnoreCase.asciiLowerCF avgt 15 44.552 ? 0.155 ns/op StringCompareToIgnoreCase.asciiUpperLower avgt 15 13.069 ? 0.487 ns/op StringCompareToIgnoreCase.asciiUpperLowerCF avgt 15 58.684 ? 0.274 ns/op StringCompareToIgnoreCase.greekLower avgt 15 20.642 ? 0.082 ns/op StringCompareToIgnoreCase.greekLowerCF avgt 15 7.255 ? 0.271 ns/op StringCompareToIgnoreCase.greekUpperLower avgt 15 5.737 ? 0.013 ns/op StringCompareToIgnoreCase.greekUpperLowerCF avgt 15 11.100 ? 1.147 ns/op StringCompareToIgnoreCase.lower avgt 15 20.192 ? 0.044 ns/op StringCompareToIgnoreCase.lowerrCF avgt 15 11.257 ? 0.259 ns/op StringCompareToIgnoreCase.supLower avgt 15 54.801 ? 0.415 ns/op StringCompareToIgnoreCase.supLowerCF avgt 15 15.207 ? 0.418 ns/op StringCompareToIgnoreCase.supUpperLower avgt 15 14.431 ? 0.188 ns/op StringCompareToIgnoreCase.supUpperLowerCF avgt 15 19.149 ? 0.985 ns/op StringCompareToIgnoreCase.upperLower avgt 15 5.650 ? 0.051 ns/op StringCompareToIgnoreCase.upperLowerCF avgt 15 14.338 ? 0.352 ns/op StringCompareToIgnoreCase.utf16SubLower avgt 15 14.774 ? 0.200 ns/op StringCompareToIgnoreCase.utf16SubLowerCF avgt 15 2.669 ? 0.041 ns/op StringCompareToIgnoreCase.utf16SupUpperLower avgt 15 16.250 ? 0.099 ns/op StringCompareToIgnoreCase.utf16SupUpperLowerCF avgt 15 11.524 ? 0.327 ns/op ### Refs [Unicode Standard 5.18.4 Caseless Matching](https://www.unicode.org/versions/latest/core-spec/chapter-5/#G21790) [Unicode? Standard Annex #44: 5.6 Case and Case Mapping](https://www.unicode.org/reports/tr44/#Casemapping) [Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches) [Unicode SpecialCasing.txt](https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt) [Unicode CaseFolding.txt](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt) ### Other Languages **Python string.casefold()** The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons. **Perl?s fc()** Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings. Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case. Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD] ad "prop_invmap()" in Unicode::UCD]. **ICU4J UCharacter.foldCase (Java)** Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF). Method Signature (String based): public static String foldCase(String str, int options) Method Signature (CharSequence & Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits) Key Features: Case Folding: Converts a string to its case-folded equivalent. Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings. Context Insensitive: The mapping of a character is not affected by surrounding characters. Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text. Result Length: The resulting string can be longer or shorter than the original. Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes. **u_strFoldCase (C/C++)** A lower-level C API function for case folding a string. Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior. Availability: Found in the ustring.h and unistr.h headers in the ICU4C library. ------------- Commit messages: - 8365675: Add String Unicode Case-Folding Support Changes: https://git.openjdk.org/jdk/pull/26892/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26892&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8365675 Stats: 1279 lines in 12 files changed: 1116 ins; 137 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/26892.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26892/head:pull/26892 PR: https://git.openjdk.org/jdk/pull/26892