<i18n dev> RFR: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16

Eirik Bjorsnos duke at openjdk.org
Tue Feb 28 09:44:00 UTC 2023


This PR continues the efforts from #12632 to speed up case-insensitive string matching.

We now tackle case-insensitive comparison of mixed-coder strings, implemented in `StringLatin1.regionMatchesCI_UTF16`

Key insights:

- If the UTF16 code point is also in latin1 range, we can leverage improvements from 12632 directly by calling `CharacterDataLatin1.equalsIgnoreCase`
- There are exactly 7 non-latin1 Unicode code points which case fold into the latin1 range. We can special-case our comparison of these code points by adding the method `CharacterDataLatin1.latin1CaseFold`.
- To avoid checking of `a == b` twice, this check is lifted out of `CharacterDataLatin1.equalsIgnoreCase` and the two callers are updated to check that `a != b` before calling the method. 
 
For completeness, the RegionMatches test is updated to also compare Turkic dotted/dotless 'i's against the uppercase ASCII 'I', not just the lowercase one.  Not stricktly related to the purpose of this PR, but it did help catch a regression introduced in an earlier iteration of the PR.   

To guard against regressions caused by future changes to the set of Unicode code points folding into latin1, a new test is added to `EqualsIgnoreCase` which identifies all such code points and verifies they are compared correcty.

Performance is tested for matching and mismatching cases of selected code point pairs picked from the ASCII letter, ASCII number, latin1 letter and non-latin Unicode letter ranges. Results in the first comment below.

-------------

Commit messages:
 - Inline local variable
 - latin1CaseFold was moved to CharacterDataLatin1
 - Move latin1CaseFold to CharacterDataLatin1
 - Improve latin1CaseFold javadocs
 - Simplify comments
 - Prefer fast matching by comparing for equality before checking latin1 range
 - Improve Javadocs of latin1CaseFold
 - Be consistent in comments
 - CharacterData.latin1LowerCase was renamed to latin1CaseFold
 - Hoist equality check out of CharacterDataLatin1.equalsIgnoreCase
 - ... and 13 more: https://git.openjdk.org/jdk/compare/f2b03f9a...92755920

Changes: https://git.openjdk.org/jdk/pull/12637/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12637&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8302872
  Stats: 169 lines in 5 files changed: 155 ins; 2 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/12637.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12637/head:pull/12637

PR: https://git.openjdk.org/jdk/pull/12637


More information about the i18n-dev mailing list