<i18n dev> RFR: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16
Eirik Bjorsnos
duke at openjdk.org
Tue Feb 28 09:44:00 UTC 2023
This PR continues the efforts from #12632 to speed up case-insensitive string matching.
We now tackle case-insensitive comparison of mixed-coder strings, implemented in `StringLatin1.regionMatchesCI_UTF16`.
Key insights:
- If the UTF16 code point is also in latin1 range, we can leverage improvements from #12632 directly by calling `CharacterDataLatin1.equalsIgnoreCase`.
- There are exactly 7 non-latin1 Unicode code points which case fold into the latin1 range. We can special-case our comparison of these code points by adding the method `CharacterDataLatin1.latin1CaseFold`.
- To avoid checking `a == b` twice, this check is lifted out of `CharacterDataLatin1.equalsIgnoreCase`, and its two callers are updated to check that `a != b` before calling the method.
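As a rough illustration of how these three points fit together, here is a standalone sketch of the comparison loop. It is not the code in this PR: `CharacterDataLatin1` is JDK-internal, so the helpers `latin1EqualsIgnoreCase` and `caseFold` below are hypothetical stand-ins built on the public `Character` API, and the UTF16 side is modelled as a plain `char[]` rather than a byte array.

```java
import java.nio.charset.StandardCharsets;

class MixedCoderMatchSketch {
    // Hypothetical stand-in for the latin1 fast path.
    // The caller is expected to have checked a != b already.
    static boolean latin1EqualsIgnoreCase(char a, char b) {
        char ua = Character.toUpperCase(a);
        char ub = Character.toUpperCase(b);
        return ua == ub || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    // Hypothetical stand-in for a case fold: upper-case, then lower-case.
    static char caseFold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    static boolean regionMatchesCI(byte[] latin1, int toffset,
                                   char[] utf16, int ooffset, int len) {
        for (int i = 0; i < len; i++) {
            char c1 = (char) (latin1[toffset + i] & 0xff);
            char c2 = utf16[ooffset + i];
            if (c1 == c2) {
                continue; // equality is checked once, before anything else
            }
            if (c2 < 0x100) {
                // Both chars are in latin1 range: reuse the latin1 comparison.
                if (latin1EqualsIgnoreCase(c1, c2)) {
                    continue;
                }
            } else if (caseFold(c1) == caseFold(c2)) {
                // Only a handful of non-latin1 chars can ever match here.
                continue;
            }
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] latin1 = "kelvin".getBytes(StandardCharsets.ISO_8859_1);
        char[] utf16 = "\u212AELVIN".toCharArray(); // starts with KELVIN SIGN
        System.out.println(regionMatchesCI(latin1, 0, utf16, 0, latin1.length));
    }
}
```

The sketch folds both sides in the non-latin1 branch for simplicity; a dedicated fold for just the special code points, as described above, avoids the general-purpose `Character` lookups.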
For completeness, the RegionMatches test is updated to also compare the Turkic dotted and dotless 'i' against the uppercase ASCII 'I', not just the lowercase one. This is not strictly related to the purpose of this PR, but it did help catch a regression introduced in an earlier iteration of the PR.
To guard against regressions caused by future changes to the set of Unicode code points folding into latin1, a new test is added to `EqualsIgnoreCase` which identifies all such code points and verifies they are compared correctly.
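The idea behind such a test can be sketched with the public API alone. The class below is a hypothetical illustration, not the test added in this PR: it scans the BMP for non-latin1 chars whose fold (upper-case, then lower-case, an assumption about what "folding" means here) lands in latin1 range, then checks each one against its folded form with `String.equalsIgnoreCase`. The set it finds depends on the Unicode version of the running JDK.

```java
import java.util.ArrayList;
import java.util.List;

class Latin1FoldScan {
    // Assumed fold: upper-case first, then lower-case.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Collects every non-latin1 BMP char whose fold is in latin1 range.
    static List<Character> foldingIntoLatin1() {
        List<Character> result = new ArrayList<>();
        for (int c = 0x100; c <= 0xFFFF; c++) {
            if (fold((char) c) < 0x100) {
                result.add((char) c);
            }
        }
        return result;
    }

    // Checks that each char found compares equal, ignoring case,
    // to the latin1 char it folds to.
    static boolean allCompareEqual(List<Character> chars) {
        for (char c : chars) {
            String nonLatin1 = String.valueOf(c);
            String latin1 = String.valueOf(fold(c));
            if (!nonLatin1.equalsIgnoreCase(latin1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Character> found = foldingIntoLatin1();
        for (char c : found) {
            System.out.printf("U+%04X folds to U+%04X%n", (int) c, (int) fold(c));
        }
        System.out.println("all compare equal: " + allCompareEqual(found));
    }
}
```

Because the scan derives the set at run time instead of hard-coding it, a future Unicode update that adds another such code point would show up in the output rather than pass silently.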
Performance is tested for matching and mismatching cases of selected code point pairs picked from the ASCII letter, ASCII number, latin1 letter and non-latin1 Unicode letter ranges. Results are in the first comment below.
-------------
Commit messages:
- Inline local variable
- latin1CaseFold was moved to CharacterDataLatin1
- Move latin1CaseFold to CharacterDataLatin1
- Improve latin1CaseFold javadocs
- Simplify comments
- Prefer fast matching by comparing for equality before checking latin1 range
- Improve Javadocs of latin1CaseFold
- Be consistent in comments
- CharacterData.latin1LowerCase was renamed to latin1CaseFold
- Hoist equality check out of CharacterDataLatin1.equalsIgnoreCase
- ... and 13 more: https://git.openjdk.org/jdk/compare/f2b03f9a...92755920
Changes: https://git.openjdk.org/jdk/pull/12637/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12637&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8302872
Stats: 169 lines in 5 files changed: 155 ins; 2 del; 12 mod
Patch: https://git.openjdk.org/jdk/pull/12637.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/12637/head:pull/12637
PR: https://git.openjdk.org/jdk/pull/12637