<i18n dev> RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]

Thu Feb 23 07:24:09 UTC 2023

On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos <duke at openjdk.org> wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is updated to use `equalsIgnoreCase`.
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added to `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 code point pairs have an `equalsIgnoreCase` result consistent with `Character.toUpperCase` and `Character.toLowerCase`.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs picked from the ASCII letter, ASCII number and latin1 letter ranges. Results in the first comment below.
>
> Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 21 additional commits since the last revision:
> 
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Make the loop variables chars to avoid casting
>  - Use improved case-twiddling comment as suggested by Martin
>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase with "by removing (setting) a single bit"
>  - Align local variable naming in toLowerCase, toUpperCase with equalsIgnoreCase by using 'lower' and 'upper'
>  - Rename unconventionally named local variable 'U' to 'upper'
>  - Merge remote-tracking branch 'origin/master' into regionmatches-latin1-speedup
>  - Add whitespace between methods
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - ... and 11 more: https://git.openjdk.org/jdk/compare/31689be3...597b346a

I found this in Appendix A of the 1973 `Draft Proposed Revision of ASCII`. It seems that compatibility with existing 6-bit devices may have been the primary concern:

A 6.4 It is expected that devices having the capability of
printing only 64 graphic symbols will continue to be important.
It may be desirable to arrange these devices to print one symbol
for the bit pattern of both upper and lower case of a given
alphabetic letter. To facilitate this, there should be a single-
bit difference between the upper and lowercase representations
of any given letter. Combined with the requirement that a given
case of the alphabet be contiguous, this dictated the assignment
of the alphabet, as shown in columns 4 through 7.

<img width="932" alt="ascii" src="https://user-images.githubusercontent.com/300291/220842000-3efa64fe-9154-4069-9e81-e202d5731f6f.png">

https://ia800606.us.archive.org/17/items/enf-ascii-1972-1975/Image070917152640_text.pdf
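
For readers who want to see the single-bit property from A 6.4 in code, here is a minimal sketch. It is not the code from this PR: the class `AsciiCaseBit`, the constant `CASE_BIT` and the helper `equalsIgnoreCaseAscii` are hypothetical names, and the comparison only handles the ASCII letters, not the full latin1 range that `CharacterDataLatin1.equalsIgnoreCase` has to cover.

```java
final class AsciiCaseBit {
    // The single bit that separates 'A' (0x41) from 'a' (0x61) in ASCII.
    static final int CASE_BIT = 0x20;

    // Hypothetical ASCII-only helper, for illustration only.
    // Clearing CASE_BIT maps 'a'..'z' onto 'A'..'Z'; the range check keeps
    // non-letter pairs such as '@' (0x40) and '`' (0x60) from matching.
    static boolean equalsIgnoreCaseAscii(byte b1, byte b2) {
        if (b1 == b2) {
            return true;
        }
        int upper = b1 & ~CASE_BIT;
        return upper >= 'A' && upper <= 'Z' && upper == (b2 & ~CASE_BIT);
    }

    public static void main(String[] args) {
        // Setting or clearing the bit agrees with the JDK for every ASCII letter.
        for (char c = 'A'; c <= 'Z'; c++) {
            char lower = (char) (c | CASE_BIT);
            if (lower != Character.toLowerCase(c)
                    || (char) (lower & ~CASE_BIT) != Character.toUpperCase(lower)) {
                throw new AssertionError("case bit mismatch for " + c);
            }
        }
        System.out.println(equalsIgnoreCaseAscii((byte) 'q', (byte) 'Q')); // true
        System.out.println(equalsIgnoreCaseAscii((byte) '@', (byte) '`')); // false
    }
}
```

The range check is what makes the contiguity requirement in the quoted paragraph matter: because each case of the alphabet occupies one contiguous run, a single comparison against `'A'..'Z'` is enough to decide whether flipping the bit is meaningful.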

-------------

PR: https://git.openjdk.org/jdk/pull/12632