<i18n dev> RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

Tue Feb 21 20:28:30 UTC 2023

On Tue, 21 Feb 2023 14:27:03 GMT, Alan Bateman <alanb at openjdk.org> wrote:

>> Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Remove whitespace following '('
>
> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 175:
> 
>> 173:          }
>> 174:          // uppercase b1 using 'the oldest ASCII trick in the book'
>> 175:          int U = b1 & 0xDF;
> 
> I'm sure some people reading this comment will wonder which book :-) It might be better to drop that bit and if possible, find a better name for "U" as normally variables start with a lower case.

Hi Alan,

I thought I was clever by encoding the 'uppercaseness' in the variable name, but yeah I'll find a better name :)

There is some precedent for using the 'ASCII trick' comment in the JDK. I found it in ZipFile.isMetaName, which is also where I first learned about this interesting relationship between uppercase and lowercase ASCII (and also latin1) letters.

The comment was first added by Martin Buchholz back in 2016 as part of JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating this code and Lance had some input about clarifying the comment. Martin then [chimed in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) to defend his comment:

> I still like my ancient "ASCII trick" comment.

I think this 'trick', whatever we call it, is sufficiently intricate that it deserves to be called out somehow and that we should not just casually bitmask with these magic constants without any discussion at all. 
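
To make the concern concrete, here is a minimal sketch of what the bare mask does when it is applied without any range check. The class and the `naiveUpper` helper are made up for illustration and are not part of the PR:

```java
public class AsciiTrickPitfalls {
    // Hypothetical helper: clears bit 0x20 without checking that c is a letter.
    static int naiveUpper(int c) {
        return c & 0xDF;
    }

    public static void main(String[] args) {
        // Works as intended for letters: 'c' (0x63) becomes 'C' (0x43).
        System.out.println(naiveUpper('c') == 'C');
        // Non-letters get mangled too: '{' (0x7B) becomes '[' (0x5B),
        // so a mask-only comparison would treat '{' and '[' as equal.
        System.out.println(naiveUpper('{') == '[');
        // The division sign 0xF7 maps onto the multiplication sign 0xD7.
        System.out.println(naiveUpper(0xF7) == 0xD7);
    }
}
```

All three comparisons hold, which is why the mask only makes sense together with a check that the result lands in a letter range.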

An earlier iteration of this PR included a small essay in the javadoc of this method describing the layout and relationship of letters in latin1 and how we can apply that knowledge of the layout to implement the method.

How would you feel about adding that description back to the Javadocs? This would then live close to the similarly implemented toUpperCase and toLowerCase methods currently under review in #12623. 

Here's the updated discussion included in the Javadoc:

    /**
     * Compares two latin1 code points, ignoring case considerations.
     *
     * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
     * letters are found in the following code point ranges:
     *
     * 0x41-0x5A: Uppercase ASCII letters: A-Z
     * 0x61-0x7A: Lowercase ASCII letters: a-z
     * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
     * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
     * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
     * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
     *
     * While both ASCII letter ranges are contiguous, the latin1 ranges are not:
     *
     * The 'multiplication sign' 0xD7 splits the uppercase range in two.
     * The 'division sign' 0xF7 splits the lowercase range in two.
     *
     * Lowercase letters are found 32 positions (0x20) after their corresponding uppercase letter.
     * The 'division sign' and 'multiplication sign' have the same relative distance.
     *
     * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in the book' to
     * lowercase any letter by setting the bit:
     *
     * ('C' | 0x20) == 'c'
     *
     * By removing the bit, we can perform the uppercase operation:
     *
     * ('c' & 0xDF) == 'C'
     *
     * Applying this knowledge of the latin1 layout, we can test for equality ignoring case by
     * checking that the code points are either equal, or that one of the code points is a letter
     * whose uppercase is the same as the uppercase of the other code point.
     *
     * @param b1 byte representing a latin1 code point
     * @param b2 another byte representing a latin1 code point
     * @return true if the two bytes are considered equal ignoring case in latin1
     */
    static boolean equalsIgnoreCase(byte b1, byte b2) {
        if (b1 == b2) {
            return true;
        }
        int upper = b1 & 0xDF;
        if (upper < 'A') {
            return false;  // Low ASCII
        }
        return (upper <= 'Z' // In range A-Z
                || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7)) // ..or A-grave-Thorn, excl. multiplication
                && upper == (b2 & 0xDF); // b2 has same uppercase
    }
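
One way to sanity-check the method above is to compare it against `Character.toUpperCase` / `Character.toLowerCase` for all 256 x 256 byte pairs. The following is only a sketch: the class name, the `reference` helper and `countMismatches` are hypothetical names, and the reference rule (same char, same uppercase, or same lowercase) is my assumption of what "equal ignoring case" should mean here, not a test taken from the PR:

```java
public class Latin1EqualsIgnoreCaseCheck {
    // Copy of the method from the message above.
    static boolean equalsIgnoreCase(byte b1, byte b2) {
        if (b1 == b2) {
            return true;
        }
        int upper = b1 & 0xDF;
        if (upper < 'A') {
            return false;  // Low ASCII
        }
        return (upper <= 'Z'
                || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7))
                && upper == (b2 & 0xDF);
    }

    // Assumed reference rule, built on the java.lang.Character case mappings.
    static boolean reference(char c1, char c2) {
        return c1 == c2
                || Character.toUpperCase(c1) == Character.toUpperCase(c2)
                || Character.toLowerCase(c1) == Character.toLowerCase(c2);
    }

    // Counts byte pairs where the bit-mask method and the reference disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int i = 0; i < 256; i++) {
            for (int j = 0; j < 256; j++) {
                boolean fast = equalsIgnoreCase((byte) i, (byte) j);
                boolean slow = reference((char) i, (char) j);
                if (fast != slow) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

The interesting inputs are the ones the range checks exist for: 0xD7/0xF7 (multiplication and division signs), 0xDF (sharp s), 0xFF (y with diaeresis) and 0xB5 (micro sign), none of which has a latin1 case partner.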

-------------

PR: https://git.openjdk.org/jdk/pull/12632