Proposed: StringUTF16 bug fix with optimization - Part 1 of 2, StringUTF16 Patch
Naoto Sato
naoto.sato at oracle.com
Wed Mar 31 21:20:55 UTC 2021
Hi Chris,
Thank you for your contribution. I believe this can be divided into two
parts, one is the bug in the current implementation, and the other is
the enhancement to refactor the whole implementation for performance. I
have created two JIRA issues for each:
https://bugs.openjdk.java.net/browse/JDK-8264544
https://bugs.openjdk.java.net/browse/JDK-8264545
The former one is really an embarassing one that I would want to fix
right away. I think the latter one need more eyes. Comments from others
are welcome.
Naoto
On 3/30/21 4:44 PM, Chris Johnson wrote:
> Historically, the methods currently known as "compareToCI" and
> "regionMatchesCI", and located in "java.lang.StringUTF16", have lacked
> support for Supplementary Multilingual Plane code-points. (I've seen no
> associated bug.)
>
> On July 23, 2020 the first fix for the bug was committed. However, it
> includes two simple bugs of its own. They're not much more than typos,
> but they break some things nonetheless, as demonstrated by the unit
> tests comprising part 2 of this contribution.
>
> (Those two bugs: In "StringUTF16.compareToCIImpl", change statements
> "cp1 -= cp1;" and "cp2 -= cp2;" to, respectively, "cp1 = -cp1;" and
> "cp2 = -cp2;", and those bugs are history.)
>
> The fixed "compareToCI" and "regionMatchesCI" methods in this patch are
> different, however.[1] Notably:
>
> 1. Case-insensitive UTF-16 string comparison and equality-testing is
> 2.2 to 2.9 times faster than the current methods, depending on data
> set composition. (So, meaningfully faster, but the degree varies
> with the strings compared.)
>
> 2. 100% TestNG unit test coverage.
>
> 3. A utility method, "compareCodePointsIgnoringCase", created for these
> methods increased their performance by 24%, and could, in the future,
> be applied easily to "StringLatin1" methods "compareToCI*", and
> "regionMatchesCI*". (My guess is the performance gain there will be
> smaller, but still useful.)
>
> This patch to "StringUTF16" may appear quite large. For better or worse
> (your call, gentle reader), the vast majority of it is comments. The code
> itself is minimal by comparison.
>
> Thanks in advance for your consideration,
>
> ----Chris
>
> Chris W. Johnson
> chriswjohnson.jdk at gmail.com
> http://www.panojohnson.com/
>
>
> Footnote:
>
> 1. Work on this code began around 4 years ago, when I failed to find a
> bug report link on the OpenJDK page, and supposed the message to devs
> might be something like "contribute or go away". Since then, life,
> death, work, learning to build & test OpenJDK, registering as a
> contributor, social anxiety, unit test development, benchmarking,
> experimentation, and being unable to look at my own code without
> seeing *something* to improve (especially considering it might end-up
> in the JDK), are among the reasons for the long delay in attempting
> this contribution.
>
>
>
>
>
>
> Index: src/java.base/share/classes/java/lang/StringUTF16.java
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>US-ASCII
> ===================================================================
> diff --git a/src/java.base/share/classes/java/lang/StringUTF16.java
> b/src/java.base/share/classes/java/lang/StringUTF16.java
> --- a/src/java.base/share/classes/java/lang/StringUTF16.java (revision
> 60819:ee1d592a9f5389725a0338a4b5dfcf4fc3fcf20c)
> +++ b/src/java.base/share/classes/java/lang/StringUTF16.java (revision
> 60819+:ee1d592a9f53+)
> @@ -41,7 +41,32 @@
> import static java.lang.String.LATIN1;
>
> final class StringUTF16 {
> -
> +
> + /**
> + * Number of low-order bits storing the payload of Unicode surrogate
> code-units.
> + * (The number of surrogate header bits is {@link Character#SIZE}
> minus this value.)
> + */
> + private static final int SURROGATE_PAYLOAD_BIT_COUNT = 10;
> +
> + /**
> + * The high six bits common to all high-surrogate code-units (bits
> 15-10) right-shifted into bits
> + * 5-0 of an "{@code int}" primitive. ({@code 0b1101_10**_****_****
>>>> 10 == 0b11_0110 == 0x36})
> + */
> + private static final int SURROGATE_HI_HEADER_RIGHT_SHIFTED_10 =
> Character.MIN_HIGH_SURROGATE >>> SURROGATE_PAYLOAD_BIT_COUNT; // == 0x36
> +
> + /**
> + * Produces a Supplementary Multilingual Plane code-point when added
> to a valid surrogate-pair
> + * combined as follows: {@code (highSurrogate <<
> SURROGATE_PAYLOAD_BIT_COUNT) + lowSurrogate}.
> + * The result is undefined if either would-be surrogate is invalid
> (not a surrogate code-unit,
> + * or the wrong surrogate type [low instead of high, or vice versa]).
> + *
> + * @see Character#toCodePoint(char, char)
> + */
> + private static final int SURROGATE_COMBO_TO_CODE_POINT_DELTA =
> Character.MIN_SUPPLEMENTARY_CODE_POINT
> + -
> (Character.MIN_HIGH_SURROGATE << SURROGATE_PAYLOAD_BIT_COUNT)
> + -
> Character.MIN_LOW_SURROGATE; // -56_613_888
> +
> +
> public static byte[] newBytesFor(int len) {
> if (len < 0) {
> throw new NegativeArraySizeException();
> @@ -318,93 +343,429 @@
> return -StringLatin1.compareToUTF16(other, value, len2, len1);
> }
>
> - public static int compareToCI(byte[] value, byte[] other) {
> - return compareToCIImpl(value, 0, length(value), other, 0,
> length(other));
> + /**
> + * Compares case-insensitively UTF-16 sequences in {@code byte}
> arrays. Performs
> + * {@linkplain #compareCodePointsIgnoringCase case conversions
> equivalent to
> + *
> <code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before
> + * comparing unequal code-points.
> + *
> + * @param charsA array of byte pairs representing 16-bit ({@code
> char}) quantities,
> + * with byte order mandated by {@code
> StringUTF16.isBigEndian()}
> + * @param charsB array of byte pairs representing 16-bit ({@code
> char}) quantities,
> + * with byte order mandated by {@code
> StringUTF16.isBigEndian()}
> + *
> + * @return negative if {@code charsA} is lexicographically less than
> {@code charsB},
> + * positive if it is greater, or zero if they are equal
> + */
> + static int compareToCI
> + (
> + final byte[] charsA,
> + final byte[] charsB
> + ){
> + // Consider: If the G1 garbage collector's string-deduplication mode
> is enabled, performance may
> + // be improved by first testing "charsA" and "charsB" for
> reference equality, and
> + // returning zero in that case. However, the same is
> equally true for all invocations
> + // of "String.compareToIgnoreCase" (not just those
> operating on two UTF-16 strings),
> + // so the test belongs in that method (ideally with magic
> causing the array reference
> + // equality test to occur only when G1 and its
> string-deduplication mode are active).
> + // Similarly modifying methods
> "String.equalsIgnoreCase" and "String.equals" may be
> + // even more beneficial, because they already include
> "String" object reference equality
> + // fast-paths, unlike "String.compareToIgnoreCase".
> + // Indeed, string-deduplication makes reference
> equality more likely between backing-
> + // store arrays than "String" objects, and therefore the
> more important fast-path test,
> + // provided real-world use of those methods frequently
> involves "String" objects whose
> + // tenure (number of garbage-collections survived) is >=
> the deduplication threshold (3
> + // by default, as of JDK 15, or the
> "-XX:StringDeduplicationAgeThreshold" value).
> +
> + final int lengthSignum = charsA.length -
> charsB.length;
> +
> + // If the array lengths are identical, return the "compareToCI"
> result; no additional
> + // considerations apply.
> +
> + if (lengthSignum == 0)
> + return compareToCI(charsA, 0, charsB, 0, length(charsA));
> +
> + // The array lengths differ. Compare the elements present in both
> "charsA" and "charsB".
> + // If the comparison result is inequality, return the comparison
> result. If the result
> + // is equality, the shorter array is considered less than the longer
> array.
> +
> + final int compareSignum = compareToCI(charsA, 0,
> charsB, 0, length(lengthSignum <= 0 ? charsA : charsB));
> +
> + // The right-shift (division by two) applied to "lengthSignum" below
> is necessary only
> + // if compatibility is required with code expecting specific return
> values (instead of
> + // any negative or positive value) from known comparisons.
> +
> + return compareSignum != 0 ? compareSignum : lengthSignum >> 1;
> }
> -
> - private static int compareToCIImpl(byte[] value, int toffset, int tlen,
> - byte[] other, int ooffset, int olen)
> {
> - int tlast = toffset + tlen;
> - int olast = ooffset + olen;
> - assert toffset >= 0 && ooffset >= 0;
> - assert tlast <= length(value);
> - assert olast <= length(other);
> -
> - for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast;
> k1++, k2++) {
> - int cp1 = (int)getChar(value, k1);
> - int cp2 = (int)getChar(other, k2);
> -
> - if (cp1 == cp2 || compareCodePointCI(cp1, cp2) == 0) {
> +
> + /**
> + * Compares case-insensitively UTF-16 sequences in {@code byte} array
> subsections.
> + * Performs {@linkplain #compareCodePointsIgnoringCase case
> conversions equivalent
> + * to
> <code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before
> + * comparing unequal code-points.
> + *
> + * @param charsA array of byte pairs representing 16-bit ({@code
> char}) quantities,
> + * with byte order mandated by {@code
> StringUTF16.isBigEndian()}
> + * @param charsAIndex index, in 16-bit quantities, at which comparison
> of "{@code charsA}"
> + * begins. Caller must enforce constraints "{@code
> 0 <= charsAIndex}"
> + * and "{@code 0 <= charsAIndex + totalChars <=
> charsA.length / 2}".
> + * @param charsB array of byte pairs representing 16-bit ({@code
> char}) quantities,
> + * with byte order mandated by {@code
> StringUTF16.isBigEndian()}
> + * @param charsBIndex index, in 16-bit quantities, at which comparison
> of "{@code charsB}"
> + * begins. Caller must enforce constraints "{@code
> 0 <= charsBIndex}"
> + * and "{@code 0 <= charsBIndex + totalChars <=
> charsB.length / 2}".
> + * @param totalChars maximum number of {@code char}s to compare.
> Caller must enforce
> + * constraint "{@code 0 <= totalChars}".
> + *
> + * @return negative if the compared portion of {@code charsA} is
> lexicographically less
> + * than that of {@code charsB}, positive if it is greater, or zero if
> they are equal
> + *
> + * @implNote <p>This method relies on the following "<b>Unicode
> Empirical Properties</b>":</p>
> + * <ol>
> + * <li>
> + * <p>Only code-points encoded using the same UTF-16
> high-surrogate may be case-insensitively equal.</p>
> + * <p>Several facts suggest sufficient proximity of code-point
> case-variants within the Supplementary
> + * Multilingual Plane (SMP):</p>
> + * <ul>
> + * <li>Related code-points are grouped into a Unicode
> "block".</li>
> + * <li>Blocks start at multiples of 256.</li>
> + * <li>High-surrogates encode 1,024 code-points starting
> with multiples of 1,024.</li>
> + * <li>Few SMP code-points are case-sensitive (450, as of
> Unicode 13.0.0).</li>
> + * </ul>
> + * <p>Thus, it is not surprising this property has held for 25
> years, as of February, 2021.</p>
> + * <li><p>For all SMP code-points, both {@code
> Character.toLowerCase(Character.toUpperCase(int))} and
> + * {@code Character.toUpperCase(int)} return a SMP code-point,
> never a Basic Multilingual Plane (BMP)
> + * code-point.</p>
> + * <p>Supporting points:</p>
> + * <ul>
> + * <li>This is expected due to related code-point
> grouping.</li>
> + * <li>If this property were ever invalidated, every
> Unicode case-insensitive equality test
> + * which requires equal-length input sequences (measured
> in {@code byte}s or {@code char}s)
> + * would break. In other words, this property is already
> utilized more-or-less universally by
> + * Unicode case-insensitive equality tests, whether or not
> their authors realize it. The
> + * Unicode Consortium has likely been aware of, and
> sensitive to, this issue for years.
> + * </li>
> + * </ul>
> + * </li>
> + * <li><p>For all BMP code-points, both {@code
> Character.toLowerCase(Character.toUpperCase(int))} and
> + * {@code Character.toUpperCase(int)} return a BMP code-point,
> never a SMP code-point.</p>
> + * <p>Supporting points:</p>
> + * <ul>
> + * <li>This is expected due to related code-point
> grouping.</li>
> + * <li>Most scripts affected by letter case were added
> early in Unicode's history, and therefore
> + * placed in the BMP. (As of Unicode 13.0.0, there are
> 2,349 case-sensitive code-points in the
> + * BMP, but only 450 in the SMP.)
> + * </li>
> + * <li>If this property were ever invalidated, every
> Unicode case-insensitive equality test
> + * which requires equal-length input sequences (measured
> in {@code byte}s or {@code char}s)
> + * would break. In other words, this property is already
> utilized more-or-less universally by
> + * Unicode case-insensitive equality tests, whether or not
> their authors realize it. The
> + * Unicode Consortium has likely been aware of, and
> sensitive to, this issue for years.
> + * </li>
> + * </ul>
> + * </li>
> + * </ol>
> + * <p>Validation of each property is performed automatically by unit
> tests in {@code CompareIC}
> + * ("test/jdk/java/lang/String/CompareIC.java") during normal,
> post-build JDK testing. Those unit tests
> + * methods are, respectively:
> + * </p>
> + * <ol>
> + * <li>{@code
> validatePremise_AllSMPCodePointCaseVariantsUseTheSameHighSurrogate}</li>
> + * <li>{@code
> validatePremise_AllSMPCodePointCaseVariantsAreSMPCodePoints}</li>
> + * <li>{@code
> validatePremise_AllBMPCodePointCaseVariantsAreBMPCodePoints}</li>
> + * </ol>
> + * <p>Those tests' Javadocs include discussion of the consequences
> should they ever fail.
> + * </p>
> + */
> + private static int compareToCI
> + (
> + final byte[] charsA, int charsAIndex,
> + final byte[] charsB, int charsBIndex, int totalChars
> + ){
> + // Assertions carried-over from the previous "regionMatchesCI"
> implementation, with additional
> + // negative value testing. (Are these assertions still necessary or
> desirable?)
> +
> + assert (charsAIndex | charsBIndex | totalChars) >= 0; // Ensure
> these parameters are non-negative.
> + assert charsAIndex + totalChars <= length(charsA);
> + assert charsBIndex + totalChars <= length(charsB);
> +
> + // Variable "priorExactlyEqualChar" (PEEC) is used by the following
> loop to store the "char" value read
> + // by its prior iteration, if the same "char" value was read from
> both inputs ("charsA" and "charsB").
> + // Otherwise, it is set to zero. PEEC's value is used only during the
> UTF-16 decoding attempt triggered
> + // by reading a pair of unequal low-surrogates (see Unicode empirical
> property no. 1). It removes from
> + // this algorithm all read-behind/-ahead operations, along with their
> extra tests & branches, and
> + // redundant array accesses.
> + // Note: PEEC can be zero for three reasons: (1) the loop is in
> its first iteration, (2) the prior
> + // iteration read code-point zero from both inputs, or (3) the prior
> iteration read different, but case-
> + // insensitively equal, Basic Multilingual Plane code-points. In all
> three cases, zero signifies absence
> + // of preceding, identical high-surrogates (as does any
> non-high-surrogate value). Thus, a zero value's
> + // cause is ambiguous, but its meaning is clear.
> +
> + int priorExactlyEqualChar = 0; // AKA "PEEC".
> +
> + while (--totalChars >= 0) {
> + int cA = getChar(charsA, charsAIndex++);
> + int cB = getChar(charsB, charsBIndex++);
> +
> + if (cA == cB) {
> +
> + // If the next iteration's "cA" and "cB" are different
> low-surrogate code-units, it will need this
> + // iteration's value of "cA"/"cB" (their presumed
> high-surrogate code-unit), so it is stored in
> + // "priorExactlyEqualChar" (PEEC). In all other cases, this
> caching of "cA"/"cB" in PEEC is a small
> + // waste of time. However, the time saved by eliminating the
> tests, branches and reads necessary to
> + // reacquire pairs of high-surrogates should compensate for
> many unnecessary writes to PEEC.
> +
> + priorExactlyEqualChar = cA;
> continue;
> }
> -
> - // Check for supplementary characters case
> - cp1 = codePointIncluding(value, cp1, k1, toffset, tlast);
> - if (cp1 < 0) {
> - k1++;
> - cp1 -= cp1;
> - }
> - cp2 = codePointIncluding(other, cp2, k2, ooffset, olast);
> - if (cp2 < 0) {
> - k2++;
> - cp2 -= cp2;
> - }
> -
> - int diff = compareCodePointCI(cp1, cp2);
> - if (diff != 0) {
> - return diff;
> - }
> - }
> - return tlen - olen;
> - }
> -
> - // Case insensitive comparison of two code points
> - private static int compareCodePointCI(int cp1, int cp2) {
> - // try converting both characters to uppercase.
> - // If the results match, then the comparison scan should
> - // continue.
> - cp1 = Character.toUpperCase(cp1);
> - cp2 = Character.toUpperCase(cp2);
> - if (cp1 != cp2) {
> - // Unfortunately, conversion to uppercase does not work
> properly
> - // for the Georgian alphabet, which has strange rules about
> case
> - // conversion. So we need to make one last check before
> - // exiting.
> - cp1 = Character.toLowerCase(cp1);
> - cp2 = Character.toLowerCase(cp2);
> - if (cp1 != cp2) {
> - return cp1 - cp2;
> +
> + // KNOWN: "cA" and "cB" are not precisely equal.
> +
> + // When "priorExactlyEqualChar" is a high-surrogate, and both
> "cA" and "cB" are low-surrogates, the
> + // following "if" block performs UTF-16 decoding, replacing the
> code-units in "cA" and "cB" with
> + // code-points. The "if" block then falls-through to the
> code-point case-insensitive comparison.
> + //
> + // If any UTF-16 decoding precondition is unmet, "cA" and "cB"
> will contain Basic Multilingual Plane
> + // (BMP) code-points suitable for case-insensitive comparison,
> unless UTF-16 encoding is corrupt in
> + // "charsA" and/or "charsB". The following cases are possible:
> + // * "priorExactlyEqualChar" is not a high-surrogate, so the
> "if" block is not entered.
> + // Possible states of "cA" and "cB":
> + // * "cA" and "cB" are unequal BMP code-points,
> potentially case-insensitively equal.
> + // * "cA" and "cB" are unequal high-surrogates.
> Case-insensitive equality testing inevitably
> + // fails for those code-units, terminating this method.
> Low-surrogate retrieval and UTF-16
> + // decoding are unnecessary due to Unicode empirical
> property no. 1.
> + // * "cA" and "cB" are unequal low-surrogates. UTF-16
> encoding is corrupt in both "charsA"
> + // and "charsB" (low-surrogates not preceded by
> high-surrogates). Case-insensitive equality
> + // testing inevitably fails for those code-units,
> terminating this method.
> + // * Either "cA" or "cB" is a low-surrogate, while the
> other is a high-surrogate or BMP code-
> + // point. UTF-16 encoding is corrupt in the array
> supplying the low-surrogate (no preceding
> + // high-surrogate). Case-insensitive equality testing
> inevitably fails, terminating this
> + // method.
> + // * "priorExactlyEqualChar" is a high-surrogate present in
> both "charsA" and "charsB".
> + // Possible states of "cA" and "cB":
> + // * "cA" and "cB" are unequal low-surrogates. The "if"
> block is entered, converting "cA"
> + // and "cB" into Supplementary Multilingual Plane (SMP)
> code-points, potentially case-
> + // insensitively equal.
> + // * "cA" and "cB" are unequal high-surrogates. UTF-16
> encoding is corrupt in both "charsA"
> + // and "charsB" (high-surrogates following
> high-surrogates). Even if the next loop
> + // iteration would retrieve low-surrogates, Unicode
> empirical property no. 1 rules-out
> + // case-insensitive equality based on the unequal
> high-surrogates alone.
> + // * "cA" and "cB" are unequal BMP code-points,
> potentially case-insensitively equal. UTF-16
> + // encoding is corrupt in both "charsA" and "charsB"
> (high-surrogates not followed by low-
> + // surrogates), but the corruption is ignored because
> the high-surrogates were identical
> + // (equality exists thus far, corrupt encoding
> notwithstanding).
> + // * Either "cA" or "cB" is a low-surrogate, while the
> other is a high-surrogate or BMP code-
> + // point. UTF-16 encoding is corrupt in the array
> supplying the non-low-surrogate (high-
> + // surrogate not followed by low-surrogate). UTF-16
> decoding using the lone low-surrogate
> + // is unnecessary, because either:
> + // 1. The non-low-surrogate is a high-surrogate,
> ruling-out case-insensitive equality.
> + // (Two independent rationales: [1] A
> high-surrogate is not a low-surrogate, by
> + // definition. Also by definition, [2] an SMP
> code-point--decoded from high-surrogate
> + // "priorExactlyEqualChar" and the lone
> low-surrogate--is not a high-surrogate code-
> + // unit.)
> + // 2. The non-low-surrogate is a BMP code-point.
> Unicode empirical properties 2 and 3
> + // rule-out case-insensitive equality between SMP
> and BMP code-points.
> + // (Alternately, if those properties did not
> exist, deciding equality exists based
> + // on code-points at different positions within
> the compared string sections--whose
> + // comparison requires overlooking an inconvenient
> preceding code-unit in one string--
> + // would be logically dubious on multiple grounds.
> For instance, if one preceding
> + // orphan surrogate can be ignored to prove
> equality, is the same true for two or more
> + // orphans consecutively and/or separately? If so,
> existing code breaks. Notably, most
> + // string equality tests would break due to
> invalidation of their equal-length strings
> + // precondition. [A precondition also invalidated
> if the Unicode Consortium ever
> + // asserts case-insensitive equality between a BMP
> and SMP code-point. Thus, Unicode
> + // empirical properties 2 and 3 will likely
> persist.])
> + //
> + // (Aside: Of the following three tests, only the test of
> "priorExactlyEqualChar" would be necessary
> + // if Unicode [non-compact] "String" objects guaranteed valid
> UTF-16 encoding. It seems certain
> + // enforcing that guarantee during creation of every "String" [or
> string-like object] is infeasible
> + // due to runtime cost, and backwards-incompatibility.
> Nonetheless, this writer [CWJ] is tantalized
> + // by the many code simplifications guaranteed-correct "String"
> objects would permit, as well as the
> + // clues they could confer on some developers [including this
> writer, once upon a time].)
> +
> + // Branch Reduction
> + //
> + // The "if" statement used below has two tests/branches, instead
> of three, due to combining two
> + // low-surrogate tests into one. Multiple comparative benchmarks
> (each using 300 random data sets
> + // of 200,000 comparisons, repeated 800 times [160 million
> comparisons per data set; 48 billion
> + // total comparisons]), run on two different OS & hardware
> combinations, yielded respective best-
> + // case improvements of 1.24% and 0.98%, and mean improvements of
> 1.79% and 0.62%, relative to
> + // three-branch "if" statement:
> + //
> + // if (priorExactlyEqualChar >>> SURROGATE_PAYLOAD_BIT_COUNT
> == SURROGATE_HI_HEADER_RIGHT_SHIFTED_10
> + // && cA >>> SURROGATE_PAYLOAD_BIT_COUNT
> == SURROGATE_LO_HEADER_RIGHT_SHIFTED_10
> + // && cB >>> SURROGATE_PAYLOAD_BIT_COUNT
> == SURROGATE_LO_HEADER_RIGHT_SHIFTED_10)
> + //
> + // Conversely, combining all three tests into one yields worse
> performance than the three-test
> + // "if" shown above, because high-surrogate testing of
> "priorExactlyEqualChar" fails frequently
> + // (50% minimum failure rate for well-formed UTF-16), and, when
> it fails, low-surrogate testing
> + // of "cA" and "cB" wastes time.
> +
> + if (priorExactlyEqualChar >>> SURROGATE_PAYLOAD_BIT_COUNT ==
> SURROGATE_HI_HEADER_RIGHT_SHIFTED_10
> + && (cA ^ Character.MIN_LOW_SURROGATE | cB ^
> Character.MIN_LOW_SURROGATE) >>> SURROGATE_PAYLOAD_BIT_COUNT == 0
> + ){
> + // KNOWN: "cA" and "cB" are not precisely equal.
> + // KNOWN: "priorExactlyEqualChar" is a high-surrogate
> code-unit.
> + // KNOWN: "cA" and "cB" are low-surrogate code-units. (Bits
> 15 to 10 of both are 110111.)
> +
> + // Perform UTF-16 decoding, accelerated by repurposing
> "priorExactlyEqualChar" (PEEC) to store
> + // the result of decoding operations common to both "cA" and
> "cB" (because they share the high-
> + // surrogate in PEEC). Near-total commonality of operations
> enables decoding two code-points
> + // using only one more addition operation than decoding a
> single code-point. (PEEC's repurposing
> + // is brief; it is always set to zero immediately after this
> "if" block.)
> +
> + priorExactlyEqualChar = (priorExactlyEqualChar <<
> SURROGATE_PAYLOAD_BIT_COUNT) + SURROGATE_COMBO_TO_CODE_POINT_DELTA;
> +
> + cA += priorExactlyEqualChar;
> + cB += priorExactlyEqualChar;
> }
> +
> + // "cA" and "cB" are not precisely equal, therefore set
> "priorExactlyEqualChar" (PEEC) to zero for
> + // the next iteration's potential benefit. ("Potential benefit,"
> because the next iteration, if any,
> + // uses PEEC's value only when its "cA" and "cB" are different
> low-surrogate code-units.)
> +
> + priorExactlyEqualChar = 0;
> +
> + // Having performed UTF-16 decoding (if necessary and possible),
> compare case-insensitively code-
> + // points "cA" and "cB".
> + //
> + // Note: Code-units are also possible. Normal terminating
> conditions include unequal high-surrogates
> + // (see Unicode empirical property no. 1), and a
> high-surrogate/code-point combination. An
> + // abnormal terminating condition is a low-surrogate and a
> code-point, meaning at least one
> + // input string's UTF-16 encoding is corrupt (a
> low-surrogate not preceded by a high-surrogate,
> + // or matching high-surrogates not followed--in one
> string--by a low-surrogate). None of those
> + // conditions undermines the code below, because
> "compareCodePointsIgnoringCase" evaluates
> + // "caseless" inputs unchanged, including code-units.
> +
> + cA = compareCodePointsIgnoringCase(cA, cB);
> + if (cA != 0)
> + return cA;
> }
> +
> return 0;
> }
> -
> - // Returns a code point from the code unit pointed by "index". If it is
> - // not a surrogate or an unpaired surrogate, then the code unit is
> - // returned as is. Otherwise, it is combined with the code unit before
> - // or after, depending on the type of the surrogate at index, to make a
> - // supplementary code point. The return value will be negated if the
> code
> - // unit pointed by index is a high surrogate, and index + 1 is a low
> surrogate.
> - private static int codePointIncluding(byte[] ba, int cp, int index,
> int start, int end) {
> - // fast check
> - if (!Character.isSurrogate((char)cp)) {
> - return cp;
> - }
> - if (Character.isLowSurrogate((char)cp)) {
> - if (index > start) {
> - char c = getChar(ba, index - 1);
> - if (Character.isHighSurrogate(c)) {
> - return Character.toCodePoint(c, (char)cp);
> - }
> - }
> - } else if (index + 1 < end) { // cp == high surrogate
> - char c = getChar(ba, index + 1);
> - if (Character.isLowSurrogate(c)) {
> - // negate the code point
> - return - Character.toCodePoint((char)cp, c);
> - }
> - }
> - return cp;
> +
> + /**
> + * Gets the case-insensitive, lexicographical relationship between
> code-points.
> + * Though not required, passing this method only unequal code-points
> ("{@code
> + * codePointA != codePointB}") is most efficient.
> + *
> + * @param codePointA first code-point
> + * @param codePointB second code-point
> + *
> + * @return "{@code codePointA - codePointB}" after both undergo
> case-conversion equivalent
> + * to, but faster than, {@code
> Character.toLowerCase(Character.toUpperCase(codePoint))}
> + *
> + * @implNote This method uses two, duplicate {@code switch}
> expressions, because
> + * benchmarking showed a noticeable performance improvement over a
> single {@code
> + * switch} wrapped in the most minimal of two-iteration loops. The
> duplication
> + * should create no maintenance problems, because, as detailed by a
> comment within
> + * the method, the {@code switch} source code is programmatically
> generated when a
> + * "CompareIC" unit test detects a mishandled "triple-case"
> code-point. Otherwise,
> + * modifying these expressions is unnecessary. (The unit test does not
> alter this,
> + * or any, file; it merely includes updated {@code switch} source in
> its error
> + * message, leaving a human responsible for pasting it here, twice.)
> + */
> + static int compareCodePointsIgnoringCase
> + (
> + final int codePointA,
> + final int codePointB
> + ){
> + /* A code-point is here dubbed "triple-case" if "codePoint !=
> Character.toUpperCase(codePoint) &&
> + codePoint !=
> Character.toLowerCase(Character.toUpperCase(codePoint))". Such code-points
> are rare
> + (see "switch" expressions below), but force most code-point
> comparisons to invoke both "toUpperCase"
> + and "toLowerCase". If "triple-case" code-points did not exist,
> invoking only "toLowerCase" would
> + suffice.
> +
> + This method comes close to realizing the "no triple-case
> code-points" scenario by directly converting
> + "triple-case" code-points to their final, lowercase forms using
> "switch" expressions, and invoking
> + only "toLowerCase" on all other code-points. The effect is a
> benchmark performance gain of 24.3%.
> +
> + Note: The Georgian alphabet includes no "triple-case" code-points,
> as shown below by the "progression"
> + columns of the "switch" expressions' comments. (As of
> Unicode 13.0.0, there are three Georgian
> + blocks: "Georgian" 10A0-10FF, "Georgian Extended" 1C90-1CBF,
> and "Georgian Supplement" 2D00-2D2F.)
> +
> +
> + TESTING AND CODE GENERATION
> +
> + A unit test in "CompareIC"
> ("test/jdk/java/lang/String/CompareIC.java") searches all 1,114,112 Unicode
> + code-points/-units for triple-case code-points, then validates
> their handling by "String.equalsIgnoreCase"
> + and "String.compareToIgnoreCase". If errors are detected, those
> tests fail with a message detailing each
> + error and supplying freshly generated source code for the "switch"
> expression used twice below. (If the
> + new and existing set of "case" clauses differ, pasting the former
> over the latter will fix the problem.
> + If "case" clauses are identical, the problem lies elsewhere.)
> + */
> + // Triple-Case Code-Points, as
> of Java 15.0.2. (Written by
> "java.lang.CompareIC.generateTripleCaseCodePointsSwitchExpression".)
> + //
> + // Code-Point Progression
> Code-Point Progression by Name
> + return switch (codePointA) { // --------------------------
>
> -------------------------------------------------------------------------------------------------------------------------------------
> + case '\u00B5' -> '\u03BC'; // U+00B5 -> U+039C -> U+03BC
> MICRO SIGN -> GREEK CAPITAL
> LETTER MU -> GREEK SMALL LETTER MU
> + case '\u0131' -> '\u0069'; // U+0131 -> U+0049 -> U+0069
> LATIN SMALL LETTER DOTLESS I -> LATIN CAPITAL
> LETTER I -> LATIN SMALL LETTER I
> + case '\u017F' -> '\u0073'; // U+017F -> U+0053 -> U+0073
> LATIN SMALL LETTER LONG S -> LATIN CAPITAL
> LETTER S -> LATIN SMALL LETTER S
> + case '\u01C5' -> '\u01C6'; // U+01C5 -> U+01C4 -> U+01C6
> LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON -> LATIN CAPITAL
> LETTER DZ WITH CARON -> LATIN SMALL LETTER DZ WITH CARON
> + case '\u01C8' -> '\u01C9'; // U+01C8 -> U+01C7 -> U+01C9
> LATIN CAPITAL LETTER L WITH SMALL LETTER J -> LATIN CAPITAL
> LETTER LJ -> LATIN SMALL LETTER LJ
> + case '\u01CB' -> '\u01CC'; // U+01CB -> U+01CA -> U+01CC
> LATIN CAPITAL LETTER N WITH SMALL LETTER J -> LATIN CAPITAL
> LETTER NJ -> LATIN SMALL LETTER NJ
> + case '\u01F2' -> '\u01F3'; // U+01F2 -> U+01F1 -> U+01F3
> LATIN CAPITAL LETTER D WITH SMALL LETTER Z -> LATIN CAPITAL
> LETTER DZ -> LATIN SMALL LETTER DZ
> + case '\u0345' -> '\u03B9'; // U+0345 -> U+0399 -> U+03B9
> COMBINING GREEK YPOGEGRAMMENI -> GREEK CAPITAL
> LETTER IOTA -> GREEK SMALL LETTER IOTA
> + case '\u03C2' -> '\u03C3'; // U+03C2 -> U+03A3 -> U+03C3
> GREEK SMALL LETTER FINAL SIGMA -> GREEK CAPITAL
> LETTER SIGMA -> GREEK SMALL LETTER SIGMA
> + case '\u03D0' -> '\u03B2'; // U+03D0 -> U+0392 -> U+03B2
> GREEK BETA SYMBOL -> GREEK CAPITAL
> LETTER BETA -> GREEK SMALL LETTER BETA
> + case '\u03D1' -> '\u03B8'; // U+03D1 -> U+0398 -> U+03B8
> GREEK THETA SYMBOL -> GREEK CAPITAL
> LETTER THETA -> GREEK SMALL LETTER THETA
> + case '\u03D5' -> '\u03C6'; // U+03D5 -> U+03A6 -> U+03C6
> GREEK PHI SYMBOL -> GREEK CAPITAL
> LETTER PHI -> GREEK SMALL LETTER PHI
> + case '\u03D6' -> '\u03C0'; // U+03D6 -> U+03A0 -> U+03C0
> GREEK PI SYMBOL -> GREEK CAPITAL
> LETTER PI -> GREEK SMALL LETTER PI
> + case '\u03F0' -> '\u03BA'; // U+03F0 -> U+039A -> U+03BA
> GREEK KAPPA SYMBOL -> GREEK CAPITAL
> LETTER KAPPA -> GREEK SMALL LETTER KAPPA
> + case '\u03F1' -> '\u03C1'; // U+03F1 -> U+03A1 -> U+03C1
> GREEK RHO SYMBOL -> GREEK CAPITAL
> LETTER RHO -> GREEK SMALL LETTER RHO
> + case '\u03F5' -> '\u03B5'; // U+03F5 -> U+0395 -> U+03B5
> GREEK LUNATE EPSILON SYMBOL -> GREEK CAPITAL
> LETTER EPSILON -> GREEK SMALL LETTER EPSILON
> + case '\u1C80' -> '\u0432'; // U+1C80 -> U+0412 -> U+0432
> CYRILLIC SMALL LETTER ROUNDED VE -> CYRILLIC
> CAPITAL LETTER VE -> CYRILLIC SMALL LETTER VE
> + case '\u1C81' -> '\u0434'; // U+1C81 -> U+0414 -> U+0434
> CYRILLIC SMALL LETTER LONG-LEGGED DE -> CYRILLIC
> CAPITAL LETTER DE -> CYRILLIC SMALL LETTER DE
> + case '\u1C82' -> '\u043E'; // U+1C82 -> U+041E -> U+043E
> CYRILLIC SMALL LETTER NARROW O -> CYRILLIC
> CAPITAL LETTER O -> CYRILLIC SMALL LETTER O
> + case '\u1C83' -> '\u0441'; // U+1C83 -> U+0421 -> U+0441
> CYRILLIC SMALL LETTER WIDE ES -> CYRILLIC
> CAPITAL LETTER ES -> CYRILLIC SMALL LETTER ES
> + case '\u1C84' -> '\u0442'; // U+1C84 -> U+0422 -> U+0442
> CYRILLIC SMALL LETTER TALL TE -> CYRILLIC
> CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE
> + case '\u1C85' -> '\u0442'; // U+1C85 -> U+0422 -> U+0442
> CYRILLIC SMALL LETTER THREE-LEGGED TE -> CYRILLIC
> CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE
> + case '\u1C86' -> '\u044A'; // U+1C86 -> U+042A -> U+044A
> CYRILLIC SMALL LETTER TALL HARD SIGN -> CYRILLIC
> CAPITAL LETTER HARD SIGN -> CYRILLIC SMALL LETTER HARD SIGN
> + case '\u1C87' -> '\u0463'; // U+1C87 -> U+0462 -> U+0463
> CYRILLIC SMALL LETTER TALL YAT -> CYRILLIC
> CAPITAL LETTER YAT -> CYRILLIC SMALL LETTER YAT
> + case '\u1C88' -> '\uA64B'; // U+1C88 -> U+A64A -> U+A64B
> CYRILLIC SMALL LETTER UNBLENDED UK -> CYRILLIC
> CAPITAL LETTER MONOGRAPH UK -> CYRILLIC SMALL LETTER MONOGRAPH UK
> + case '\u1E9B' -> '\u1E61'; // U+1E9B -> U+1E60 -> U+1E61
> LATIN SMALL LETTER LONG S WITH DOT ABOVE -> LATIN CAPITAL
> LETTER S WITH DOT ABOVE -> LATIN SMALL LETTER S WITH DOT ABOVE
> + case '\u1FBE' -> '\u03B9'; // U+1FBE -> U+0399 -> U+03B9
> GREEK PROSGEGRAMMENI -> GREEK CAPITAL
> LETTER IOTA -> GREEK SMALL LETTER IOTA
> + // --------------------------
>
> -------------------------------------------------------------------------------------------------------------------------------------
> + // All other case-sensitive code-points are either uppercase (in
> which case they are changed
> + // below), or lowercase already (in which case they are not).
> Therefore, only "toLowerCase"
> + // is necessary. Code-units and case-insensitive code-points are
> unchanged by "toLowerCase".
> + default -> Character.toLowerCase(codePointA);
> +
> + } - switch (codePointB) { // --------------------------
>
> -------------------------------------------------------------------------------------------------------------------------------------
> + case '\u00B5' -> '\u03BC'; // U+00B5 -> U+039C -> U+03BC
> MICRO SIGN -> GREEK CAPITAL
> LETTER MU -> GREEK SMALL LETTER MU
> + case '\u0131' -> '\u0069'; // U+0131 -> U+0049 -> U+0069
> LATIN SMALL LETTER DOTLESS I -> LATIN CAPITAL
> LETTER I -> LATIN SMALL LETTER I
> + case '\u017F' -> '\u0073'; // U+017F -> U+0053 -> U+0073
> LATIN SMALL LETTER LONG S -> LATIN CAPITAL
> LETTER S -> LATIN SMALL LETTER S
> + case '\u01C5' -> '\u01C6'; // U+01C5 -> U+01C4 -> U+01C6
> LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON -> LATIN CAPITAL
> LETTER DZ WITH CARON -> LATIN SMALL LETTER DZ WITH CARON
> + case '\u01C8' -> '\u01C9'; // U+01C8 -> U+01C7 -> U+01C9
> LATIN CAPITAL LETTER L WITH SMALL LETTER J -> LATIN CAPITAL
> LETTER LJ -> LATIN SMALL LETTER LJ
> + case '\u01CB' -> '\u01CC'; // U+01CB -> U+01CA -> U+01CC
> LATIN CAPITAL LETTER N WITH SMALL LETTER J -> LATIN CAPITAL
> LETTER NJ -> LATIN SMALL LETTER NJ
> + case '\u01F2' -> '\u01F3'; // U+01F2 -> U+01F1 -> U+01F3
> LATIN CAPITAL LETTER D WITH SMALL LETTER Z -> LATIN CAPITAL
> LETTER DZ -> LATIN SMALL LETTER DZ
> + case '\u0345' -> '\u03B9'; // U+0345 -> U+0399 -> U+03B9
> COMBINING GREEK YPOGEGRAMMENI -> GREEK CAPITAL
> LETTER IOTA -> GREEK SMALL LETTER IOTA
> + case '\u03C2' -> '\u03C3'; // U+03C2 -> U+03A3 -> U+03C3
> GREEK SMALL LETTER FINAL SIGMA -> GREEK CAPITAL
> LETTER SIGMA -> GREEK SMALL LETTER SIGMA
> + case '\u03D0' -> '\u03B2'; // U+03D0 -> U+0392 -> U+03B2
> GREEK BETA SYMBOL -> GREEK CAPITAL
> LETTER BETA -> GREEK SMALL LETTER BETA
> + case '\u03D1' -> '\u03B8'; // U+03D1 -> U+0398 -> U+03B8
> GREEK THETA SYMBOL -> GREEK CAPITAL
> LETTER THETA -> GREEK SMALL LETTER THETA
> + case '\u03D5' -> '\u03C6'; // U+03D5 -> U+03A6 -> U+03C6
> GREEK PHI SYMBOL -> GREEK CAPITAL
> LETTER PHI -> GREEK SMALL LETTER PHI
> + case '\u03D6' -> '\u03C0'; // U+03D6 -> U+03A0 -> U+03C0
> GREEK PI SYMBOL -> GREEK CAPITAL
> LETTER PI -> GREEK SMALL LETTER PI
> + case '\u03F0' -> '\u03BA'; // U+03F0 -> U+039A -> U+03BA
> GREEK KAPPA SYMBOL -> GREEK CAPITAL
> LETTER KAPPA -> GREEK SMALL LETTER KAPPA
> + case '\u03F1' -> '\u03C1'; // U+03F1 -> U+03A1 -> U+03C1
> GREEK RHO SYMBOL -> GREEK CAPITAL
> LETTER RHO -> GREEK SMALL LETTER RHO
> + case '\u03F5' -> '\u03B5'; // U+03F5 -> U+0395 -> U+03B5
> GREEK LUNATE EPSILON SYMBOL -> GREEK CAPITAL
> LETTER EPSILON -> GREEK SMALL LETTER EPSILON
> + case '\u1C80' -> '\u0432'; // U+1C80 -> U+0412 -> U+0432
> CYRILLIC SMALL LETTER ROUNDED VE -> CYRILLIC
> CAPITAL LETTER VE -> CYRILLIC SMALL LETTER VE
> + case '\u1C81' -> '\u0434'; // U+1C81 -> U+0414 -> U+0434
> CYRILLIC SMALL LETTER LONG-LEGGED DE -> CYRILLIC
> CAPITAL LETTER DE -> CYRILLIC SMALL LETTER DE
> + case '\u1C82' -> '\u043E'; // U+1C82 -> U+041E -> U+043E
> CYRILLIC SMALL LETTER NARROW O -> CYRILLIC
> CAPITAL LETTER O -> CYRILLIC SMALL LETTER O
> + case '\u1C83' -> '\u0441'; // U+1C83 -> U+0421 -> U+0441
> CYRILLIC SMALL LETTER WIDE ES -> CYRILLIC
> CAPITAL LETTER ES -> CYRILLIC SMALL LETTER ES
> + case '\u1C84' -> '\u0442'; // U+1C84 -> U+0422 -> U+0442
> CYRILLIC SMALL LETTER TALL TE -> CYRILLIC
> CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE
> + case '\u1C85' -> '\u0442'; // U+1C85 -> U+0422 -> U+0442
> CYRILLIC SMALL LETTER THREE-LEGGED TE -> CYRILLIC
> CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE
> + case '\u1C86' -> '\u044A'; // U+1C86 -> U+042A -> U+044A
> CYRILLIC SMALL LETTER TALL HARD SIGN -> CYRILLIC
> CAPITAL LETTER HARD SIGN -> CYRILLIC SMALL LETTER HARD SIGN
> + case '\u1C87' -> '\u0463'; // U+1C87 -> U+0462 -> U+0463
> CYRILLIC SMALL LETTER TALL YAT -> CYRILLIC
> CAPITAL LETTER YAT -> CYRILLIC SMALL LETTER YAT
> + case '\u1C88' -> '\uA64B'; // U+1C88 -> U+A64A -> U+A64B
> CYRILLIC SMALL LETTER UNBLENDED UK -> CYRILLIC
> CAPITAL LETTER MONOGRAPH UK -> CYRILLIC SMALL LETTER MONOGRAPH UK
> + case '\u1E9B' -> '\u1E61'; // U+1E9B -> U+1E60 -> U+1E61
> LATIN SMALL LETTER LONG S WITH DOT ABOVE -> LATIN CAPITAL
> LETTER S WITH DOT ABOVE -> LATIN SMALL LETTER S WITH DOT ABOVE
> + case '\u1FBE' -> '\u03B9'; // U+1FBE -> U+0399 -> U+03B9
> GREEK PROSGEGRAMMENI -> GREEK CAPITAL
> LETTER IOTA -> GREEK SMALL LETTER IOTA
> + // --------------------------
>
> -------------------------------------------------------------------------------------------------------------------------------------
> + // All other case-sensitive code-points are either uppercase (in
> which case they are changed
> + // below), or lowercase already (in which case they are not).
> Therefore, only "toLowerCase"
> + // is necessary. Code-units and case-insensitive code-points are
> unchanged by "toLowerCase".
> + default -> Character.toLowerCase(codePointB);
> + };
> }
>
> public static int compareToCI_Latin1(byte[] value, byte[] other) {
> @@ -780,10 +1141,35 @@
> }
> return new String(result, UTF16);
> }
> -
> - public static boolean regionMatchesCI(byte[] value, int toffset,
> - byte[] other, int ooffset, int
> len) {
> - return compareToCIImpl(value, toffset, len, other, ooffset, len)
> == 0;
> +
> + /**
> + * Tests case-insensitive equality of UTF-16 sequences in {@code byte}
> array subsections.
> + * Performs {@linkplain #compareCodePointsIgnoringCase case
> conversions equivalent to
> + *
> <code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>}
> before comparing
> + * unequal code-points.
> + *
> + * @param charsA array of byte pairs representing 16-bit ({@code
> char}) quantities,
> + * with byte order mandated by {@code
> StringUTF16.isBigEndian()}
> + * @param charsAIndex index, in 16-bit quantities, at which comparison
> of "{@code charsA}"
> + * begins. Caller must enforce constraints "{@code
> 0 <= charsAIndex}"
> + * and "{@code 0 <= charsAIndex + totalChars <=
> charsA.length / 2}".
> + * @param charsB array of byte pairs representing 16-bit ({@code
> char}) quantities,
> + * with byte order mandated by {@code
> StringUTF16.isBigEndian()}
> + * @param charsBIndex index, in 16-bit quantities, at which comparison
> of "{@code charsB}"
> + * begins. Caller must enforce constraints "{@code
> 0 <= charsBIndex}"
> + * and "{@code 0 <= charsBIndex + totalChars <=
> charsB.length / 2}".
> + * @param totalChars maximum number of {@code char}s to compare.
> Caller must enforce
> + * constraint "{@code 0 <= totalChars}".
> + *
> + * @return {@code true} if the tested portions of {@code charsA} and
> {@code charsB} are
> + * case-insensitively equal, otherwise {@code false}
> + */
> + static boolean regionMatchesCI
> + (
> + final byte[] charsA, final int charsAIndex,
> + final byte[] charsB, final int charsBIndex, final int totalChars
> + ){
> + return compareToCI(charsA, charsAIndex, charsB, charsBIndex,
> totalChars) == 0;
> }
>
> public static boolean regionMatchesCI_Latin1(byte[] value, int toffset,
>
More information about the core-libs-dev
mailing list