Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?

Claes Redestad claes.redestad at oracle.com
Thu May 19 13:40:09 UTC 2022


Yes, I think 20% is close to an upper bound, with less gain on smaller
strings due to improved cache locality. And, as you say, such a small
speed-up might not be very noticeable in practice when I/O is involved.

/Claes

On 2022-05-19 14:17, Glavo wrote:
> Thank you very much for your work, it seems that the decoding speed of 
> the new JDK versions has been greatly improved.
> 
> However, `hasNegatives` still has a cost on my device. I tried removing 
> the call to hasNegatives in `String.encodeUTF8` based on JDK 17.0.3, 
> then tested its performance through JMH. I ran the JMH benchmark on 
> two of my x86 machines (one with a Ryzen 7 3700X CPU, the other with a 
> Xeon W-2175) and got about a 20% performance boost on both machines.
> 
> Of course, the decoding performance in the new JDK is better than I 
> thought. For an ASCII string of length 10000, hasNegatives takes about 
> 150 nanoseconds, and the duration increases linearly with the length. 
> Considering that these operations usually accompany I/O, the overhead 
> is really not as noticeable as I thought.
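
For reference, the check being measured amounts to a linear scan for any byte with its high bit set. A minimal scalar sketch of that logic (the real `StringCoding.hasNegatives` is a HotSpot intrinsic that gets vectorized, which is how it stays this cheap):

```java
import java.nio.charset.StandardCharsets;

// Scalar sketch of what StringCoding.hasNegatives tests: is any byte in
// the range outside ASCII (i.e., negative as a signed Java byte)?
public class HasNegatives {
    public static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true; // high bit set: not ASCII
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = {(byte) 0xE9}; // 'é' in Latin-1
        System.out.println(hasNegatives(ascii, 0, ascii.length));   // false
        System.out.println(hasNegatives(latin1, 0, latin1.length)); // true
    }
}
```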
> 
> Thank you again for your contribution to OpenJDK and your patience in 
> replying.
> 
> On Wed, May 18, 2022 at 10:47 PM Claes Redestad 
> <claes.redestad at oracle.com <mailto:claes.redestad at oracle.com>> wrote:
> 
>     Hi,
> 
>     so your suggestion is that String::coder would be ASCII if all
>     codepoints are 0-127, then LATIN1 if it contains at least one char in
>     the 128-255 range, otherwise UTF16? As you say, this would mean we could
>     skip the StringCoding::hasNegatives scans and similar overheads
>     when converting known-to-be-ASCII Strings to UTF-8 and other
>     ASCII-compatible encodings. But it might also complicate logic in various
>     places, so we need a clear win to motivate such work.
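
To make the scheme concrete, here is a hypothetical sketch of three-valued coder selection. The constants and method are assumptions for illustration only, not actual JDK internals:

```java
// Hypothetical three-valued coder selection; names and values assumed.
public final class CoderSketch {
    public static final byte ASCII = 0;   // all code points in 0-127
    public static final byte LATIN1 = 1;  // at least one in 128-255
    public static final byte UTF16 = 2;   // at least one above 255

    public static byte coderFor(CharSequence cs) {
        byte coder = ASCII;
        for (int i = 0; i < cs.length(); i++) {
            char c = cs.charAt(i);
            if (c > 0xFF) {
                return UTF16;       // cannot be stored one byte per char
            } else if (c > 0x7F) {
                coder = LATIN1;     // one byte per char, but not ASCII
            }
        }
        return coder;
    }
}
```

Under this scheme, converting an ASCII-coded String to UTF-8 or any other ASCII-compatible charset could reuse the backing bytes with no scan at all.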
> 
>     So the first question to answer might be how much the hasNegatives
>     calls cost us. Depending on your hardware (and JDK version) the answer
>     might be "almost nothing"!
> 
>     I did some work last year to speed up encoding and decoding ASCII-only
>     data to/from various encodings. When I was done there was no longer much
>     of a difference when encoding from String to an ASCII-compatible charset
>     [1]. Maybe a ~10% advantage for latin-1 in some microbenchmarks
>     when decoding really large strings [2].
> 
>     But with the suggested change the decoding advantage would likely
>     disappear: we maintain the invariant that equal Strings have equal
>     coders, so we'd now have to scan latin-1 encoded streams for
>     non-ASCII bytes to disambiguate between ASCII and LATIN1.
> 
>     Maybe there are some other cases that would be helped, but with some
>     likely negatives and only a modest potential win then my gut feeling is
>     that this wouldn't pay off.
> 
>     Best regards
>     Claes
> 
>     [1] https://cl4es.github.io/2021/10/17/Faster-Charset-Encoding.html
>     [2] https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html
> 
>     (All the improvements discussed in the blog entries should be available
>     since 17.0.2.)
> 
>     On 2022-05-18 14:07, Glavo wrote:
>      > After the introduction of compact strings in Java 9, a String may
>      > store its byte array encoded as either Latin-1 or UTF-16.
>      > Here's a troublesome thing: Latin1 is not compatible with UTF-8.
>      > Not all Latin1 byte strings are legal UTF-8 byte strings: they are
>      > only compatible within the ASCII range. For code points greater
>      > than 127, Latin1 uses a one-byte representation, while UTF-8
>      > requires two bytes.
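
The incompatibility is easy to demonstrate with the standard charset APIs; for a code point in the 128-255 range, Latin-1 produces one byte while UTF-8 produces two:

```java
import java.nio.charset.StandardCharsets;

// 'é' (U+00E9) is one byte in Latin-1 (0xE9) but two bytes in UTF-8
// (0xC3 0xA9), so a Latin-1 byte string containing it is not a legal
// UTF-8 byte string.
public class Latin1VsUtf8 {
    public static void main(String[] args) {
        String s = "\u00e9";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(latin1.length); // 1
        System.out.println(utf8.length);   // 2
    }
}
```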
>      >
>      > As an example, every time `JavaLangAccess::getBytesNoRepl` is called
>      > to convert a string to a UTF-8 array, the internal implementation
>      > needs to call `StringCoding.hasNegatives` to scan the content byte
>      > array and determine whether the string can be encoded as ASCII, so
>      > that array copies can be eliminated. Similar steps are performed
>      > when calling `str.getBytes(UTF_8)`.
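
The encoding step being described can be sketched roughly as follows. This is a simplified stand-in, not the actual JDK implementation, and the names are assumptions: if every backing byte is ASCII, the Latin-1 bytes are already valid UTF-8; otherwise each byte in 0x80-0xFF must expand to a two-byte sequence:

```java
import java.util.Arrays;

// Simplified sketch of encoding a Latin-1 backed String to UTF-8.
// Names are illustrative; this is not the real JDK code.
public final class EncodeSketch {
    static boolean hasNegatives(byte[] ba) {
        for (byte b : ba) {
            if (b < 0) return true;
        }
        return false;
    }

    public static byte[] encodeLatin1ToUtf8(byte[] val) {
        if (!hasNegatives(val)) {
            return Arrays.copyOf(val, val.length); // ASCII fast path
        }
        int extra = 0;
        for (byte b : val) {
            if (b < 0) extra++; // each non-ASCII byte needs one extra byte
        }
        byte[] dst = new byte[val.length + extra];
        int j = 0;
        for (byte b : val) {
            int c = b & 0xFF;
            if (c < 0x80) {
                dst[j++] = (byte) c;
            } else {
                dst[j++] = (byte) (0xC0 | (c >> 6));   // leading byte
                dst[j++] = (byte) (0x80 | (c & 0x3F)); // continuation byte
            }
        }
        return dst;
    }
}
```

With a dedicated ASCII coder, the hasNegatives scan on the fast path could be replaced by a single coder comparison.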
>      >
>      > So, is it possible to introduce a third possible value for
>      > `String::coder`: ASCII? This looks like an attractive option; if we
>      > did this, we could introduce fast paths for many methods.
>      >
>      > Of course, I know that this change is not completely free: some
>      > methods may suffer a slight performance degradation from having to
>      > check the coder; in particular, there may be an impact on the
>      > performance of StringBuilder/StringBuffer. However, given that UTF-8
>      > is by far the most commonly used file encoding, the performance
>      > benefits of fast paths are likely to apply in more scenarios. Other
>      > ASCII-compatible encodings, such as GB18030 (GBK) or the ISO 8859
>      > variants, may also benefit. And if frozen arrays were introduced
>      > into the JDK, even more scenarios could enjoy performance
>      > improvements.
>      >
>      > So I would like to ask: is it possible for the JDK to improve String
>      > storage in a similar way in the future? Has anyone explored this
>      > issue before?
>      >
>      > Sorry to bother you all, but I'm very much looking forward to the
>      > answer to this question.
> 

