Question: Is it possible to introduce ASCII coder for compact strings to optimize performance for ASCII compatible encoding?
Claes Redestad
claes.redestad at oracle.com
Thu May 19 13:40:09 UTC 2022
Yes, I think 20% is close to an upper bound, with less gain on smaller
strings due to improved cache locality. And, as you say, such a small
speed-up might not be very noticeable in practice when I/O is involved.
/Claes
On 2022-05-19 14:17, Glavo wrote:
> Thank you very much for your work, it seems that the decoding speed of
> the new version of JDK has been greatly improved.
>
> However, `hasNegatives` still has a cost on my device. I tried removing
> the call to hasNegatives in `String.encodeUTF8` based on JDK 17.0.3,
> then tested its performance with JMH. I ran the benchmark on
> two of my x86 machines (one with a Ryzen 7 3700X CPU, the other with
> a Xeon W-2175)
> and got about a 20% performance boost on both machines.
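The thread measures this with JMH; as a rough self-contained illustration (class and method names here are invented for the sketch, and a plain loop is far less reliable than JMH), the measurement boils down to timing the encode path that contains the hasNegatives scan:

```java
import java.nio.charset.StandardCharsets;

public class EncodeTiming {
    // Rough standalone timing of String -> UTF-8 encoding for pure-ASCII
    // input. The poster used JMH, which controls for warmup and JIT effects;
    // this loop only illustrates what is being measured.
    static long timeEncode(String s, int iterations) {
        byte[] sink = null;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // Internally reaches String.encodeUTF8 and its hasNegatives scan
            sink = s.getBytes(StandardCharsets.UTF_8);
        }
        long elapsed = System.nanoTime() - start;
        // For ASCII input, the UTF-8 form is byte-for-byte the same length
        if (sink.length != s.length()) throw new AssertionError();
        return elapsed / iterations; // average ns per encode
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(10_000);
        System.out.println("avg ns/op: " + timeEncode(ascii, 100_000));
    }
}
```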
>
> Of course, the decoding performance in the new JDK is better than I
> thought. For an ASCII string of length 10000, hasNegatives takes about
> 150 nanoseconds, and the duration increases linearly with the length.
> Considering that these operations usually accompany IO, the overhead is
> really not as noticeable as I thought.
>
> Thank you again for your contribution to OpenJDK and your patience in
> replying.
>
> On Wed, May 18, 2022 at 10:47 PM Claes Redestad
> <claes.redestad at oracle.com> wrote:
>
> Hi,
>
> so your suggestion is that String::coder would be ASCII if all
> codepoints are 0-127, then LATIN1 if it contains at least one char in
> the 128-255 range, otherwise UTF16? As you say, this would mean we could
> skip the StringCoding::hasNegatives scans and similar overheads
> when converting known-to-be-ASCII Strings to UTF-8 and other
> ASCII-compatible encodings. But it might also complicate logic in
> various places, so we need a clear win to motivate such work.
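The proposed three-way split can be sketched as follows (a hypothetical classifier for illustration only; the real java.lang.String stores a coder byte with just LATIN1 = 0 and UTF16 = 1, set at construction, and has no ASCII value):

```java
public class CoderSketch {
    // Hypothetical coder values mirroring the proposal; not actual JDK
    // constants.
    static final int ASCII = 0, LATIN1 = 1, UTF16 = 2;

    // Classify a string's code units into the proposed three-way coder:
    // ASCII if all chars are 0-127, LATIN1 if any char is in 128-255
    // (and none above), otherwise UTF16.
    static int classify(String s) {
        int coder = ASCII;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) return UTF16;   // cannot fit in one byte
            if (c > 0x7F) coder = LATIN1; // Latin-1 but not ASCII
        }
        return coder;
    }
}
```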
>
> So the first question to answer might be how much the hasNegatives
> calls cost us. Depending on your hardware (and JDK version) the answer
> might be "almost nothing"!
>
> I did some work last year to speed up encoding and decoding ASCII-only
> data to/from various encodings. When I was done there was no longer much
> of a difference when encoding from String to an ASCII-compatible charset
> [1]. Maybe a ~10% advantage for latin-1 in some microbenchmarks
> when decoding really large strings [2].
>
> But with the suggested change the decoding advantage would likely
> disappear: we maintain the invariant that equal Strings have equal
> coders, so we'd now have to scan latin-1 encoded
> streams for non-ASCII bytes to disambiguate between ASCII and LATIN1.
>
> Maybe there are some other cases that would be helped, but with some
> likely negatives and only a modest potential win, my gut feeling is
> that this wouldn't pay off.
>
> Best regards
> Claes
>
> [1] https://cl4es.github.io/2021/10/17/Faster-Charset-Encoding.html
> [2] https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html
>
> (All the improvements discussed in the blog entries should be available
> since 17.0.2.)
>
> On 2022-05-18 14:07, Glavo wrote:
> > After the introduction of compact strings in Java 9, the current
> > String may store byte arrays encoded as Latin1 or UTF-16.
> > Here's a troublesome thing: Latin1 is not compatible with UTF-8.
> > Not all Latin1 byte strings are legal UTF-8 byte strings:
> > they are only compatible within the ASCII range. When there are
> > code points greater than 127, Latin1 uses a one-byte
> > representation, while UTF-8 requires two bytes.
> >
> > As an example, every time `JavaLangAccess::getBytesNoRepl` is
> > called to convert a string to a UTF-8 array,
> > the internal implementation needs to call
> > `StringCoding.hasNegatives` to scan the content byte array
> > to determine whether the string can be encoded in ASCII,
> > and thus eliminate array copies.
> > Similar steps are performed when calling `str.getBytes(UTF_8)`.
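The scan in question is essentially the following (a simplified scalar sketch; the real StringCoding.hasNegatives is a HotSpot intrinsic and runs vectorized, and the helper method here is invented for illustration):

```java
import java.nio.charset.StandardCharsets;

public class AsciiScan {
    // Scalar sketch of StringCoding.hasNegatives: in a latin-1 coded byte
    // array, a negative byte means a value >= 0x80, i.e. outside ASCII.
    static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) return true;
        }
        return false;
    }

    // If the scan finds no negatives, the latin-1 bytes already form valid
    // UTF-8, so encoding can be a plain array copy. (Illustrative helper,
    // not an actual JDK method.)
    static byte[] encodeAsciiFast(byte[] latin1) {
        if (!hasNegatives(latin1, 0, latin1.length)) {
            return latin1.clone();
        }
        // Fallback: non-ASCII latin-1 chars need two UTF-8 bytes each.
        return new String(latin1, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```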
> >
> > So, is it possible to introduce a third possible value for
> > `String::coder`: ASCII?
> > This looks like an attractive option, and if we do this, we can
> > introduce fast paths for many methods.
> >
> > Of course, I know that this change is not completely free, and
> > some methods may see a slight performance degradation
> > due to the need to check the coder; in particular,
> > there may be an impact on the performance of
> > StringBuilder/StringBuffer. However, given that UTF-8 is by far
> > the most commonly used file encoding, the performance
> > benefits of the fast paths are likely to cover more scenarios. In
> > addition to this, other ASCII-compatible encodings may
> > also benefit, such as GB18030 (GBK) or the ISO 8859 variants. And if
> > frozen arrays were introduced into the JDK,
> > there would be more scenarios that could enjoy performance improvements.
> >
> > So I would like to ask: is it possible for the JDK to improve String
> > storage in a similar way in the future?
> > Has anyone explored this issue before?
> >
> > Sorry to bother you all, but I'm very much looking forward to the
> > answer to this question.
>
More information about the core-libs-dev mailing list