potential performance improvement in sun.nio.cs.UTF_8

Chen Liang chen.l.liang at oracle.com
Mon May 12 14:50:38 UTC 2025


Hi Johannes,
I think the 3rd scenario you've mentioned is the likely one in practice: we have Swedish and other languages that extend the ASCII alphabet with diacritics, so non-ASCII bytes frequently interrupt runs of ASCII. And for non-ASCII-heavy languages like Chinese, the text can still include spaces or ASCII digits; invoking the intrinsic for that scenario sounds a bit unwise too.
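
For a concrete picture of the byte patterns, a quick throwaway snippet (the sample strings are arbitrary; it just prints which UTF-8 bytes are ASCII):

import java.nio.charset.StandardCharsets;

class MixedScript {
    public static void main(String[] args) {
        dump("på svenska");  // Swedish: short ASCII runs around the 2-byte sequence for 'å'
        dump("第1章 引言");   // Chinese: 3-byte sequences around the ASCII '1' and ' '
    }

    // prints ASCII bytes as-is and non-ASCII bytes as \xHH
    static void dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8))
            sb.append(b >= 0 ? String.valueOf((char) b) : String.format("\\x%02X", b));
        System.out.println(sb);  // e.g. p\xC3\xA5 svenska
    }
}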

Regards,
Chen Liang
________________________________
From: core-libs-dev <core-libs-dev-retn at openjdk.org> on behalf of Johannes Döbler <jd at civilian-framework.org>
Sent: Monday, May 12, 2025 6:16 AM
To: core-libs-dev at openjdk.org <core-libs-dev at openjdk.org>
Subject: potential performance improvement in sun.nio.cs.UTF_8

I have a suggestion for a performance improvement in sun.nio.cs.UTF_8, the workhorse for stream-based UTF-8 encoding and decoding, but I don't know if this has been discussed before.
I'll explain my idea for the decoding case:
Claes Redestad describes in his blog https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html how he used a SIMD intrinsic (now exposed as JavaLangAccess.decodeASCII) to speed up UTF-8 decoding when buffers are backed by arrays:

https://github.com/openjdk/jdk/blob/0258d9998ebc523a6463818be00353c6ac8b7c9c/src/java.base/share/classes/sun/nio/cs/UTF_8.java#L231

  *   first, a call to JLA.decodeASCII harvests all ASCII characters (= 1-byte UTF-8 sequences) at the beginning of the input
  *   then the decoder enters the slow loop of examining UTF-8 byte sequences in the input buffer and writing them to the output buffer (this is basically the old implementation)

If the input is all ASCII, all decoding work is done in JLA.decodeASCII, resulting in an extreme performance boost. But once the input contains a non-ASCII byte, decoding falls back to the slow array loop for the rest of the input.
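
For reference, the fast path at the linked line boils down to this shape (paraphrased, with the variable names from the JDK source):

// current decodeArrayLoop: one up-front intrinsic call, then the scalar loop
int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
sp += n;   // skip the leading ASCII run
dp += n;
while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // ... scalar single-char copy; decodeASCII is never called again
    }
    // ... multi-byte sequence handling
}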

Now here is my idea: why not call JLA.decodeASCII whenever an ASCII byte is seen:

while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        // my change: instead of copying just this one ASCII char,
        // let the intrinsic harvest the whole ASCII run starting here
        int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
        sp += n;
        dp += n;
    } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
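
(Note that the loop still makes progress: b1 >= 0 guarantees the window handed to JLA.decodeASCII starts with at least one ASCII byte, and dp < dl has just been checked, so n is at least 1 on every call.)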

I set up a small improvised benchmark to get an idea of the impact:

Benchmark                     (data)   Mode  Cnt        Score   Error  Units
DecoderBenchmark.jdkDecoder  TD_8000  thrpt    2  2045960,037          ops/s
DecoderBenchmark.jdkDecoder  TD_3999  thrpt    2   263744,675          ops/s
DecoderBenchmark.jdkDecoder   TD_999  thrpt    2   154232,940          ops/s
DecoderBenchmark.jdkDecoder   TD_499  thrpt    2   142239,763          ops/s
DecoderBenchmark.jdkDecoder    TD_99  thrpt    2   128678,229          ops/s
DecoderBenchmark.jdkDecoder     TD_9  thrpt    2   127388,649          ops/s
DecoderBenchmark.jdkDecoder     TD_4  thrpt    2   119834,183          ops/s
DecoderBenchmark.jdkDecoder     TD_2  thrpt    2   111733,115          ops/s
DecoderBenchmark.jdkDecoder     TD_1  thrpt    2   102397,455          ops/s
DecoderBenchmark.newDecoder  TD_8000  thrpt    2  2022997,518          ops/s
DecoderBenchmark.newDecoder  TD_3999  thrpt    2  2909450,005          ops/s
DecoderBenchmark.newDecoder   TD_999  thrpt    2  2140307,712          ops/s
DecoderBenchmark.newDecoder   TD_499  thrpt    2  1171970,809          ops/s
DecoderBenchmark.newDecoder    TD_99  thrpt    2   686771,614          ops/s
DecoderBenchmark.newDecoder     TD_9  thrpt    2    95181,541          ops/s
DecoderBenchmark.newDecoder     TD_4  thrpt    2    65656,184          ops/s
DecoderBenchmark.newDecoder     TD_2  thrpt    2    45439,240          ops/s
DecoderBenchmark.newDecoder     TD_1  thrpt    2    36994,738          ops/s

(The benchmark uses only in-memory, array-backed buffers; each test input is a UTF-8 encoded byte buffer that decodes to 8000 chars and consists of ASCII runs of varying length, each followed by a 2-byte UTF-8 sequence producing one non-ASCII char:
TD_8000: 8000 ASCII bytes -> 1 call to JLA.decodeASCII
TD_3999: 3999 ASCII bytes + 2 non-ASCII bytes, repeated 2 times -> 2 calls to JLA.decodeASCII
...
TD_1: 1 ASCII byte + 2 non-ASCII bytes, repeated 4000 times -> 4000 calls to JLA.decodeASCII)
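
In outline, the test data can be generated like this (a sketch of the described layout, not the actual benchmark code; 'ä', which encodes to 0xC3 0xA4, stands in for any 2-byte sequence):

import java.nio.charset.StandardCharsets;

class TestData {
    // TD_n: runs of n ASCII chars, each followed by one non-ASCII char,
    // repeated until the decoded text is 8000 chars long, then UTF-8 encoded
    static byte[] td(int asciiRun) {
        StringBuilder sb = new StringBuilder(8000);
        while (sb.length() < 8000) {
            for (int i = 0; i < asciiRun && sb.length() < 8000; i++)
                sb.append('a');
            if (sb.length() < 8000)
                sb.append('ä');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}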

Interpretation:

  *   Input is all ASCII: same performance as before.
  *   Input contains pure-ASCII runs of considerable length, interrupted by non-ASCII bytes: huge performance improvements now, close to the pure-ASCII case.
  *   Input consists of many short ASCII runs interrupted by non-ASCII bytes: at some point performance drops below the current implementation (see the sketch after this list).
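
A possible mitigation for the last point (untested, just a sketch): copy a few bytes the plain scalar way first, and only hand off to the intrinsic once a short probe has seen several consecutive ASCII bytes. PROBE_LEN is a made-up, untuned constant; the fragment assumes the same surrounding variables as above and replaces the intrinsic call after the xflow check in the b1 >= 0 branch:

final int PROBE_LEN = 8;  // hypothetical cutoff, would need tuning
// copy up to PROBE_LEN ASCII bytes the cheap scalar way,
// limited by both the input and the output window
int probeEnd = Math.min(sp + PROBE_LEN, Math.min(sl, sp + (dl - dp)));
while (sp < probeEnd && sa[sp] >= 0)
    da[dp++] = (char) sa[sp++];
// only if the ASCII run is still going, pay for the intrinsic call
if (sp < sl && dp < dl && sa[sp] >= 0) {
    int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
    sp += n;
    dp += n;
}

This keeps isolated ASCII bytes inside non-ASCII text on the cheap single-char path, while long runs still reach the intrinsic.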

Thanks for reading and happy to hear your opinions,
Johannes


