<html><body><div dir="ltr"><div>
</div><div><div>
<div dir="ltr">Hello,</div><div dir="ltr"><br></div><div dir="ltr">good that you are looking into that topic. While it might not be a problem in practice (large buffers are fine, but buffers larger than 1 MiB seem rare, especially in multi-threaded apps), it is still a condition which can be handled. But with AE ciphers becoming the norm, aren't such large cipher chunks becoming legacy as well?</div><div dir="ltr"><br></div><div dir="ltr">One additional aspect: chunking the input does not only allow faster inlining, it might also help reduce the time to get the first byte into the network buffer, since the sender does not wait for the later segments to be encrypted.</div><div dir="ltr"><br></div><div dir="ltr">Can you clarify: you said JSSE, but does this actually happen in TLS usage? How big are your TLS records? Isn't there a 16 KiB limit anyway? <span dir="ltr"><br></span></div><div dir="ltr">Regards</div><div dir="ltr">Bernd</div><div id="ms-outlook-mobile-signature"><div style="direction:ltr">-- </div><div style="direction:ltr">http://bernd.eckenfels.net</div></div>
</div>
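The chunked-write idea under discussion can be sketched as below. This is a minimal illustration only, not the actual benchmark code: the 16 KiB chunk size, the AES/CTR mode, and all names are assumptions for the demo. It feeds a `Cipher` fixed-size segments via `update` instead of one large `doFinal` call, which is the pattern that lets the intrinsified fast path warm up, and it verifies that chunking does not change the ciphertext for a stream mode like CTR.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class ChunkedCipherDemo {
    // Hypothetical chunk size; the thread discusses 16 KiB segments.
    private static final int CHUNK = 16 * 1024;

    // Encrypts 'input' by feeding the cipher fixed-size chunks instead of
    // one large buffer, so the hot method is invoked many times and can
    // be JIT-compiled (and intrinsified) early.
    static byte[] encryptChunked(Cipher cipher, byte[] input) throws Exception {
        byte[] out = new byte[cipher.getOutputSize(input.length)];
        int written = 0;
        for (int off = 0; off < input.length; off += CHUNK) {
            int len = Math.min(CHUNK, input.length - off);
            written += cipher.update(input, off, len, out, written);
        }
        written += cipher.doFinal(out, written);
        return Arrays.copyOf(out, written);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        byte[] data = new byte[1 << 20]; // 1 MiB payload
        new SecureRandom().nextBytes(data);

        // Same key/IV, two strategies: one bulk doFinal vs chunked updates.
        Cipher bulk = Cipher.getInstance("AES/CTR/NoPadding");
        bulk.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] entireBuffer = bulk.doFinal(data);

        Cipher chunked = Cipher.getInstance("AES/CTR/NoPadding");
        chunked.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] chunkedOut = encryptChunked(chunked, data);

        // CTR is a stream cipher, so chunk boundaries do not affect output.
        System.out.println(Arrays.equals(entireBuffer, chunkedOut)); // true
    }
}
```

Note that for AEAD modes such as GCM the same `update`/`doFinal` pattern applies within a single message, but the authentication tag is only produced by the final call, so a chunking wrapper must not split one message into independent cipher operations.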
<div> </div><hr style="display:inline-block;width:98%" tabindex="-1"><div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif"><b>From:</b> security-dev <security-dev-retn@openjdk.org> on behalf of Carter Kozak <ckozak@ckozak.net><br><b>Sent:</b> Wednesday, October 26, 2022 5:04 PM<br><b>To:</b> security-dev@openjdk.org <security-dev@openjdk.org>; mullan@openjdk.org <mullan@openjdk.org><br><b>Subject:</b> Clogged pipes: 50x throughput degradation with large Cipher writes<div> </div></font></div><style type="text/css"><!--
p.MsoNormal, p.MsoNoSpacing
{margin:0;}
--></style><div>Continuing a conversation I had with Sean Mullan at JavaOne, for a broader audience.<br></div><div><br></div><div>We tend to believe that bulk operations are good. Large bulk operations give the system the most information at once, allowing it to make more informed decisions. With some understanding of the HotSpot compiler and of how the security components interact with it, the observed performance degradation makes sense, but I don’t think it’s obvious or desirable to most of those using the JDK. As the industry shifts toward shorter-lived and horizontally scalable instances, it becomes more important than ever to deliver cryptography performance consistently and early.<br></div><div><br></div><div>Encryption in Java is usually fast, around 2-3 GiB/second per core using the default OpenJDK JSSE provider on my test system. However, when developers use larger buffers (~10 MiB, perhaps large for networking/TLS, but reasonable for local data), I can observe throughput drop to 60 MiB/second (between 2 and 3 percent of the expected throughput!).<br></div><div><br></div><div>Results from <a href="https://github.com/carterkozak/java-crypto-buffer-performance">https://github.com/carterkozak/java-crypto-buffer-performance</a>:<br></div><pre><div>Benchmark                    (cipher)           (numBytes)  (writeStrategy)  Mode  Cnt     Score     Error  Units<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576  ENTIRE_BUFFER    thrpt    4  2215.898 ± 185.661  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760  ENTIRE_BUFFER    thrpt    4     6.427 ±   0.475  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600  ENTIRE_BUFFER    thrpt    4     0.620 ±   0.096  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576  ENTIRE_BUFFER    thrpt    4  2933.808 ±  17.538  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760  ENTIRE_BUFFER    thrpt    4    31.775 ±   1.898  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600  ENTIRE_BUFFER    thrpt    4     3.174 ±   0.171  ops/s<br></div></pre><div><br></div><div>Using <code>AES/GCM/NoPadding</code>, large buffers result in a great deal of work within <a href="https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/com/sun/crypto/provider/GHASH.java#L272-L286">GHASH.processBlocks</a>, which <a href="https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/hotspot/share/classfile/vmIntrinsics.hpp#L462-L466">is intrinsified</a>; however, the intrinsic isn’t used because the method is called infrequently, and a tremendous amount of work occurs within the default implementation. You can find notes from my initial investigation <a href="https://github.com/palantir/hadoop-crypto/pull/586#issuecomment-964394587">here (with flame graphs)</a>. When I introduce a wrapper to chunk input buffers into 16 KiB segments (other sizes <a href="https://github.com/palantir/hadoop-crypto/pull/586#issue-1047810949">tested here</a>), we can effectively force the method to warm up, and it performs nearly two orders of magnitude better:<br></div><div><br></div><div><a href="https://github.com/carterkozak/java-crypto-buffer-performance#jdk-17">https://github.com/carterkozak/java-crypto-buffer-performance#jdk-17</a><br></div><pre><div>Benchmark                    (cipher)           (numBytes)  (writeStrategy)  Mode  Cnt     Score     Error  Units<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576  ENTIRE_BUFFER    thrpt    4  2215.898 ± 185.661  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576  CHUNKED          thrpt    4  2516.770 ± 193.009  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760  ENTIRE_BUFFER    thrpt    4     6.427 ±   0.475  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760  CHUNKED          thrpt    4   246.956 ±  51.193  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600  ENTIRE_BUFFER    thrpt    4     0.620 ±   0.096  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600  CHUNKED          thrpt    4    24.633 ±   2.784  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576  ENTIRE_BUFFER    thrpt    4  2933.808 ±  17.538  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576  CHUNKED          thrpt    4  3277.374 ± 569.573  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760  ENTIRE_BUFFER    thrpt    4    31.775 ±   1.898  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760  CHUNKED          thrpt    4   332.873 ±  55.589  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600  ENTIRE_BUFFER    thrpt    4     3.174 ±   0.171  ops/s<br></div><div>EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600  CHUNKED          thrpt    4    33.909 ±   1.675  ops/s
<br></div></pre><div>The 10 MiB full-buffer benchmark is eventually partially optimized after ~3 minutes of encryption on ~10 GiB of data; in practice, however, this takes much longer because the encrypted data must also be put somewhere, potentially leading to rubber-banding over a network.<br></div><div><br></div><div>While writing this up I <a href="https://github.com/carterkozak/java-crypto-buffer-performance#jdk-19">re-ran my investigation using JDK 19</a> and found, to my surprise, that AES/GCM performed substantially better, warming up quickly, while AES/CTR performance was largely equivalent! It turns out that <a href="https://bugs.openjdk.org/browse/JDK-8273297">JDK-8273297</a>, which aimed to improve performance of an intrinsic, has the side effect of allowing the intrinsic to be reached much sooner by <a href="https://github.com/openjdk/jdk/commit/13e9ea9e922030927775345b1abde1313a6ec03f#diff-a533e78f757a3ad64a8d2453bea64a0d426890ef85031452bf74070776ad8be0R575-R596">segmenting inputs into 1 MiB chunks</a>.<br></div><div><br></div><div>I’ve intentionally avoided suggesting specific solutions; as a layperson, I don’t feel confident making explicit recommendations. My goal is reliably high throughput based on the amount of work done rather than on the size of individual operations. As a user, native implementations like <a href="https://tomcat.apache.org/native-doc/">tcnative</a> and <a href="https://github.com/google/conscrypt">Conscrypt</a> provide the performance characteristics I’m looking for, but without the reliability or flexibility of OpenJDK JSSE. Is there a solution which allows us to get the best of both worlds?<br></div><div><br></div><div>Thanks,<br></div><div>Carter Kozak<br></div></div></div></body></html>