RFR JDK-8164278: java.util.Base64.EncOutputStream/DecInputStream is slower than corresponding version in javax.mail package
Hi,

Please help review the change for JDK-8164278.

issue: https://bugs.openjdk.java.net/browse/JDK-8164278
webrev: http://cr.openjdk.java.net/~sherman/8164278/webrev
jmh.src: http://cr.openjdk.java.net/~sherman/8164278/Base64BM.java
jmh.result: http://cr.openjdk.java.net/~sherman/8164278/base64.bm

Base64.Decoder.decode0():
Adopted the "similar" optimization approach we took in Base64.Encoder.encode0() to add a "fast path" that decodes a block of 4-byte units together (the current implementation decodes a single byte per while-loop iteration). The jmh benchmark results indicate a big speed boost (the decodeArray/Mime/Url results are from 30% to 2 times faster, depending on input size).

Base64.Encoder.encode0():
It appears encode0() was fully optimized in 1.8; I can't get it any faster :-) I tried to use Unsafe.getLong/putLong instead of byte-by-byte access, but the 8-byte "vectorization" does not bring us enough speed-up; the performance is the same as the current implementation. See encode00() at http://cr.openjdk.java.net/~sherman/8164278/webrev.00

Base64.Encoder.wrap(OutputStream)/EncOutputStream.write():
If my memory serves me right, the current implementation was written under the assumption that the underlying output stream is probably buffered (if the invoker cares), so it would be redundant for EncOutputStream to buffer bytes again. That assumption appears to be wrong: it is much slower to write 4 bytes separately than to bundle them together in a byte[4] and write that into the underlying stream, even when the underlying output stream is a ByteArrayOutputStream. Again, the proposed change is to add a fast-path loop, as we do in encode0(), to encode/write a block of 3-byte -> 4-byte units. This fast loop appears to help the compiler optimize away some bounds checks and is therefore much faster. The jmh result Base64BM.encodeOS suggests the new implementation is almost 4 times faster and is almost the same as javax.mail's stream encoder.
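The 4-byte-unit fast path described above can be sketched roughly like this. This is an illustrative reduction, not the webrev code: the class name and method are hypothetical, and it ignores padding, the URL-safe alphabet, and MIME line separators.

```java
import java.util.Arrays;

// Minimal sketch of the "fast path": decode whole 4-character units in a
// tight loop, three output bytes per iteration, instead of one input byte
// per loop pass. Padding and non-unit tails are left to a slow path.
public class FastPathDecodeSketch {
    private static final int[] FROM_BASE64 = new int[256];
    static {
        Arrays.fill(FROM_BASE64, -1);
        String alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        for (int i = 0; i < alphabet.length(); i++)
            FROM_BASE64[alphabet.charAt(i)] = i;
    }

    // Decodes the whole 4-byte units of src into dst; returns bytes written.
    public static int decodeBlock(byte[] src, byte[] dst) {
        int sp = 0, dp = 0;
        int sl = src.length - (src.length % 4);   // last whole 4-byte unit
        while (sp < sl) {
            int b1 = FROM_BASE64[src[sp++] & 0xff];
            int b2 = FROM_BASE64[src[sp++] & 0xff];
            int b3 = FROM_BASE64[src[sp++] & 0xff];
            int b4 = FROM_BASE64[src[sp++] & 0xff];
            if ((b1 | b2 | b3 | b4) < 0)          // non-alphabet byte: bail out
                throw new IllegalArgumentException("non-base64 byte near " + (sp - 4));
            // Pack four 6-bit values into one int, emit three bytes.
            int bits = b1 << 18 | b2 << 12 | b3 << 6 | b4;
            dst[dp++] = (byte) (bits >> 16);
            dst[dp++] = (byte) (bits >> 8);
            dst[dp++] = (byte) bits;
        }
        return dp;
    }
}
```

Keeping the inner loop free of per-byte state tracking is what lets the JIT hoist or eliminate the bounds checks that dominate the one-byte-per-iteration version.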
Base64.Decoder.wrap(InputStream)/DecInputStream.read():
Same approach as for decode0(): add a fast path that decodes a block of 4-byte units together. The jmh result Base64BM.decodeOS (the name probably should be decodeIS, for InputStream, but anyway...) shows the proposed implementation is 4 times faster than the existing one and twice as fast as the javax.mail implementation (Base64BM.decodeOS_javamail).

However, there is a side effect of adding a buffering mechanism to DecInputStream. The current implementation reads bytes from the underlying stream one by one and never reads more bytes than it needs, which means it is supposed to stop at the last byte it needs to decode when "=" is present in the stream. With buffering, it's possible that more bytes (after the "=", which indicates "end of base64 stream") might be read/consumed and buffered. Is this a concern? If it is, the only alternative might be to add a separate method to support this "faster buffered decoder"?

Thanks,
Sherman
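For readers unfamiliar with why over-reading matters: a caller may decode a base64 prefix through the wrapped stream and then continue reading the underlying stream itself, relying on the decoder stopping exactly at the padding. A minimal illustration (the helper name is hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;

// A caller that decodes a base64 prefix and expects any bytes after the
// '=' padding to remain unread in the underlying stream. A buffered
// DecInputStream could silently consume those trailing bytes.
public class OverReadExample {
    public static String decodePrefix(InputStream underlying) throws IOException {
        InputStream dec = Base64.getDecoder().wrap(underlying);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = dec.read()) != -1)
            sb.append((char) b);
        return sb.toString();   // decoded payload; trailing bytes stay behind
    }
}
```

With input like "TWE=;rest", the non-buffering decoder hands back "Ma" and leaves ";rest" available to whoever reads the underlying stream next; a buffering decoder might have already consumed it.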
Hi Sherman, On 2/5/2018 9:00 PM, Xueming Shen wrote:
Hi,
Please help review the change for JDK-8164278.
issue: https://bugs.openjdk.java.net/browse/JDK-8164278
webrev: http://cr.openjdk.java.net/~sherman/8164278/webrev

Are the reentrant locks necessary? Concurrent reads from streams are not usually synchronized, so it's the caller that needs to synchronize. If locks are necessary, why is there no lock for the EncOutputStream buffer?
809: Can the buffer byte array be sized based on linemax? The field declaration should be with the other fields at the top of the file.

848: checkNewline compares == with linemax; that works when each byte is counted separately, but it seems it would be safer if it were ">=".

957: Can sbBuf be shared with lbuf?

More on the input buffering question below.
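One way the linemax-based sizing could work; this is a hypothetical calculation, not taken from the webrev. The idea is to size the buffer to a whole number of encoded lines so a line separator never straddles a flush boundary.

```java
// Hypothetical buffer sizing for a line-wrapping base64 encoder stream:
// hold an integral number of encoded lines (linemax chars + separator),
// so each flush ends exactly on a line boundary.
public class LineBufferSizing {
    public static int bufferSize(int linemax, int sepLen, int minLines) {
        if (linemax <= 0)                    // no line wrapping: any 4-multiple
            return 8192;
        int line = linemax + sepLen;         // one encoded line incl. "\r\n"
        int lines = Math.max(minLines, 8192 / line);
        return lines * line;                 // whole lines, close to 8 KB
    }
}
```

For the MIME defaults (linemax 76, "\r\n" separator) this yields a buffer of 105 lines, 8190 bytes.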
jmh.src: http://cr.openjdk.java.net/~sherman/8164278/Base64BM.java jmh.result: http://cr.openjdk.java.net/~sherman/8164278/base64.bm
Base64.Decoder.decode0(): Adopted the "similar" optimization approach we took in Base64.Encoder.encode0() to add a "fast path" that decodes a block of 4-byte units together (the current implementation decodes a single byte per while-loop iteration). The jmh benchmark results indicate a big speed boost (the decodeArray/Mime/Url results are from 30% to 2 times faster, depending on input size).
:)
However, there is a side effect of adding a buffering mechanism to DecInputStream. The current implementation reads bytes from the underlying stream one by one and never reads more bytes than it needs, which means it is supposed to stop at the last byte it needs to decode when "=" is present in the stream. With buffering, it's possible that more bytes (after the "=", which indicates "end of base64 stream") might be read/consumed and buffered. Is this a concern? If it is, the only alternative might be to add a separate method to support this "faster buffered decoder"?
How much buffering is needed to speed it up? Can the mark/reset functions of the underlying stream be used to back up the stream if it overshoots? If mark and reset are not supported, then read 1 byte at a time.

Regards, Roger
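A rough sketch of the mark/reset fallback Roger suggests; the helper and its name are hypothetical, not proposed code. It reads a chunk speculatively, then resets and re-skips only the bytes actually decoded, so any overshoot past the '=' stays in the stream for the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Speculative read with mark/reset: grab up to max bytes, rewind, then
// consume only the 'used' bytes the decoder needed. The overshoot
// (n - used) remains unread in the underlying stream.
public class MarkResetSketch {
    public static byte[] speculativeRead(InputStream in, int max, int used)
            throws IOException {
        if (!in.markSupported())
            throw new IOException("mark/reset unsupported; read 1 byte at a time");
        in.mark(max);
        byte[] buf = new byte[max];
        int n = in.read(buf, 0, max);      // may read past the '=' padding
        if (n < 0) n = 0;                  // EOF: nothing was read
        in.reset();                        // rewind the speculative read
        int consumed = Math.min(n, used);
        in.skip(consumed);                 // re-consume only the decoded bytes
        return Arrays.copyOf(buf, consumed);
    }
}
```

ByteArrayInputStream and BufferedInputStream support mark/reset, so wrapping an arbitrary stream in a BufferedInputStream would make this path available; raw sockets and files generally would not support it directly.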
Thanks, Sherman
On 2/6/18, 8:28 AM, Roger Riggs wrote:
issue: https://bugs.openjdk.java.net/browse/JDK-8164278 webrev: http://cr.openjdk.java.net/~sherman/8164278/webrev Are the reentrant locks necessary? concurrent reads from streams are not usually synchronized so its the caller that need to synchronize. If locks are necessary, why no lock for the EncOutputStream buffer?
Thanks Roger,

I don't really care about the "correctness" of the en/decoding via the in/output stream under concurrent access. As you said, the caller should take responsibility for that. In the case of DecInputStream there is a risk that "pos" might be updated past the buffer boundary and trigger an exception under concurrent access. That's not the case for EncOutputStream. But yes, let me see if I can get away with this via a local copy of pos/cnt.
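The local-copy idiom Sherman alludes to might look like this; an illustrative sketch, not DecInputStream itself. The shared fields are read once into locals, all indexing is bounded by the locals, and the state is written back once, so a racy caller may get garbled data but cannot push an index past the buffer.

```java
// Sketch of the "local copy of pos/cnt" idiom: snapshot shared fields
// into locals so bounds depend only on values that cannot change
// mid-call, then write the new position back in a single store.
public class LocalCopyIdiom {
    private final byte[] buf;
    private int pos, cnt;

    public LocalCopyIdiom(byte[] data) { buf = data; cnt = data.length; }

    public int read(byte[] dst, int off, int len) {
        int p = pos, c = cnt;                    // single read of shared state
        if (p >= c) return -1;
        int n = Math.min(len, c - p);
        System.arraycopy(buf, p, dst, off, n);   // bounds use only the locals
        pos = p + n;                             // single write-back
        return n;
    }
}
```

This avoids locking on the stream while still keeping unsynchronized misuse from throwing ArrayIndexOutOfBoundsException.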
809: Can the buffer byte array be sized based on linemax? The field declaration should be with the other fields at the top of the file.
Sure, that's possible with some calculation.
848: checkNewline compares == with linemax; that works when each byte is counted separately It seems like it would safer if it was ">=".
updated.
957: can sbBuf be shared with lbuf?
Probably not. lbuf is for the bytes read from the underlying input stream; sbBuf stores the first byte of the decoded result.
However, there is a side effect of adding a buffering mechanism to DecInputStream. The current implementation reads bytes from the underlying stream one by one and never reads more bytes than it needs, which means it is supposed to stop at the last byte it needs to decode when "=" is present in the stream. With buffering, it's possible that more bytes (after the "=", which indicates "end of base64 stream") might be read/consumed and buffered. Is this a concern? If it is, the only alternative might be to add a separate method to support this "faster buffered decoder"?
How much buffering is needed to speed it up? Can the mark/reset functions of the underlying stream be used to backup the stream if it overshoots? If mark and reset are not supported then read 1 byte at a time.
Will take a look at the mark/reset possibility. Maybe I will move the DecInputStream optimization to a separate RFE.

-Sherman
Hi Roger,

Given the concern about the possible incompatible behavior change of over-reading bytes from the underlying stream, I decided to give up the last proposed changes to DecInputStream for now. With some "minor" cleanup and tuning I still see about a 10%+ improvement with various input-size samplings. Let's address the decoder stream in a separate RFE later, if desired.

The updated webrev and jmh results are at

http://cr.openjdk.java.net/~sherman/8164278/webrev
http://cr.openjdk.java.net/~sherman/8164278/base64.bm

The last version of the webrev and the corresponding jmh results can be found at

http://cr.openjdk.java.net/~sherman/8164278/webrev.02
http://cr.openjdk.java.net/~sherman/8164278/base64.bm.old

Thanks,
Sherman
Hi Sherman,

I found updates in http://cr.openjdk.java.net/~sherman/8164278/webrev.04 and that looks fine.

Typo: line 1008: "neve"

Roger

On 2/7/2018 10:32 PM, Xueming Shen wrote:
Hi Roger,
Given the concern about the possible incompatible behavior change of over-reading bytes from the underlying stream, I decided to give up the last proposed changes to DecInputStream for now. With some "minor" cleanup and tuning I still see about a 10%+ improvement with various input-size samplings. Let's address the decoder stream in a separate RFE later, if desired.
The updated webrev and the jmh results are at
http://cr.openjdk.java.net/~sherman/8164278/webrev http://cr.openjdk.java.net/~sherman/8164278/base64.bm
The last version of the webrev and the corresponding jmh results can be found at http://cr.openjdk.java.net/~sherman/8164278/webrev.02 http://cr.openjdk.java.net/~sherman/8164278/base64.bm.old
Thanks, Sherman
participants (2)
-
Roger Riggs
-
Xueming Shen