RFR 8025003: Base64 should be less strict with padding

Tue Nov 12 22:39:19 UTC 2013

Hi Bill,

I'm still not convinced that the Base64.Decoder should guess the missing bits
(filling them with zeros) for the last dangling single byte in the input stream,
even in lenient mode. That said, I understand that being able to decode such a
malformed base64 byte stream may really be needed in real-world use.
I am trying to find a solution that addresses the real issue without
compromising the integrity of the input data. There is an "advanced" decode
method, Base64.Decoder.decode(ByteBuffer, ByteBuffer), which is currently
specified and implemented as

          * <p> If the input buffer is not in valid Base64 encoding scheme
          * then some bytes may have been written to the output buffer
          * before IllegalArgumentException is thrown. The positions of
          * both input and output buffer will not be advanced in this case.
          *

So, if the stream is malformed, the current implementation of the decode()
method throws an IAE and resets the in and out buffers back to their original
positions (throwing away the bytes decoded so far).
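
To make the all-or-nothing behavior concrete, here is a minimal sketch. It
assumes the byte[]/String decode overloads of java.util.Base64 rather than the
two-buffer method discussed here, since that overload may not be present in a
given JDK build. The class and the tryDecode helper are hypothetical names
used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class StrictDecodeDemo {
    // Hypothetical helper: returns the decoded bytes, or null when the
    // decoder rejects the input with an IllegalArgumentException.
    static byte[] tryDecode(String encoded) {
        try {
            return Base64.getMimeDecoder().decode(encoded);
        } catch (IllegalArgumentException e) {
            // Nothing decoded before the error is available to the caller.
            return null;
        }
    }

    public static void main(String[] args) {
        // "QUJDRA" is "ABCD" with its "==" padding missing: accepted.
        byte[] unpadded = tryDecode("QUJDRA");
        System.out.println(new String(unpadded, StandardCharsets.US_ASCII));

        // "QUJDR" ends in one dangling character (only 6 bits): rejected,
        // and the three bytes decoded from "QUJD" are lost with it.
        System.out.println(tryDecode("QUJDR") == null);
    }
}
```

The second case is the one under discussion: the first four characters are
perfectly good data, but the caller gets nothing back.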

It might be reasonable to change it to

          * <p> The decoding operation will stop and return {@code -1} if
          * the input buffer is not in valid Base64 encoding scheme and
          * the malformed-input error has been detected. The malformed
          * bytes begin at the input buffer's current (possibly advanced)
          * position. The output buffer's position will be advanced to
          * reflect the bytes written so far.

which means that when there is a malformed byte sequence, instead of throwing
an IAE it now returns -1 "normally" (better flow control?) and leaves the
positions of the input and output buffers where decoding stopped. So you
can recover the decoded result from the output buffer, and find out where
the malformed byte sequence starts (if desired).

     ByteBuffer src = ByteBuffer.wrap(src_byte_array);
     ByteBuffer dst = ByteBuffer.wrap(dst_byte_array);
     int ret = dec.decode(src, dst);
     dst.flip();
     // do something for the resulting bytes

     if (ret < 0) {
         // do something for the malformed bytes in src
     }

Instead of -1, the return value can be a "negative value" of the length
of the bytes written to the output buffer, if really needed. Though the
"position" and "limit" of the ByteBuffer should provide enough info for
the access.
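
For illustration only, the proposed contract can be sketched with a small
hand-written decoder. This is not the webrev implementation: the class name
and its decode method are hypothetical, the input is assumed to contain no
line separators, dst is assumed to be large enough, and the sketch reports
the malformed position at 4-character unit granularity.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RecoverableDecodeSketch {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical helper, not a JDK method. Returns the number of bytes
    // written, or -1 on malformed input. On -1, src is positioned at the
    // start of the malformed unit and dst keeps the bytes decoded before it.
    static int decode(ByteBuffer src, ByteBuffer dst) {
        int written = 0;
        while (src.hasRemaining()) {
            int unitStart = src.position();
            int bits = 0;
            int count = 0;
            boolean padded = false;
            // Collect up to four 6-bit values for one unit.
            while (count < 4 && src.hasRemaining()) {
                int c = src.get() & 0xff;
                if (c == '=') { padded = true; break; }
                int v = ALPHABET.indexOf(c);
                if (v < 0) { src.position(unitStart); return -1; }
                bits = (bits << 6) | v;
                count++;
            }
            // A lone character carries only 6 bits; do not guess the rest.
            if (count < 2) { src.position(unitStart); return -1; }
            bits <<= 6 * (4 - count);            // left-align to 24 bits
            for (int i = 0; i < count - 1; i++) {
                dst.put((byte) (bits >> (16 - 8 * i)));
                written++;
            }
            if (padded) {
                // Skip any remaining padding and treat it as end of data.
                while (src.hasRemaining() && src.get(src.position()) == '=') {
                    src.get();
                }
                break;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap("QUJDR".getBytes(StandardCharsets.US_ASCII));
        ByteBuffer dst = ByteBuffer.allocate(16);
        int ret = decode(src, dst);
        dst.flip();
        System.out.println("ret=" + ret
            + ", decoded=" + StandardCharsets.US_ASCII.decode(dst)
            + ", malformed input starts at " + src.position());
    }
}
```

With this shape the caller decides what to do with the leftover bytes in src,
including zero-filling them if that is really wanted, while the decoder itself
never invents data.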

The error recovery mechanism appears to work perfectly for your use
scenario :-) The "only" downside/inconvenience is that you will need to
wrap your byte array input/output with the java.nio ByteBuffer (which is
our recommended replacement for byte[] + length + offset anyway).
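
The wrapping itself is a one-liner. A minimal sketch, assuming the
single-argument Base64.Decoder.decode(ByteBuffer) overload; the class and the
decodeSlice helper are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class WrapSliceDemo {
    // Hypothetical helper: decodes buf[offset .. offset+length) without
    // copying the encoded bytes into a separate array first.
    static String decodeSlice(byte[] buf, int offset, int length) {
        // wrap() sets position = offset and limit = offset + length.
        ByteBuffer src = ByteBuffer.wrap(buf, offset, length);
        ByteBuffer out = Base64.getMimeDecoder().decode(src);
        return StandardCharsets.US_ASCII.decode(out).toString();
    }

    public static void main(String[] args) {
        byte[] buf = "xxQUJDyy".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeSlice(buf, 2, 4));
    }
}
```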

http://cr.openjdk.java.net/~sherman/base64_malformed/webrev/

Opinion?

Thanks!
-Sherman

On 11/08/2013 02:35 PM, Bill Shannon wrote:
> Have you had a chance to think about this?  Can the MIME decoder be made
> more lenient, or can I get an option to control this?
>
> Bill Shannon wrote on 10/25/13 15:24:
>> Xueming Shen wrote on 10/25/13 15:19:
>>> On 10/25/13 2:19 PM, Bill Shannon wrote:
>>>> If I understand this correctly, this proposes to remove the "lenient"
>>>> option we've been discussing and just make it always lenient.  Is that
>>>> correct?
>>> Yes. Only for the mime type though.
>> That's fine.
>>
>>>> Unfortunately, from what you say below, it's still not lenient enough.
>>>> I'd really like a version that never, ever, for any reason, throws an
>>>> exception.  Yes, that means when you only get a final 6 bits of data
>>>> you have to make an assumption about what was intended, probably padding
>>>> it with zeros to 8 bits.
>>> This is something I'm hesitated to do. I can be lenient for the padding
>>> end because the padding character itself is not the real "data", whether
>>> or not it's present, it's missing or it's incorrect/incomplete, it does
>>> not impact the integrity of the data. But to feed the last 6 bits with
>>> zero, is really kinda of guessing, NOT decoding.
>> I understand.  And if the people who write spamming software knew how to
>> read a spec, we wouldn't have this problem!  :-)
>>
>> Still, there's a lot of bad data out there on the internet, and people
>> want the software to do the best job it can to interpret the data.  It's
>> better to guess at the missing 2 bits of data than to lose the last 6 bits
>> of data.