RFR 8025003: Base64 should be less strict with padding

Tue Nov 12 23:32:49 UTC 2013

This still seems like an inconsistent, and inconvenient, approach to me.

You've decided that some encoding errors (i.e., missing pad characters)
can be ignored.  You're willing to assume that the missing characters aren't
missing data but just missing padding.  But if you find a padding character
where you don't expect it you won't assume that the missing data is zero.

These errors, and others, could represent actual missing data that the
user of the library should be notified about.  As it is, you're not
implementing "strict" and you're not implementing "lenient", you're
implementing "half-lenient".  You might as well implement strict and
require the user of the library to always wrap it in a way that all
errors are ignored in order to provide "lenient".

That might be ok if it were easier to wrap the library to ignore errors.
Perhaps a FilterInputStream could be written that would ignore the errors?
Perhaps you could even provide such a stream?

But that won't help with the non-stream methods.  The ByteBuffer approach
is pretty klunky.  If you think it's reasonable, it would help a lot to
include it as an example in the javadocs.

The fundamental problem here is that there's really nothing reasonable to
do if the data is not properly encoded.  You either reflect the error to
the user and tell them they can't have the data because it was corrupted,
or you make a best effort to give them as much data as you can and let the
user decide if it was corrupted.  It's not as if someone is going to write
a program that looks at the corrupt data and "corrects" it based on how
it was corrupted.

Xueming Shen wrote on 11/12/13 14:39:
> Hi Bill,
> 
> I'm still not convinced that the Base64.Decoder should guess the missing bits
> (to fill with zero) for the last dangling single byte in the input stream, even in
> lenient mode. That said I understand it might be really desired to still be able
> to decode such malformed base64 byte stream in the real world use scenario.
> I am trying to find a solution that can addresses the real issue without
> compromising the integrity of the input data. There is an "advanced" decode
> method Base64.Decoder(ByteBuffer, ByteBuffer), it is currently specified and
> implemented as
> 
> 
>          * <p> If the input buffer is not in valid Base64 encoding scheme
>          * then some bytes may have been written to the output buffer
>          * before IllegalArgumentException is thrown. The positions of
>          * both input and output buffer will not be advanced in this case.
>          *
> 
> So, if the stream is malformed, the current implementation of decode()
> method throws IAE and reset the in and out buffer back to their original
> position (throw away the decoded resulting bytes)
> 
> It might be reasonable to change it to
> 
>          * <p> The decoding operation will stop and return {@code -1} if
>          * the input buffer is not in valid Base64 encoding scheme and
>          * the malformed-input error has been detected. The malformed
>          * bytes begin at the input buffer's current (possibly advanced)
>          * position. The output buffer's position will be advanced to
>          * reflect the bytes written so far.
> 
> which means when there is malformed byte sequence, instead of throwing
> an IAE it now returns -1 "normally" (better flow control?) and leaves the
> positions of the input and output buffer at the place where it stops. So you
> can recover the decoded result from the output buffer, and find out where
> the malformed byte sequence starts (if desirable)
> 
>     ByteBuffer src = ByteBuffer.wrap(src_byte_array);
>     ByteBuffer dst = ByteBuffer.wrap(dst_byte_array);
>     int ret = dec.decode(src, dst);
>     dst.flip();
>     // do something for the resulting bytes
> 
>     if (ret < 0) {
>         // do something for the malformed bytes in src
>     }
> 
> Instead of -1, the return value can be a "negative value" of the length
> of the bytes written to the output buffer, if really needed. Though the
> "position" and "limit" of the ByteBuffer should provide enough info for
> the access.
> 
> The error recovery mechanism appears to work perfectly for your use
> scenario:-) the "only" downside"/inconvenience is that you will need to
> wrap your byte array input/output with the java.nio ByteBuffer (which is
> out recommended replacement for byte[] + length + offset anyway).
> 
> http://cr.openjdk.java.net/~sherman/base64_malformed/webrev/
> 
> Opinion?
> 
> Thanks!
> -Sherman
> 
> 
> On 11/08/2013 02:35 PM, Bill Shannon wrote:
>> Have you had a chance to think about this?  Can the MIME decoder be made
>> more lenient, or can I get an option to control this?
>>
>> Bill Shannon wrote on 10/25/13 15:24:
>>> Xueming Shen wrote on 10/25/13 15:19:
>>>> On 10/25/13 2:19 PM, Bill Shannon wrote:
>>>>> If I understand this correctly, this proposes to remove the "lenient"
>>>>> option we've been discussing and just make it always lenient.  Is that
>>>>> correct?
>>>> Yes. Only for the mime type though.
>>> That's fine.
>>>
>>>>> Unfortunately, from what you say below, it's still not lenient enough.
>>>>> I'd really like a version that never, ever, for any reason, throws an
>>>>> exception.  Yes, that means when you only get a final 6 bits of data
>>>>> you have to make an assumption about what was intended, probably padding
>>>>> it with zeros to 8 bits.
>>>> This is something I'm hesitated to do. I can be lenient for the padding end
>>>> because the
>>>> padding character itself is not the real "data", whether or not it's present,
>>>> it's missing or
>>>> it's incorrect/incomplete, it does not impact the integrity of the data. But to
>>>> feed the last
>>>> 6 bits with zero, is really kinda of guessing, NOT decoding.
>>> I understand.  And if the people who write spamming software knew how to
>>> read a spec, we wouldn't have this problem!  :-)
>>>
>>> Still, there's a lot of bad data out there on the internet, and people
>>> want the software to do the best job it can to interpret the data.  It's
>>> better to guess at the missing 2 bits of data than to lose the last 6 bits
>>> of data.
>