RFR 8025003: Base64 should be less strict with padding

Wed Nov 13 00:25:03 UTC 2013

On 11/12/2013 03:32 PM, Bill Shannon wrote:
> This still seems like an inconsistent, and inconvenient, approach to me.
>
> You've decided that some encoding errors (i.e., missing pad characters)
> can be ignored.  You're willing to assume that the missing characters aren't
> missing data but just missing padding.  But if you find a padding character
> where you don't expect it you won't assume that the missing data is zero.

"missing pad characters" in theory is not an encoding errors. As the RFC suggested, the
use of padding in base64 data is not required or used. They mainly serve the purpose of
providing the indication of "end of the data". This is why the padding character(s) is not
required (optional) by our decoder at first place. However, if the padding character(s) is
present, they need to be correctly encoded, otherwise, it's a malformed base64 stream.

To address your strong request fore more "lenient" MIME decoder, we have updated the
spec and implementation to be a reasonable liberal for the incorrect padding at the end
of the mime base64 data as showed below

      xxxx =       unnecessary padding character at the end of encoded stream
      xxxx xx=     missing the last padding character
      xxxx xx=y    missing the last padding character, instead having a non-padding char

With the assumption that it still follows the "spirit" of the purpose of padding
character (as suggested by the RFC), to indicate the end of the data stream, no more
decoding is needed beyond the padding character. Yes, it makes the MIME decoder somewhat
"inconsistent" with our original design and the rest of other type of decoders, but we
thought it might provide the "convenience" requested.

But a single tangling byte at the end of the encoded data stream is obvious an encoding
error or transportation error. As I said, I don't think the decoder should try to rescue with
guess. The proposed change is to try to provide a simple mechanism that the application
can do some lifecircle/error management to recovery from the malformed data stream, if
desired. This is actually is NOT what j.u.Base64 is desired for. The primary goal is to provide
a set of easy/simple utility methods for base64 encoding/decoding, not such complicated
error recovery management, as the java.nio.charset.De/Encoder provides.

The JavaDoc definitely can be improved to provide a detailed use case, sample, if it
helps. But if it's definitely a no-go, maybe we can leave this for jdk9 for bigger surgery.

-Sherman.

> These errors, and others, could represent actual missing data that the
> user of the library should be notified about.  As it is, you're not
> implementing "strict" and you're not implementing "lenient", you're
> implementing "half-lenient".  You might as well implement strict and
> require the user of the library to always wrap it in a way that all
> errors are ignored in order to provide "lenient".
>
> That might be ok if it were easier to wrap the library to ignore errors.
> Perhaps a FilterInputStream could be written that would ignore the errors?
> Perhaps you could even provide such a stream?
>
> But that won't help with the non-stream methods.  The ByteBuffer approach
> is pretty klunky.  If you think it's reasonable, it would help a lot to
> include it as an example in the javadocs.
>
> The fundamental problem here is that there's really nothing reasonable to
> do if the data is not properly encoded.  You either reflect the error to
> the user and tell them they can't have the data because it was corrupted,
> or you make a best effort to give them as much data as you can and let the
> user decide if it was corrupted.  It's not as if someone is going to write
> a program that looks at the corrupt data and "corrects" it based on how
> it was corrupted.
>
> Xueming Shen wrote on 11/12/13 14:39:
>> Hi Bill,
>>
>> I'm still not convinced that the Base64.Decoder should guess the missing bits
>> (to fill with zero) for the last dangling single byte in the input stream, even in
>> lenient mode. That said I understand it might be really desired to still be able
>> to decode such malformed base64 byte stream in the real world use scenario.
>> I am trying to find a solution that can addresses the real issue without
>> compromising the integrity of the input data. There is an "advanced" decode
>> method Base64.Decoder(ByteBuffer, ByteBuffer), it is currently specified and
>> implemented as
>>
>>
>>           *<p>  If the input buffer is not in valid Base64 encoding scheme
>>           * then some bytes may have been written to the output buffer
>>           * before IllegalArgumentException is thrown. The positions of
>>           * both input and output buffer will not be advanced in this case.
>>           *
>>
>> So, if the stream is malformed, the current implementation of decode()
>> method throws IAE and reset the in and out buffer back to their original
>> position (throw away the decoded resulting bytes)
>>
>> It might be reasonable to change it to
>>
>>           *<p>  The decoding operation will stop and return {@code -1} if
>>           * the input buffer is not in valid Base64 encoding scheme and
>>           * the malformed-input error has been detected. The malformed
>>           * bytes begin at the input buffer's current (possibly advanced)
>>           * position. The output buffer's position will be advanced to
>>           * reflect the bytes written so far.
>>
>> which means when there is malformed byte sequence, instead of throwing
>> an IAE it now returns -1 "normally" (better flow control?) and leaves the
>> positions of the input and output buffer at the place where it stops. So you
>> can recover the decoded result from the output buffer, and find out where
>> the malformed byte sequence starts (if desirable)
>>
>>      ByteBuffer src = ByteBuffer.wrap(src_byte_array);
>>      ByteBuffer dst = ByteBuffer.wrap(dst_byte_array);
>>      int ret = dec.decode(src, dst);
>>      dst.flip();
>>      // do something for the resulting bytes
>>
>>      if (ret<  0) {
>>          // do something for the malformed bytes in src
>>      }
>>
>> Instead of -1, the return value can be a "negative value" of the length
>> of the bytes written to the output buffer, if really needed. Though the
>> "position" and "limit" of the ByteBuffer should provide enough info for
>> the access.
>>
>> The error recovery mechanism appears to work perfectly for your use
>> scenario:-) the "only" downside"/inconvenience is that you will need to
>> wrap your byte array input/output with the java.nio ByteBuffer (which is
>> out recommended replacement for byte[] + length + offset anyway).
>>
>> http://cr.openjdk.java.net/~sherman/base64_malformed/webrev/
>>
>> Opinion?
>>
>> Thanks!
>> -Sherman
>>
>>
>> On 11/08/2013 02:35 PM, Bill Shannon wrote:
>>> Have you had a chance to think about this?  Can the MIME decoder be made
>>> more lenient, or can I get an option to control this?
>>>
>>> Bill Shannon wrote on 10/25/13 15:24:
>>>> Xueming Shen wrote on 10/25/13 15:19:
>>>>> On 10/25/13 2:19 PM, Bill Shannon wrote:
>>>>>> If I understand this correctly, this proposes to remove the "lenient"
>>>>>> option we've been discussing and just make it always lenient.  Is that
>>>>>> correct?
>>>>> Yes. Only for the mime type though.
>>>> That's fine.
>>>>
>>>>>> Unfortunately, from what you say below, it's still not lenient enough.
>>>>>> I'd really like a version that never, ever, for any reason, throws an
>>>>>> exception.  Yes, that means when you only get a final 6 bits of data
>>>>>> you have to make an assumption about what was intended, probably padding
>>>>>> it with zeros to 8 bits.
>>>>> This is something I'm hesitated to do. I can be lenient for the padding end
>>>>> because the
>>>>> padding character itself is not the real "data", whether or not it's present,
>>>>> it's missing or
>>>>> it's incorrect/incomplete, it does not impact the integrity of the data. But to
>>>>> feed the last
>>>>> 6 bits with zero, is really kinda of guessing, NOT decoding.
>>>> I understand.  And if the people who write spamming software knew how to
>>>> read a spec, we wouldn't have this problem!  :-)
>>>>
>>>> Still, there's a lot of bad data out there on the internet, and people
>>>> want the software to do the best job it can to interpret the data.  It's
>>>> better to guess at the missing 2 bits of data than to lose the last 6 bits
>>>> of data.