RFR 8025003: Base64 should be less strict with padding

Wed Nov 13 18:35:06 UTC 2013

Xueming Shen wrote on 11/13/13 08:28:
> On 11/12/13 11:44 PM, Bill Shannon wrote:
>> Xueming Shen wrote on 11/12/2013 09:24 PM:
>>> On 11/12/13 8:21 PM, Bill Shannon wrote:
>>>> Xueming Shen wrote on 11/12/2013 04:25 PM:
>>>>> On 11/12/2013 03:32 PM, Bill Shannon wrote:
>>>>>> This still seems like an inconsistent, and inconvenient, approach to me.
>>>>>>
>>>>>> You've decided that some encoding errors (i.e., missing pad characters)
>>>>>> can be ignored.  You're willing to assume that the missing characters aren't
>>>>>> missing data but just missing padding.  But if you find a padding character
>>>>>> where you don't expect it you won't assume that the missing data is zero.
>>>>> "Missing pad characters" is, in theory, not an encoding error. As the RFC
>>>>> suggests, the use of padding in base64 data is not required or used. The
>>>>> padding characters mainly serve to indicate the "end of the data". This is
>>>>> why the padding character(s) are not required (optional) by our decoder in
>>>>> the first place. However, if padding character(s) are present, they need to
>>>>> be correctly encoded; otherwise, it's a malformed base64 stream.
>>>> I think we're interpreting the spec differently.
>>> I meant to say "The RFC says the use of padding in base64 data is not required
>>> nor used, in some circumstances".
>>> I interpret that as padding being optional in some circumstances.
>> It's never optional.  There are two specific cases in which it's required
>> and one specific case in which it is not present.
>
> My apologies, it appears we are not talking about the same thing. What I'm
> trying to say is that whether or not to USE the padding character "=" is
> optional for base encoding "IN SOME CIRCUMSTANCES".  Maybe it's clearer to
> just cite the original wording here:
>
>    In some circumstances, the use of padding ("=") in base encoded data
>    is not required nor used.  In the general case, when assumptions on
>    size of transported data cannot be made, padding is required to yield
>    correct decoded data.
>    Implementations MUST include appropriate pad characters at the end of
>    encoded data unless the specification referring to this document
>    explicitly states otherwise.
I don't know what you're quoting from, but that's not in RFC 2045 where base64
is defined for MIME.  RFC 2045 is pretty clear about when the padding character
must or must not be present.
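
For illustration, the three cases can be seen by encoding one, two, and three
bytes. This is only a minimal sketch, assuming the java.util.Base64 API as it
shipped in JDK 8; the class name PaddingCases and the helper encode() are made
up for this example and are not part of the patch under review.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class PaddingCases {
    // Hypothetical helper: encodes ASCII text with the JDK's MIME (RFC 2045) encoder.
    static String encode(String text) {
        return Base64.getMimeEncoder()
                     .encodeToString(text.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(encode("A"));   // 1 byte in the final group:  two pad characters
        System.out.println(encode("AB"));  // 2 bytes in the final group: one pad character
        System.out.println(encode("ABC")); // complete 3-byte group:      no padding
    }
}
```

The first two calls should produce QQ== and QUI=, the two cases where padding
is required, and the third should produce QUJD with no padding at all.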

> My interpretation is that for some types/styles of Base64 implementation it
> is optional to generate the "padding" character at the end of the encoding
> operation.
I think those would be non-MIME uses.

> Though the RFC requires that if an implementation omits the padding character,
> it needs to explicitly specify this in its spec.
>
> When encoding, the existing implementation, by default, always adds the
> padding characters at the end of the encoded stream, if needed (for xx==,
> xxx=). The decoder tries to be "liberal"/lenient in what it accepts (with the
> assumption that the encoded data may come from some encoder that does not
> generate the padding characters), so it accepts data with padding and data
> without padding. However, it requires that if padding characters are used,
> they need to be CORRECTLY encoded. That was the original specification and
> implementation. Upon your original request, I made the compromise of giving
> the MIME type a more liberal spec/implementation for "incorrect" padding
> character combinations, as shown below:
>
> Possible patterns of an incorrectly padded final base64 unit are:
>
>     xxxx =       unnecessary padding character at the end of encoded stream
>     xxxx xx=     missing the last padding character
>     xxxx xx=y    missing the last padding character, instead having a non-padding char
> Now it appears this compromise became part of your complaint.
No, my complaint is that you missed one case "xxxx x".
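
To make the "xxxx x" case concrete, here is a small sketch that feeds a padded
unit, an unpadded unit, and a dangling single character to a decoder. It
assumes the basic decoder from java.util.Base64 as released in JDK 8, not the
MIME decoder being changed in this review, whose handling may differ; the
class DanglingUnit and the helper tryDecode() are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DanglingUnit {
    // Hypothetical helper: returns the decoded ASCII text,
    // or the string "rejected" if the decoder throws.
    static String tryDecode(String encoded) {
        try {
            byte[] raw = Base64.getDecoder().decode(encoded);
            return new String(raw, StandardCharsets.US_ASCII);
        } catch (IllegalArgumentException e) {
            return "rejected";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDecode("QUJDQQ==")); // "xxxx xx==": correctly padded
        System.out.println(tryDecode("QUJDQQ"));   // "xxxx xx":   padding omitted
        System.out.println(tryDecode("QUJDQ"));    // "xxxx x":    dangling character
    }
}
```

With the released basic decoder, the first two inputs should both decode to
ABCA, while the dangling "xxxx x" input is rejected with an
IllegalArgumentException.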

> Our difference is that I believe the "padding character" is not part of the
> original data, so we can be "liberal"/lenient about it. But "x===" (or simply
> a dangling "x") is missing part of the original data for decoding, and I'm
> concerned about being liberal in guessing what is missing.
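
The distinction between missing padding and missing data can be checked by
counting bits: each base64 character carries 6 bits, and a decoded byte needs
8. The sketch below is plain arithmetic rather than anything taken from the
patch; FinalUnitBits, wholeBytes() and leftoverBits() are illustrative names.

```java
final class FinalUnitBits {
    // Whole bytes recoverable from `chars` base64 characters (6 bits each).
    static int wholeBytes(int chars) {
        return (chars * 6) / 8;
    }

    // Bits left over once the whole bytes have been taken out.
    static int leftoverBits(int chars) {
        return (chars * 6) % 8;
    }

    public static void main(String[] args) {
        // Walk through the possible sizes of an incomplete final unit.
        for (int n = 1; n <= 3; n++) {
            System.out.printf("%d char(s): %d bits, %d whole byte(s), %d leftover bit(s)%n",
                    n, n * 6, wholeBytes(n), leftoverBits(n));
        }
    }
}
```

A final unit of two or three characters yields one or two complete bytes, so
only padding is absent. A single dangling character yields 6 bits and no
complete byte, so a decoder accepting it would have to guess at or discard
data.
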
Again, I can understand a "strict" decoding that detects all encoding errors and
a "lenient" decoding that ignores encoding errors, but you've got a
"half-lenient/half-strict" decoding that I don't think is useful.