RFR: 8311216: DataURI can lose information in some charset environments

Fri Jul 7 20:50:59 UTC 2023

On Fri, 7 Jul 2023 20:23:17 GMT, Michael Strauß <mstrauss at openjdk.org> wrote:

>> From https://datatracker.ietf.org/doc/html/rfc3986#page-11
>> 
>> 
>> Therefore, the
>> 
>> 
>> 
>> 
>> 
>> Berners-Lee, et al.         Standards Track                    [Page 11]
>> 
>> [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)                   URI Generic Syntax               January 2005
>> 
>> 
>>    integer values used by the ABNF must be mapped back to their
>>    corresponding characters via US-ASCII in order to complete the syntax
>>    rules.
>
>> I wonder if this is all necessary. The data is supposed to be url-encoded, so it's essentially ASCII, no?
>> 
>> passing default charset to getBytes() is not right, it probably should be
>> 
>> URLDecoder.decode(data.replace("+", "%2B"), charset).getBytes(StandardCharsets.US_ASCII));
>> 
>> or am I missing something?
> 
> The payload of a data URI is just a sequence of bytes, not characters. Only when the numeric value of a byte, assuming ASCII encoding, is a *safe URL character*, it is left as-is; otherwise percent-encoding is used to encode the byte value. The [specification](https://datatracker.ietf.org/doc/html/rfc2397) points out:
> 
> Without ";base64", the data (as a sequence of octets) is represented using
> ASCII encoding for octets inside the range of safe URL characters and using
> the standard %xx hex encoding of URLs for octets outside that range.
> 
> 
> Decoding the payload back to a byte array is done by simply converting each assumed ASCII character to its numeric value, and decoding percent-encoded bytes to their hex value. Note that the assumed ASCII encoding only refers to the URL, but not to the payload. The payload is not a string, and it doesn't contain characters; it's a sequence of bytes.
> 
> `URLDecoder` is not a general-purpose class to decode a percent-encoded sequence of bytes. It's specifically meant to take a HTML forms string and decode it into a string with some defined charset, using additional rules that don't generally apply to percent-encoded byte sequences. For example, a space character is encoded as `+` (that's where the kind-of-hacky `data.replace("+", "%2B")` comes from).
> 
> Using `URLDecoder` kind of works (if we use a sufficiently rich charset for both `URLDecoder.decode` and `String.getBytes`), but only by accident. While it accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says:
> 
> There are two possible ways in which this decoder could deal with illegal strings.
> It could either leave illegal characters alone or it could throw an IllegalArgumentException.
> Which approach the decoder takes is left to the implementation.

thank you for clarifications!  your approach does make sense.

-------------

PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256454118