RFR: 8311216: DataURI can lose information in some charset environments
Andy Goryachev
angorya at openjdk.org
Fri Jul 7 20:50:59 UTC 2023
On Fri, 7 Jul 2023 20:23:17 GMT, Michael Strauß <mstrauss at openjdk.org> wrote:
>> From https://datatracker.ietf.org/doc/html/rfc3986#page-11
>>
>> Therefore, the integer values used by the ABNF must be mapped back to their
>> corresponding characters via US-ASCII in order to complete the syntax
>> rules.
>
>> I wonder if this is all necessary. The data is supposed to be url-encoded, so it's essentially ASCII, no?
>>
>> Passing the default charset to getBytes() is not right; it probably should be
>>
>> URLDecoder.decode(data.replace("+", "%2B"), charset).getBytes(StandardCharsets.US_ASCII);
>>
>> or am I missing something?
>
> The payload of a data URI is just a sequence of bytes, not characters. A byte is left as-is only when its numeric value, interpreted as ASCII, corresponds to a *safe URL character*; otherwise the byte value is percent-encoded. The [specification](https://datatracker.ietf.org/doc/html/rfc2397) points out:
>
> Without ";base64", the data (as a sequence of octets) is represented using
> ASCII encoding for octets inside the range of safe URL characters and using
> the standard %xx hex encoding of URLs for octets outside that range.
>
>
> Decoding the payload back to a byte array is done by simply converting each assumed ASCII character to its numeric value, and converting each percent-encoded `%xx` sequence to the byte value denoted by its two hex digits. Note that the assumed ASCII encoding only refers to the URL, not to the payload. The payload is not a string, and it doesn't contain characters; it's a sequence of bytes.
>
> `URLDecoder` is not a general-purpose class for decoding a percent-encoded sequence of bytes. It's specifically meant to take an HTML form string and decode it into a string with some defined charset, using additional rules that don't generally apply to percent-encoded byte sequences. For example, a space character is encoded as `+` (that's where the kind-of-hacky `data.replace("+", "%2B")` comes from).
>
> Using `URLDecoder` kind of works (if we use a sufficiently rich charset for both `URLDecoder.decode` and `String.getBytes`), but only by accident. While it accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says:
>
> There are two possible ways in which this decoder could deal with illegal strings.
> It could either leave illegal characters alone or it could throw an IllegalArgumentException.
> Which approach the decoder takes is left to the implementation.
Thank you for the clarifications! Your approach does make sense.
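
For illustration, the byte-level decoding described above could be sketched roughly as follows. This is not the code from the PR: the class name `DataUriDecodeSketch` and the method `decodePayload` are hypothetical, and rejecting non-ASCII characters and malformed `%xx` escapes with an exception is an assumption of this sketch. The `main` method prints the result next to what `URLDecoder` produces for the same input when US-ASCII is used for both steps.

```java
import java.io.ByteArrayOutputStream;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DataUriDecodeSketch {

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: decodes a non-base64 data URI payload into raw bytes.
    // Each ASCII character maps to its numeric value; each %xx maps to one byte.
    public static byte[] decodePayload(String payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(payload.length());
        int i = 0;
        while (i < payload.length()) {
            char c = payload.charAt(i);
            if (c == '%') {
                if (i + 2 >= payload.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = hexValue(payload.charAt(i + 1));
                int lo = hexValue(payload.charAt(i + 2));
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 3;
            } else if (c > 0x7F) {
                // Assumption of this sketch: a URL contains only ASCII characters.
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            } else {
                // Note: '+' is kept as the byte 0x2B, not turned into a space.
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String payload = "a%FFb+c";

        byte[] direct = decodePayload(payload);

        // For comparison: URLDecoder goes through a String and treats '+' as a space.
        byte[] viaUrlDecoder = URLDecoder
                .decode(payload, StandardCharsets.US_ASCII)
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println("direct:     " + Arrays.toString(direct));
        System.out.println("URLDecoder: " + Arrays.toString(viaUrlDecoder));
    }
}
```

Because the sketch never goes through a `String` with a charset, the byte `0xFF` and the literal `+` both survive unchanged, which is the property the charset-dependent variant cannot guarantee.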
-------------
PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256454118