RFR: 8311216: DataURI can lose information in some charset environments

Fri Jul 7 20:25:59 UTC 2023

On Fri, 7 Jul 2023 19:17:59 GMT, Andy Goryachev <angorya at openjdk.org> wrote:

>> modules/javafx.graphics/src/main/java/com/sun/javafx/util/DataURI.java line 115:
>> 
>>> 113:             nameValuePairs,
>>> 114:             base64,
>>> 115:             base64 ? Base64.getDecoder().decode(data) : decodePercentEncoding(data));
>> 
>> I wonder if this is all necessary.  The data is supposed to be url-encoded, so it's essentially ASCII, no?
>> 
>> passing default charset to getBytes() is not right, it probably should be
>> 
>> URLDecoder.decode(data.replace("+", "%2B"), charset).getBytes(StandardCharsets.US_ASCII));
>> 
>> or am I missing something?
>
> From https://datatracker.ietf.org/doc/html/rfc3986#page-11
> 
> 
> Therefore, the
> 
> 
> 
> 
> 
> Berners-Lee, et al.         Standards Track                    [Page 11]
> 
> [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)                   URI Generic Syntax               January 2005
> 
> 
>    integer values used by the ABNF must be mapped back to their
>    corresponding characters via US-ASCII in order to complete the syntax
>    rules.

> I wonder if this is all necessary. The data is supposed to be url-encoded, so it's essentially ASCII, no?
> 
> passing default charset to getBytes() is not right, it probably should be
> 
> URLDecoder.decode(data.replace("+", "%2B"), charset).getBytes(StandardCharsets.US_ASCII));
> 
> or am I missing something?

The payload of a data URI is just a sequence of bytes, not characters. Only when the numeric value of a byte, assuming ASCII encoding, is a *safe URL character*, it is left as-is; otherwise percent-encoding is used to encode the byte value. The [specification](https://datatracker.ietf.org/doc/html/rfc2397) points out:

Without ";base64", the data (as a sequence of octets) is represented using
ASCII encoding for octets inside the range of safe URL characters and using
the standard %xx hex encoding of URLs for octets outside that range.

Decoding the payload back to a byte array is done by simply converting each assumed ASCII character to its numeric value, and decoding percent-encoded bytes to their hex value. Note that the assumed ASCII encoding only refers to the URL, but not to the payload. The payload is not a string, and it doesn't contain characters; it's a sequence of bytes.

`URLDecoder` is not a general-purpose class to decode a percent-encoded sequence of bytes. It's specifically meant to take a HTML forms string and decode it into a string with some defined charset, using additional rules that don't generally apply to percent-encoded byte sequences. For example, a space character is encoded as `+` (that's where the kind-of-hacky `data.replace("+", "%2B")` comes from).

Using `URLDecoder` kind of works (if we use a sufficiently rich charset for both `URLDecoder.decode` and `String.getBytes`), but only by accident. While it accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says:

There are two possible ways in which this decoder could deal with illegal strings.
It could either leave illegal characters alone or it could throw an IllegalArgumentException.
Which approach the decoder takes is left to the implementation.

-------------

PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256426290