Difference in encoding semantics of URI returned by File.toURI and Path.toUri representing the same file

Daniel Fuchs daniel.fuchs at oracle.com
Wed Jan 13 16:50:37 UTC 2021


Hi Jaikiran,

java.net.URI doesn't require non-ASCII characters to be encoded, but
calling URI.toASCIIString() will ensure that they get encoded.

The two URIs, with non ASCII characters encoded or not encoded,
should be equivalent.
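
A minimal sketch of what I mean by "equivalent", assuming it is enough
to compare the decoded path returned by URI.getPath(). The class and
method names (EncodedVsRaw, decodedPath) are made up for illustration:

```java
import java.net.URI;

public class EncodedVsRaw {
    // Returns the decoded path component: percent-escapes are turned
    // back into characters by URI.getPath().
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        // \u00E3 is the precomposed "a with tilde" character.
        URI raw = URI.create("file:///private/tmp/delme/foob\u00E3r/");
        // Same URI with the non-ASCII character percent-encoded.
        URI encoded = URI.create(raw.toASCIIString());

        System.out.println(raw);
        System.out.println(encoded);
        // Compare the decoded paths of the two forms.
        System.out.println(raw.getPath().equals(encoded.getPath()));
    }
}
```

This only compares decoded paths; it does not rely on URI.equals().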

On 18/12/2020 04:50, Jaikiran Pai wrote:
> 
> URI from Paths.get().toUri() API /private/tmp/delme/foobãr ---> 
> file:///private/tmp/delme/fooba%CC%83r/

This looks like what you would obtain by calling URI.toASCIIString(),
except it's not exactly the same:

URI.create("file:///private/tmp/delme/foobãr/").toASCIIString()
---> file:///private/tmp/delme/foob%C3%A3r/

One form ("%C3%A3") is the UTF-8 encoding of the "ã" Unicode
character (U+00E3).
The other form ("a%CC%83") is the lower case "a" followed by the
UTF-8 encoding of the combining tilde ("a" + U+0303), which is another
way of representing the "ã" glyph in Unicode.
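
To make the two byte sequences concrete, here is a small sketch that
percent-encodes the UTF-8 bytes of each form by hand. The helper
percentEncodeUtf8 is hypothetical and only escapes non-ASCII bytes; it
is not how URI or Path implement their encoding:

```java
import java.nio.charset.StandardCharsets;

public class TildeForms {
    // Percent-encodes every non-ASCII UTF-8 byte of s and leaves
    // ASCII bytes untouched. Illustration only.
    static String percentEncodeUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00E3";    // single precomposed code point
        String decomposed = "a\u0303"; // "a" + combining tilde

        System.out.println(percentEncodeUtf8(composed));   // %C3%A3
        System.out.println(percentEncodeUtf8(decomposed)); // a%CC%83
    }
}
```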

This is because URI.toASCIIString() normalizes the string to NFC before
encoding the UTF-8 sequence representing "ã". Obviously Path.toUri()
does not do that.
It's not clear to me whether NFC should be applied - or what would be
the consequences of applying NFC there (or not).
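
A short sketch of the NFC behaviour, assuming the decomposed string is
fed through URI.create() rather than through Path. The class and helper
names (NfcBeforeEncoding, ascii, nfc) are hypothetical:

```java
import java.net.URI;
import java.text.Normalizer;

public class NfcBeforeEncoding {
    // Parses the string as a URI and returns its US-ASCII form.
    static String ascii(String uri) {
        return URI.create(uri).toASCIIString();
    }

    // Normalizes a string to Unicode Normalization Form C.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed input: "a" followed by the combining tilde U+0303.
        String decomposed = "file:///private/tmp/delme/fooba\u0303r/";

        // NFC composes "a" + U+0303 into U+00E3 ...
        System.out.println(nfc("a\u0303").equals("\u00E3"));
        // ... so the ASCII form carries %C3%A3 rather than a%CC%83.
        System.out.println(ascii(decomposed));
    }
}
```

With this, the composed and decomposed inputs end up with the same
toASCIIString() result, which is why the output differs from what
Path.toUri() produced above.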

Both encodings seem to work if you feed the encoded string
to new URL(string).openConnection().

> URI from File.toPath().toUri() API /private/tmp/delme/foobãr ---> 
> file:///private/tmp/delme/fooba%CC%83r/
> URI from File.toURI() API /private/tmp/delme/foobãr ---> 
> file:/private/tmp/delme/foobãr/

best regards,

-- daniel

