RFR: 8260265: UTF-8 by Default

Giacomo Baso github.com+12575901+gbaso at openjdk.java.net
Wed Jul 14 12:45:17 UTC 2021


On Thu, 8 Jul 2021 21:23:00 GMT, Naoto Sato <naoto at openjdk.org> wrote:

> This is an implementation for the `JEP 400: UTF-8 by Default`. The gist of the changes is `Charset.defaultCharset()` returning `UTF-8` and `file.encoding` system property being added in the spec, but another notable modification is in `java.io.PrintStream` where it continues to use the `Console` encoding as the default charset instead of `UTF-8`. Other changes are mostly clarification of the term "default charset" and their links. Corresponding CSR has also been drafted.
> 
> JEP 400: https://bugs.openjdk.java.net/browse/JDK-8187041
> CSR: https://bugs.openjdk.java.net/browse/JDK-8260266

> Consider an application that creates a java.io.FileWriter with its one-argument constructor and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its one-argument constructor and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since these APIs replace erroneous input rather than fail.

It's even worse than that, because many OpenSSH installations are configured by default to [forward](https://man.openbsd.org/ssh_config.5#SendEnv) and [accept](https://man.openbsd.org/sshd_config.5#AcceptEnv) the user's locale (see, for example, [RHEL 7](https://access.redhat.com/solutions/974273)).

So a single application, on a single remote machine, can unknowingly be started by a single user with different locales, and therefore different encodings, depending on how the user connected to that machine. For example, from Windows, connecting via PowerShell results in `LANG=en_US.UTF-8`, while connecting via WSL2 results in `LANG=C.UTF-8`. With Java 11 on a RHEL 7 machine, `file.encoding` is `UTF-8` in the first case but `ANSI_X3.4-1968` in the second, which makes the default charset `US-ASCII`.
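To make the consequence concrete, here is a minimal sketch (the class name `DefaultCharsetProbe` and the helper `roundTrip` are made up for illustration). It prints what the running JVM picked up from the environment, then simulates a writer and a reader that disagree on the charset, as two sessions with different locales would. It uses explicit charsets rather than `FileWriter`/`FileReader` so the result does not depend on the machine it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetProbe {

    // Hypothetical helper: encode with one charset and decode with another,
    // the way two JVMs with different default charsets would when sharing a file.
    static String roundTrip(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer); // unmappable characters are replaced, no exception
        return new String(bytes, reader);     // malformed bytes are replaced, no exception
    }

    public static void main(String[] args) {
        // What this JVM derived from the session's locale.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        // Unicode escape so the source file's own encoding does not matter.
        String text = "caff\u00e8";

        // Same charset on both sides: text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Writer fell back to US-ASCII: the accented letter is lost at write time.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
        // Reader fell back to US-ASCII: the two UTF-8 bytes decode to replacement characters.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.US_ASCII));
    }
}
```

Running the same class from the two kinds of session described above should show the first two lines change while the code stays identical, which is exactly the problem.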

It is also worth mentioning that `Charset.forName("default")` is just an alias for `US-ASCII`, per `sun.nio.cs.StandardCharsets$Aliases`.
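A quick way to check this on a given JDK is sketched below (the class name `DefaultAliasProbe` and the helper `resolve` are made up for illustration). Since `default` is an implementation-level alias, I am assuming it may not be recognized by every JDK release, so the lookup failure is handled rather than assumed away.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

public class DefaultAliasProbe {

    // Hypothetical helper: return the canonical name a charset alias resolves to,
    // or "unsupported" if this JDK does not recognize the alias.
    static String resolve(String alias) {
        try {
            return Charset.forName(alias).name();
        } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
            return "unsupported";
        }
    }

    public static void main(String[] args) {
        // The misleadingly named alias discussed above.
        System.out.println("default          -> " + resolve("default"));
        // The name glibc reports for the C/POSIX locale.
        System.out.println("ANSI_X3.4-1968   -> " + resolve("ANSI_X3.4-1968"));
        // The charset the JVM actually uses by default, for comparison.
        System.out.println("defaultCharset() -> " + Charset.defaultCharset().name());
    }
}
```

Where the alias is recognized, the first line can differ from the last one, which is the confusing part: `"default"` does not mean "the default charset".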

-------------

PR: https://git.openjdk.java.net/jdk/pull/4733

