Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

Uwe Schindler uschindler at apache.org
Wed Feb 21 08:53:54 UTC 2018


Hi,

> This draft JEP contains a proposal to use UTF-8 as the default charset
> for the JVM, so that
> APIs that depend on the default charset behave consistently cross all
> platforms.
> 
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041

Thanks for finally adding a JEP like this. Thanks also to Robert Muir for always insisting in fixing this problem! I have a few comments:

The JEP should NOT cause that new APIs, which may convert between characters and bytes to no longer explicitly accept a charset. One example is the proposed ByteBuffer methods taking String. The default ones would work with UTF-8, but it should still be possible to an API user to always add a charset whenever there is a conversion between bytes and chars. This is especially important as the user may still change the default and breaking your app. Because the rule is still: Only YOU, the developer, know the charset of your stuff when you load a JAR resource file or pass a String to the network in a ByteBuffer!

The biggest offenders on this is also given as an example: FileReader and FileWriter. Although both classes subclass InputStreamReader/OutputStreamWriter and just pass the right delegate to the superclass in the ctor, both classes are missing the possibility to specify a charset. Because of this, the use of FileReader and FileWriter is completely forbidden in many Apache projects (Apache Lucene, Solr, Elasticsearch, Apache TIKA,...). So I'd suggest to also fix the API here and just add the missing ctors.

The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to behave to the new mechanisms? I'd suggest to not do this, as its part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.

Changing the default would help many users, if they are actually using newer JDKs. For those with older versions (and compiling their code against older versions), you still have to avoid the default charsets. In addition, as you still can change the "default charset", any library developer reading resources from its own JAR file or passing Strings to network protocols cannot rely on the fact, that the default charset is really UTF-8! (a user may have changed it to something else). Because of this, Apache libraries will forbid usage of all methods using default charsets (and locales + timezones). The "changeable default" does not affect application developers (because they have in most cases control about the environment), but library developers should always be explicit!

For this to work, I also want to do some "advertisement": All library projects should use the Forbidden-Apis Maven/Gradle/Ant plugin to scan their bytecode for offenders using default charsets, default locales or relying on default timezones. See the blog post about it [1] and the project page [2]. The tool is also useful to replace "jdeps" in projects with Java versions before 8, as it can scan your code for access to internal JDK APIs, too. See the documentation [3] and github wiki pages for useful examples. It may also be a good idea to mention it in the JEP as a "workaround" or "further reading".

Finally: Because one can still change the default, I'd propose to deprecate all methods that use a default charset (unrelated to actually changing the default). Only if you do this, it would make tools like "forbiddenapis" irrelevant for library developers.

And finally, finally: I'd also propose to change the default Locale to Locale.ROOT (same issues). The String.toLowerCase() in Turkish locales still break thousands of apps! But that's a different JEP - but I would strongly support it!

Uwe

[1] http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
[2] https://github.com/policeman-tools/forbidden-apis
[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/

-----
Uwe Schindler
uschindler at apache.org 
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/




More information about the core-libs-dev mailing list