Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms. For more details, please see: https://bugs.openjdk.java.net/browse/JDK-8187041

Sherman
Hi,
This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms.
For more details, please see: https://bugs.openjdk.java.net/browse/JDK-8187041
Thanks for finally adding a JEP like this. Thanks also to Robert Muir for always insisting on fixing this problem! I have a few comments:

The JEP should NOT cause new APIs that convert between characters and bytes to no longer explicitly accept a charset. One example is the proposed ByteBuffer methods taking String. The default variants would work with UTF-8, but it should still be possible for an API user to pass a charset whenever there is a conversion between bytes and chars. This is especially important because the user may still change the default and break your app. The rule is still: only YOU, the developer, know the charset of your data when you load a JAR resource file or pass a String to the network in a ByteBuffer!

The biggest offenders here are also given as an example: FileReader and FileWriter. Although both classes subclass InputStreamReader/OutputStreamWriter and just pass the right delegate to the superclass in the constructor, both classes lack the possibility to specify a charset. Because of this, the use of FileReader and FileWriter is completely forbidden in many Apache projects (Apache Lucene, Solr, Elasticsearch, Apache Tika, ...). So I'd suggest also fixing the API here by adding the missing constructors.

The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How should we proceed with those? Should they be changed to follow the new mechanism? I'd suggest not doing this, as UTF-8 is part of their spec and should not rely on external forces, but I wanted to bring it up.

Changing the default would help many users, if they are actually using newer JDKs. For those with older versions (and compiling their code against older versions), you still have to avoid the default charsets.
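To illustrate the FileReader gap described above: since the class offers no charset parameter, projects that forbid it fall back to the InputStreamReader composition it wraps internally. A minimal sketch (the temp file and its contents are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetReader {
    public static void main(String[] args) throws IOException {
        Path conf = Files.createTempFile("conf", ".txt");
        Files.write(conf, "grüße".getBytes(StandardCharsets.UTF_8));

        // new FileReader(conf.toFile()) would decode with the platform
        // default charset and offers no way to override it; this is the
        // explicit equivalent that charset-strict projects mandate:
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream(conf.toFile()), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // grüße
        }
    }
}
```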
In addition, as you can still change the "default charset", any library developer reading resources from its own JAR file or passing Strings to network protocols cannot rely on the default charset really being UTF-8 (a user may have changed it to something else). Because of this, Apache libraries will forbid usage of all methods using default charsets (and locales + timezones). The "changeable default" does not affect application developers as much (because they in most cases control the environment), but library developers should always be explicit!

For this to work, I also want to do some "advertisement": all library projects should use the Forbidden-APIs Maven/Gradle/Ant plugin to scan their bytecode for offenders using default charsets, default locales, or relying on default timezones. See the blog post about it [1] and the project page [2]. The tool is also useful as a replacement for "jdeps" in projects on Java versions before 8, as it can scan your code for access to internal JDK APIs, too. See the documentation [3] and the GitHub wiki pages for useful examples. It may also be a good idea to mention it in the JEP as a "workaround" or "further reading".

Finally: because one can still change the default, I'd propose to deprecate all methods that use a default charset (independent of actually changing the default). Only then would tools like "forbiddenapis" become irrelevant for library developers.

And finally, finally: I'd also propose changing the default Locale to Locale.ROOT (same issues). String.toLowerCase() in Turkish locales still breaks thousands of apps! But that's a different JEP - one I would strongly support!

Uwe

[1] http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
[2] https://github.com/policeman-tools/forbidden-apis
[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/

-----
Uwe Schindler
uschindler@apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
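The Turkish-locale pitfall Uwe mentions is easy to reproduce; a minimal sketch (the input string is illustrative):

```java
import java.util.Locale;

public class TurkishLocale {
    public static void main(String[] args) {
        // In the Turkish locale, 'I' lower-cases to dotless 'ı' (U+0131),
        // so locale-sensitive case folding silently breaks ASCII tokens
        // such as protocol keywords or config keys:
        String turkish = "TITLE".toLowerCase(new Locale("tr"));
        String root = "TITLE".toLowerCase(Locale.ROOT);
        System.out.println(turkish); // tıtle
        System.out.println(root);    // title
    }
}
```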
I agree with Uwe: we should deprecate all methods/constructors that rely on the default charset, and we should do that before changing to use UTF-8 by default. Remi
On 21/02/2018 08:53, Uwe Schindler wrote:
: The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to behave to the new mechanisms? I'd suggest to not do this, as its part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.
There is no proposal to change these methods. -Alan
Hi Alan,
The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to behave to the new mechanisms? I'd suggest to not do this, as its part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.
There is no proposal to change these methods.
Thanks for clarifying! I just wanted to mention this, because those methods are different, so you should at least think about it 😊

Uwe

-----
Uwe Schindler
uschindler@apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
On 21/02/2018 20:50, Uwe Schindler wrote:
: Thanks for clarifying! I just wanted to mention this, because those methods are different, so you should at least think about it 😊
These methods were deliberately specified to use UTF-8 and I don't think we should change them (changing them for a release or two would cause needless breakage of course). -Alan
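The asymmetry being discussed can be seen side by side; a minimal sketch (the temp file and its contents are illustrative):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FilesAlwaysUtf8 {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".txt");
        Files.write(p, "grüße".getBytes(StandardCharsets.UTF_8));

        // Files.readAllLines(Path) is specified to decode as UTF-8,
        // independent of Charset.defaultCharset():
        System.out.println(Files.readAllLines(p).get(0)); // grüße

        // new String(byte[]) uses the default charset instead, so on a
        // non-UTF-8 platform the same bytes may come back mangled:
        System.out.println(new String(Files.readAllBytes(p)));
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```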
I agree with Uwe and Remi; if the default is still changeable, the problem doesn't go away, it simply becomes slightly more insidious.
--
- DML
On 21/02/2018 13:19, David Lloyd wrote:
I agree with Uwe and Remi; if the default is still changeable, the problem doesn't go away, it simply becomes slightly more insidious.
The proposal is to eventually get to the point where the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact. This draft JEP is the first step: switching to UTF-8 by default. A first step has to allow the default to be changed in order to keep some existing code/deployments working. Sorry this isn't clear in the JEP yet; there are several clarifications to this JEP that haven't been included yet (on my list; I didn't realize it would be discussed here this week). -Alan
On 21 February 2018 at 13:37, Alan Bateman <Alan.Bateman@oracle.com> wrote:
The proposal is to eventually get to the point that the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact.
This seems like a reasonable strategy to solve the problem. I also agree that all locations where the default charset is used need a method alongside that takes a Charset, e.g. FileWriter.

Stephen
On 21/02/2018 13:41, Stephen Colebourne wrote:
On 21 February 2018 at 13:37, Alan Bateman <Alan.Bateman@oracle.com> wrote:
The proposal is to eventually get to the point that the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact. This seems like a reasonable strategy to solve the problem.
I also agree that all locations where a default charset is used need to have a method alongside that takes a CharSet, eg. FileWriter.
Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.

-Alan

[1] https://bugs.openjdk.java.net/browse/JDK-8183743
On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman <Alan.Bateman@oracle.com> wrote:
Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.
Can we ensure we have CharsetDecoder/Encoder params too? There is unfortunately a huge difference between InputStreamReader(x, StandardCharsets.UTF_8) and InputStreamReader(x, StandardCharsets.UTF_8.newDecoder()). And the silent replacement of the "easier" one is probably not what most apps want.
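The difference Robert describes comes from the fact that the Charset-taking constructor installs a decoder with CodingErrorAction.REPLACE, while a fresh CharsetDecoder defaults to REPORT. A minimal sketch of both behaviors on the same malformed input:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReplaceVsReport {
    public static void main(String[] args) throws IOException {
        byte[] bad = {(byte) 0xFF, 'a', 'b', 'c'}; // 0xFF is never valid UTF-8

        // Charset constructor: malformed bytes silently become U+FFFD
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bad), StandardCharsets.UTF_8)) {
            System.out.println(Integer.toHexString(r.read())); // fffd
        }

        // CharsetDecoder constructor: fresh decoders default to REPORT,
        // so the very same input throws MalformedInputException
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bad),
                StandardCharsets.UTF_8.newDecoder())) {
            r.read();
        } catch (IOException e) {
            System.out.println("caught: " + e);
        }
    }
}
```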
On 2/21/18, 6:26 AM, Robert Muir wrote:
On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman<Alan.Bateman@oracle.com> wrote:
Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.
Can we ensure we have CharsetDecoder/Encoder params too? There is unfortunately a huge difference between InputStreamReader(x, StandardCharsets.UTF_8) and InputStreamReader(x, StandardCharsets.UTF_8.newDecoder()). And the silent replacement of the "easier" one is probably not what most apps want.

Hi Robert,

Understood, a silent replacement might not be the desired behavior in some use scenarios. Any more details regarding what "most apps want" when there is malformed/unmappable input? It appears the best the underlying de/encoder can do here is to throw an IOException. Given that the caller of the Reader/Writer does not have access to the bytes of the underlying stream src (reader)/dst (writer), it is in theory impossible to do anything to recover and continue without risking data loss. The assumption here is that if you want fine-grained control of the de/encoding, you might want to work with the Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.

No, I'm not saying we can't do Reader(CharsetDecoder)/Writer(CharsetEncoder); I just wanted to know what the real use scenario is and what the better/best choice here would be.

-Sherman
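Sherman's suggestion of working at the buffer level can be sketched as follows: with a CharsetDecoder over a ByteBuffer, the caller keeps the offending bytes in view and chooses per error class what to do (the byte array is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FineGrainedDecode {
    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF is malformed UTF-8

        // The decoder's error actions are configured explicitly, per
        // error class, instead of being hidden inside a Reader:
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = dec.decode(ByteBuffer.wrap(bad));
        System.out.println(out); // a\uFFFDb
    }
}
```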
On Wed, Feb 21, 2018 at 1:16 PM, Xueming Shen <xueming.shen@oracle.com> wrote:
Hi Robert,
Understood a silent replacement might not be the desired behavior in some use scenarios. Anymore details regarding what "most apps want" when there is/are malformed/unmappable? It appears the best the underneath de/encoder can do here is to throw an IOException. Given the caller of the Reader/Writer does not have the access to the bytes of the underlying stream src (reader)/dst(writer), there is in theory impossible to do anything to recover and continue without risking data loss. The assumption here is if you want to have a fine-grained control of the de/ encoding, you might want to work with the Input/OutStream/Channel + CharsetDe/Encoder instead of Reader/Writer.
No, I'm not saying we can't do Reader(CharsetDecoder)/Writer(CharsetEncoder), just wanted to know what's the real use scenario and what's the better/ best choice here.
I think the exception is the best default. This is the default behavior of Python, for example, unless you specifically ask for "replace" or "ignore".
>>> b'\xFFabc'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It's also the default behavior of the 'iconv' command-line tool used for converting charsets, unless you pass additional options:

$ iconv -f utf-8 -t utf-8 test2.mp4
ftypisomisomiso2avc1mp41e
iconv: test2.mp4:1:26: cannot convert

Unfortunately in Java, when using Charset or String parameters, you silently get replacement with \uFFFD, etc. It's necessary to pass a CharsetDecoder to get an exception that something went wrong. The current situation is especially confusing, as there is nothing in the javadocs to indicate that the behavior of InputStreamReader(x, Charset) and InputStreamReader(x, String) differs substantially from InputStreamReader(x, CharsetDecoder)!

I think the Charset and String parameters should default to REPORT, so the behavior of all constructors is consistent. If you want replacement, you should have to ask for it. I think replacement has use-cases, but they are more "expert", e.g. web-crawling and so on. In general, wrong bytes indicate a problem, and it can be very difficult to debug these issues when Java hides these problems by default...
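The Java-side counterpart of the Python and iconv examples above can be sketched like this: the convenient String APIs replace silently, and only the raw decoder fails loudly:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class SilentReplacement {
    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, 'a', 'b', 'c'}; // same bytes as the Python example

        // The convenient API never reports the error; the corruption
        // becomes a U+FFFD and goes unnoticed:
        String s = new String(bad, StandardCharsets.UTF_8);
        System.out.println(s.charAt(0) == '\uFFFD'); // true

        // Only a fresh CharsetDecoder (default action: REPORT) behaves
        // like Python's str.decode or iconv and fails loudly:
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bad));
        } catch (CharacterCodingException e) {
            System.out.println("caught: " + e); // MalformedInputException
        }
    }
}
```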
Hi,

Thanks Alan for the link to this issue about FileReader/Writer!

Uwe

-----
Uwe Schindler
uschindler@apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
Hi Sherman,

the tricky part is really "sun.jnu.encoding" and how the VM interacts with the underlying OS. You may remember that we had an interesting discussion about this topic some time ago [1].

As far as I understand, the JEP doesn't plan to change the handling of "sun.jnu.encoding". So does this mean that the VM will still correctly start and work on a system with a platform encoding different from UTF-8? I.e. will starting the VM from a path which contains characters in that special platform encoding, or classpath/argument settings with characters in that special encoding, still work? If the answer is yes (which I expect), maybe you could explain that in a little more detail in the JEP. I.e. the JEP should say that it changes the default encoding for the Java API classes and not the default encoding for natively accessing system resources.

Maybe the JEP should also mention that "sun.jnu.encoding" is more or less a "read-only" property which cannot be reliably set by the user on the command line (it's a chicken-and-egg problem: for the parsing of the command line we need the correct encoding, so it cannot be reliably set on the command line).

For these reasons the Summary "Use UTF-8 as the Java virtual machine's default charset ..." is a little misleading. Maybe you could rephrase it to something like "Use UTF-8 as the default charset so that Java APIs that depend on the default charset behave consistently across all platforms."

Thank you and best regards,
Volker

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/thread.ht...
Hi Volker,

Yes, the handling of sun.jnu.encoding will not be changed. It will remain a read-only/informative system property. sun.jnu.encoding is really an implementation detail (as is file.encoding, though in this JEP file.encoding might be used to provide a mechanism to fall back to the current/old/existing behavior, so it might become a public/official interface/system property). From the API perspective, Charset.defaultCharset() is the only place to obtain the "Java virtual machine's default charset".

As Alan said in a previous comment, clarifications will be included in the final version based on feedback/suggestions.

-Sherman
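The two properties discussed above can be inspected directly; a minimal sketch (what the sketch prints depends on the platform, so no output is claimed):

```java
import java.nio.charset.Charset;

public class DefaultCharsetProps {
    public static void main(String[] args) {
        // file.encoding backs Charset.defaultCharset(), the charset the
        // Java APIs use; sun.jnu.encoding is the JDK-internal charset used
        // when talking to the OS (file paths, command-line arguments, ...).
        // As discussed above, sun.jnu.encoding is effectively read-only:
        // it cannot be reliably overridden on the command line.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("defaultCharset() = " + Charset.defaultCharset());
    }
}
```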
participants (8)

- Alan Bateman
- David Lloyd
- Remi Forax
- Robert Muir
- Stephen Colebourne
- Uwe Schindler
- Volker Simonis
- Xueming Shen