Making the source code utf-8

Magnus Ihse Bursie magnus.ihse.bursie at oracle.com
Tue Feb 7 12:28:11 UTC 2023


Currently, the source code in the JDK is in an ill-defined encoding.
There is no official declaration of the encoding used. It is "mostly
ASCII", but the relatively few non-ASCII characters are not in any
single, declared encoding. In many cases, it is Latin-1, but I am pretty
certain other encodings are used for e.g. Asian translations.

This is creating unnecessary problems when working with the JDK code
base, while providing no benefit. We ended up here not by choice, but by
historical accident. Most recently, this issue has surfaced in
JDK-8301853, JDK-8301854 and JDK-8301855, but related issues have popped
up from time to time, e.g. JDK-8263028.

As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up on 
this by converting our code base to UTF-8.

I have created JDK-8301971[2] with the intention of converting all files 
to UTF-8, and updating all infrastructure to recognize this fact.

Even though 99.9% of all text in the JDK repository is ASCII only, with
a code base the size of the JDK there are of course many, many instances
that need to be checked and/or converted. I can take care of the
overarching issues, like updating compiler flags and developing tooling
to detect non-ASCII files, and I can try to convert those files based on
my best guesses. In the end, though, there are likely to be many files
that need to be verified by their respective teams, to make sure I did
not assume the wrong source encoding.
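
To give a rough idea of what the detection step could look like, here is
a minimal sketch. It is not the actual tooling. The class name
EncodingCheck, the Kind enum and the classify method are hypothetical
names made up for this illustration. It sorts a file's bytes into three
groups: pure ASCII, valid UTF-8 with non-ASCII characters, and anything
else. The last group would need a guess at the original encoding, and a
check by the owning team.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only.
public class EncodingCheck {
    enum Kind { ASCII, UTF8, OTHER }

    // Classify raw file content without assuming any legacy encoding.
    static Kind classify(byte[] bytes) {
        boolean ascii = true;
        for (byte b : bytes) {
            if (b < 0) {            // high bit set: not 7-bit ASCII
                ascii = false;
                break;
            }
        }
        if (ascii) {
            return Kind.ASCII;
        }
        try {
            // A strict decoder reports malformed input instead of
            // silently substituting replacement characters.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return Kind.UTF8;
        } catch (CharacterCodingException e) {
            return Kind.OTHER;      // Latin-1, some other encoding, or binary
        }
    }

    // Walk a directory tree and list every file that is not pure ASCII.
    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    Kind kind = classify(Files.readAllBytes(p));
                    if (kind != Kind.ASCII) {
                        System.out.println(kind + "\t" + p);
                    }
                } catch (IOException e) {
                    System.err.println("skipped " + p + ": " + e.getMessage());
                }
            });
        }
    }
}
```

Note that this sketch makes no attempt to tell text files from binary
ones, so images and similar files would also be reported as OTHER. It
also cannot tell which legacy encoding an OTHER file uses; that part
still needs a guess and a human check.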

So, before I go ahead and start doing this, I want to check:

* Is everyone on board with this idea? I do assume that in 2023, having
UTF-8 encoding for text files is (or should be) a no-brainer, but I want
to verify that no one is opposed to this.

* Should I open a JEP for this? On the one hand, it is likely to require 
a non-trivial amount of work, but on the other hand, there is no change 
visible for the end user, so it will be kind of pointless to announce. 
For my part, I could go either way, so I'm interested in hearing 
opinions, preferably with good rationales, for one way or the other.

/Magnus

[1] https://openjdk.org/jeps/400
[2] https://bugs.openjdk.org/browse/JDK-8301971



