Making the source code utf-8

Raffaello Giulietti raffaello.giulietti at oracle.com
Tue Feb 7 12:50:25 UTC 2023


I would welcome this change.

A later discussion could address whether a BOM would be helpful and/or required, even though using one runs somewhat counter to the usual recommendations.

Raffaello

From: jdk-dev <jdk-dev-retn at openjdk.org> on behalf of Magnus Ihse Bursie <magnus.ihse.bursie at oracle.com>
Date: Tuesday, 7 February 2023 at 13:28
To: jdk-dev at openjdk.org <jdk-dev at openjdk.org>
Subject: Making the source code utf-8
Currently, the source code in the JDK is in an ill-defined encoding.
There is no official declaration of the encoding used. It is "mostly
ASCII", but the encoding of the relatively few non-ASCII characters is
not well-defined. In many cases, it is Latin-1, but I am pretty certain
other encodings are used for e.g. Asian translations.

This is creating unnecessary problems when working with the JDK code
base, while providing no benefit. We ended up here not by choice, but by
historical accident. Most recently, this issue has surfaced in
JDK-8301853, JDK-8301854 and JDK-8301855, but related issues have popped
up from time to time, e.g. JDK-8263028.

As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up on
this by converting our code base to UTF-8.

I have created JDK-8301971[2] with the intention of converting all files
to UTF-8, and updating all infrastructure to recognize this fact.

Even though 99.9% of all text in the JDK repository is ASCII only, with
a code base the size of the JDK there are of course many, many instances
that need to be checked and/or converted. I can take care of the
overarching issues, like updating compiler flags and developing tooling
to detect non-ASCII files, and try to convert them based on my best
guesses, but in the end, there are likely to be many files which need to
be verified by their respective teams, to confirm that I did not assume
the incorrect source encoding.
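
To illustrate the kind of detection tooling described above, here is a
minimal sketch in Java. It is not the tooling proposed in JDK-8301971;
the class name EncodingCheck, the Kind enum and the classify method are
hypothetical names chosen for this example. It sorts a file's bytes into
three buckets: pure ASCII, non-ASCII but valid UTF-8, and anything else
(for example Latin-1), which is the bucket that would need a human to
confirm the original encoding.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only.
final class EncodingCheck {
    enum Kind { ASCII, UTF8, OTHER }

    // Classify raw file contents by the bytes they contain.
    static Kind classify(byte[] bytes) {
        boolean ascii = true;
        for (byte b : bytes) {
            if (b < 0) {          // high bit set: not 7-bit ASCII
                ascii = false;
                break;
            }
        }
        if (ascii) {
            return Kind.ASCII;
        }
        // A strict decoder that reports malformed input instead of
        // silently substituting replacement characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return Kind.UTF8;
        } catch (CharacterCodingException e) {
            return Kind.OTHER;    // some legacy encoding; needs review
        }
    }

    // Walk a directory tree and list every file that is not pure ASCII.
    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                try {
                    Kind kind = classify(Files.readAllBytes(p));
                    if (kind != Kind.ASCII) {
                        System.out.println(kind + " " + p);
                    }
                } catch (IOException e) {
                    System.err.println("skipped " + p + ": " + e.getMessage());
                }
            });
        }
    }
}
```

A check like this can only say that a file is not valid UTF-8; it cannot
say which legacy encoding was intended, which is why the files in the
OTHER bucket would still need to be verified by their respective teams.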

So, before I go ahead and start doing this, I want to check:

* Is everyone on board with this idea? I do assume that in 2023, having
UTF-8 encoding for text files is (or should be) a no-brainer, but I want
to verify that no one opposes this.

* Should I open a JEP for this? On the one hand, it is likely to require
a non-trivial amount of work, but on the other hand, there is no change
visible for the end user, so it will be kind of pointless to announce.
For my part, I could go either way, so I'm interested in hearing
opinions, preferably with good rationales, for one way or the other.

/Magnus

[1] https://openjdk.org/jeps/400
[2] https://bugs.openjdk.org/browse/JDK-8301971
