Making the source code utf-8
Magnus Ihse Bursie
magnus.ihse.bursie at oracle.com
Tue Feb 7 14:15:42 UTC 2023
On 2023-02-07 14:07, Daniel Jeliński wrote:
> +1 to make the code build regardless of the user's environment / locale.
>
> Would it be possible to enforce ASCII by default, and allow UTF-8 in
> exceptional cases? This would give us one extra layer of protection
> against trojan sources [1]
ASCII-only certainly has its advantages, yes, including protecting
against that kind of attack.
I think we need to treat the entire code base as UTF-8, e.g. in terms of
what arguments we send to compilers.
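
To illustrate what "treating a file as UTF-8" implies, here is a minimal sketch in Python (not part of the JDK build; the function name `is_valid_utf8` is made up for this example). Once tools are told to read sources as UTF-8, for instance via an option such as javac's `-encoding`, a byte sequence left over from a latin-1 file no longer decodes cleanly:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII is always valid UTF-8, so most files need no change.
print(is_valid_utf8(b"int x = 1;"))
# latin-1 encodes "é" as the single byte 0xE9, which is not valid UTF-8 here.
print(is_valid_utf8("café".encode("latin-1")))
# The same text encoded as UTF-8 (0xC3 0xA9 for "é") is accepted.
print(is_valid_utf8("café".encode("utf-8")))
```

A strict decode like this only tells you that a file is not UTF-8; it cannot tell you which legacy encoding was actually used.
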
With that said, we could extend jcheck to separately check if a file
contains non-ASCII characters, and block such changes from being pushed.
The question then becomes: how do we handle exceptions? By having a
global "allow-list" containing filenames for files that are acceptable
to have non-ASCII characters? By requiring them to have a certain name
pattern? By inserting some kind of meta-data character sequence in them
that marks them as non-ASCII?
These are the only options I can think of, and none of them sound
attractive to me.
A better approach, I think, is to have some kind of jcheck "warning"
(not a blocker for integration) that warns you that you have non-ASCII
characters in the code you are about to check in. That will, hopefully,
be enough to catch the unintentional introduction of e.g. typographic
quotes, as well as malicious attacks (provided that reviewers stay alert
to such warnings).
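
As a rough sketch of what such a warning could look like (this is not how jcheck is implemented; the function name `non_ascii_warnings` and the message format are invented for illustration), a check might list every non-ASCII character with its position, and single out the bidirectional control characters that the Trojan Source technique relies on:

```python
import unicodedata

# Bidirectional control characters of the kind abused by "Trojan Source".
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def non_ascii_warnings(text: str) -> list:
    """Return one warning string per non-ASCII character in text."""
    warnings = []
    # Split on "\n" only, so that Unicode line separators such as U+2028
    # stay inside a line and get reported instead of silently consumed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 128:
                continue
            name = unicodedata.name(ch, "UNKNOWN")
            kind = "bidi control" if ch in BIDI_CONTROLS else "non-ASCII"
            warnings.append(f"{line_no}:{col}: {kind} U+{ord(ch):04X} {name}")
    return warnings


# Typographic quotes pasted into a string literal by accident.
for warning in non_ascii_warnings("String s = \u201chello\u201d;\n"):
    print("warning:", warning)
```

Because the result is only a list of messages, the caller decides whether to print them as warnings or to treat them as errors, which matches the "warn, do not block" idea above.
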
/Magnus
>
> Regards,
> Daniel
>
> [1] https://trojansource.codes/
>
> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie
> <magnus.ihse.bursie at oracle.com> wrote:
>
> Currently, the source code in the JDK is in an ill-defined encoding.
> There is no official declaration of the encoding used. It is "mostly
> ASCII", but the relatively few non-ASCII characters used are not
> well-defined. In many cases, it is latin-1, but I am pretty certain
> other encodings are used for e.g. Asian translations.
>
> This is creating unnecessary problems when working with the JDK code
> base, while providing no benefit. We ended up here not by choice, but
> by historical accident. Most recently, this issue has surfaced in
> JDK-8301853, JDK-8301854 and JDK-8301855, but issues relating to this
> have popped up from time to time, e.g. JDK-8263028.
>
> As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up on
> this by converting our code base to UTF-8.
>
> I have created JDK-8301971[2] with the intention of converting all
> files to UTF-8, and updating all infrastructure to recognize this fact.
>
> Even though 99.9% of all text in the JDK repository is ASCII only,
> with a code base the size of the JDK there are of course many, many
> instances that need to be checked and/or converted. I can take care of
> the overarching issues, like updating compiler flags and developing
> tooling to detect and try to convert non-ASCII files based on my best
> guesses, but in the end, there are likely to be many files which need
> to be verified by their respective teams, to confirm that I did not
> assume the incorrect source encoding.
>
> So, before I go ahead and start doing this, I want to check:
>
> * Is everyone on board with this idea? I do assume that in 2023,
> having UTF-8 encoding for text files is (or should be) a no-brainer,
> but I want to verify that there is no one opposing this.
>
> * Should I open a JEP for this? On the one hand, it is likely to
> require a non-trivial amount of work, but on the other hand, there is
> no change visible for the end user, so it will be kind of pointless to
> announce. For my part, I could go either way, so I'm interested in
> hearing opinions, preferably with good rationales, for one way or the
> other.
>
> /Magnus
>
> [1] https://openjdk.org/jeps/400
> [2] https://bugs.openjdk.org/browse/JDK-8301971