<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>On 2023-02-07 14:07, Daniel Jeliński wrote:<br>
</p>
<blockquote type="cite" cite="mid:CAMrH03KDwJ+9vbhQ8EQG6nkuYcuCyK0BqB7AOmtmAaCAQL716Q@mail.gmail.com">
<div dir="ltr">+1 to make the code build regardless of the user's
environment / locale.<br>
<div><br>
</div>
<div>Would it be possible to enforce ASCII by default, and allow
UTF-8 in exceptional cases? This would give us one extra layer
of protection against trojan sources [1]</div>
</div>
</blockquote>
<p>ASCII-only certainly has its advantages, yes, including
protecting against that kind of attack. <br>
</p>
<p>I think we need to treat the entire code base as UTF-8, e.g. in
terms of what arguments we send to compilers. <br>
</p>
<p>With that said, we could extend jcheck to separately check whether a
file contains non-ASCII characters, and reject such changes before
they are pushed. <br>
</p>
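<p>To make the idea concrete, here is a minimal sketch in Python of
what such a check could do. It is not jcheck code; the function names
<code>find_non_ascii</code> and <code>check_file</code> are
hypothetical, and it assumes the check runs on the raw bytes of each
changed file.</p>
<pre>
```python
from pathlib import Path


def find_non_ascii(data: bytes) -> list[tuple[int, int, int]]:
    """Return (line, column, byte value) for every non-ASCII byte in data.

    Lines and columns are 1-based; columns count bytes, not characters.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        if line.isascii():
            continue  # fast path: most lines are pure ASCII
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits


def check_file(path: Path) -> list[str]:
    """Format one human-readable message per offending byte in the file."""
    return [
        f"{path}:{line}:{col}: non-ASCII byte 0x{byte:02X}"
        for line, col, byte in find_non_ascii(path.read_bytes())
    ]
```
</pre>
<p>Working on bytes rather than decoded text means the check does not
need to know the file's encoding, which matters as long as the
encoding of the code base is not yet well-defined.</p>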
<p>The question then becomes: how do we handle exceptions? By having
a global "allow-list" containing filenames for files that are
acceptable to have non-ASCII characters? By requiring them to have
a certain name pattern? By inserting some kind of meta-data
character sequence in them that marks them as non-ASCII?</p>
<p>These are the only options I can think of, and none of them sound
attractive to me.<br>
</p>
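<p>For reference, the first option (a global allow-list) could look
roughly like the sketch below. This is Python for illustration only,
not jcheck code; the patterns in <code>ALLOWED_NON_ASCII</code> and
the function names are made up, and no such list exists in the JDK
today.</p>
<pre>
```python
from fnmatch import fnmatchcase

# Hypothetical allow-list; these patterns are invented for illustration.
ALLOWED_NON_ASCII = [
    "test/hypothetical/unicode/*",
    "*.utf8-sample.txt",
]


def may_contain_non_ascii(path: str, patterns=ALLOWED_NON_ASCII) -> bool:
    """True if the repository-relative path matches an allow-list pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def classify(path: str, data: bytes, patterns=ALLOWED_NON_ASCII) -> str:
    """Return "ok", "allowed" or "rejected" for one file's contents."""
    if data.isascii():
        return "ok"
    if may_contain_non_ascii(path, patterns):
        return "allowed"
    return "rejected"
```
</pre>
<p>The sketch also shows the maintenance cost: every legitimate
non-ASCII file needs a matching entry, and the list has to be kept in
sync as files move.</p>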
<p>A better approach, I think, is to have some kind of jcheck
"warning" (not a blocker for integration) that tells you that you
have non-ASCII characters in the code you are about to check in.
That will, hopefully, be enough to catch the unintentional
introduction of e.g. typographic quotes, or malicious attacks
(provided that reviewers are alert to such warnings).</p>
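<p>A minimal sketch of what such a warning could report, again in
Python and not actual jcheck code. The function name
<code>non_ascii_warnings</code> and the message format are
hypothetical. It assumes the file has already been decoded as UTF-8,
and it singles out the bidirectional control characters that the
Trojan Source attacks rely on, so that they stand out from harmless
cases like typographic quotes.</p>
<pre>
```python
import unicodedata

# Bidirectional control characters used by the "Trojan Source" attacks:
# U+202A..U+202E (embeddings/overrides) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def non_ascii_warnings(text: str) -> list[str]:
    """Return one warning string per non-ASCII character in text.

    The caller decides what to do with the warnings; this function
    only reports and never rejects anything.
    """
    warnings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch.isascii():
                continue
            name = unicodedata.name(ch, "unnamed character")
            tag = "BIDI CONTROL" if ch in BIDI_CONTROLS else "non-ASCII"
            warnings.append(
                f"warning: {line_no}:{col}: {tag} U+{ord(ch):04X} ({name})"
            )
    return warnings
```
</pre>
<p>Printing the code point and Unicode name makes invisible
characters visible to the reviewer, which is the whole point of the
warning.</p>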
<p>/Magnus<br>
</p>
<blockquote type="cite" cite="mid:CAMrH03KDwJ+9vbhQ8EQG6nkuYcuCyK0BqB7AOmtmAaCAQL716Q@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
<div>Regards,</div>
<div>Daniel</div>
<div><br>
</div>
<div>[1] <a href="https://trojansource.codes/" moz-do-not-send="true" class="moz-txt-link-freetext">https://trojansource.codes/</a></div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, 7 Feb 2023 at 13:28, Magnus
Ihse Bursie &lt;<a href="mailto:magnus.ihse.bursie@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">magnus.ihse.bursie@oracle.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Currently,
the source code in the JDK is in an ill-defined encoding. <br>
There is no official declaration of the encoding used. It is
"mostly <br>
ASCII", but the relatively few non-ASCII characters used are
not <br>
well-defined. In many cases, it is latin-1, but I am pretty
certain <br>
other encodings are used for e.g. Asian translations.<br>
<br>
This is creating unnecessary problems when working with the
JDK code <br>
base, while providing no benefit. We ended up here not by
choice, but by <br>
historical accident. Most recently, this issue has surfaced in
<br>
JDK-8301853, JDK-8301854 and JDK-8301855, but issues relating to
this have popped up <br>
from time to time, e.g. JDK-8263028.<br>
<br>
As JEP 400[1] confirms, UTF-8 is the way to go. We should
follow up on <br>
this by converting our code base to UTF-8.<br>
<br>
I have created JDK-8301971[2] with the intention of converting
all files <br>
to UTF-8, and updating all infrastructure to recognize this
fact.<br>
<br>
Even though 99.9% of all text in the JDK repository is ASCII
only, with <br>
a code base the size of the JDK there are of course many, many
instances <br>
that need to be checked and/or converted. I can take care of
the <br>
overarching issues, like updating compiler flags and developing
tooling to <br>
detect, and try to convert, non-ASCII files based on my best
guesses, but <br>
in the end, there are likely to be many files which need to
be verified <br>
by their respective teams, to confirm that I did not assume an
incorrect source <br>
encoding.<br>
<br>
So, before I go ahead and start doing this, I want to check:<br>
<br>
* Is everyone on board with this idea? I do assume that in
2023, having <br>
UTF-8 encoding for text files is (or should be) a no-brainer,
but I want <br>
to verify that there is no-one opposing this.<br>
<br>
* Should I open a JEP for this? On the one hand, it is likely
to require <br>
a non-trivial amount of work, but on the other hand, there is
no change <br>
visible for the end user, so it will be kind of pointless to
announce. <br>
For my part, I could go either way, so I'm interested in
hearing <br>
opinions, preferably with good rationales, for one way or the
other.<br>
<br>
/Magnus<br>
<br>
[1] <a href="https://openjdk.org/jeps/400" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://openjdk.org/jeps/400</a><br>
[2] <a href="https://bugs.openjdk.org/browse/JDK-8301971" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8301971</a><br>
<br>
<br>
</blockquote>
</div>
</blockquote>
</body>
</html>