Making the source code utf-8

Wed Feb 8 19:42:58 UTC 2023

On 2023-02-07 19:00, Stuart Marks wrote:

> Hi, working toward UTF-8 generally sounds like a good idea.
>
> In particular, declaring to git and to javac that files are in UTF-8 
> makes good sense, in order to reduce undefined behavior.
>
> I think the issue raised here is that we don't want to allow 
> uncontrolled use of non-ASCII UTF-8, so for the most part source files 
> should be ASCII. What files aren't ASCII? Certainly, localized 
> properties files are not. Are there others? Figuring out why files 
> aren't pure ASCII might help determine how to check and enforce some 
> policy. Also, what files aren't in UTF-8? That would be interesting to 
> know as well.
>
I think you hit the nail on the head. We only want "controlled" use of 
non-ASCII.

A first step towards this is understanding what files we have today that 
are non-ASCII. I made a script some years ago that went through the repo 
and reported non-ASCII files (this is not as trivial as it sounds, since 
with too many non-ASCII characters, files start looking like binaries, 
and binary files are of course allowed to have non-ASCII characters in 
them). I need to dig up that script, make another pass over the code 
base, and classify what kind of non-ASCII files we have.
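
As an illustration only (this is not the script mentioned above), such 
a scan could be sketched in Python roughly as follows. The function 
names are made up, and the "a NUL byte means binary" rule is an assumed 
heuristic, similar in spirit to the one git uses when deciding whether 
to show a diff:

```python
import os


def classify(data: bytes) -> str:
    """Return 'binary', 'ascii' or 'non-ascii' for a file's raw bytes."""
    # Assumed heuristic: any NUL byte means the file is binary, and binary
    # files are allowed to contain bytes outside the ASCII range.
    if b"\x00" in data:
        return "binary"
    return "ascii" if data.isascii() else "non-ascii"


def find_non_ascii_files(root: str) -> list[str]:
    """Walk `root` and return the text files that contain non-ASCII bytes."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if classify(f.read()) == "non-ascii":
                    hits.append(path)
    return sorted(hits)
```

Calling `find_non_ascii_files(".")` from the top of a checkout would 
give the list of candidates to classify by hand. A real tool would 
likely need a better binary heuristic than this.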

> Using name patterns to enforce ASCII on a subset of files might not be 
> too bad. Files with suffixes like .java, .cpp, .hpp are clearly source 
> code and should be ASCII. (I'd be interested in hearing if there are 
> any exceptions to this rule.) Files with other extensions, such as 
> .properties or other various metadata, could be non-ASCII. 

Well, yeah... I know straight away that there are test files that 
include "weird" characters in the source code, perhaps because they are 
testing localization stuff. So just banning anything but ASCII from 
.java files is unlikely to be a successful policy. Perhaps we could ban 
non-ASCII for src/**/*.java but not for test/**/*.java, but that 
immediately turns into a much more complicated rule. And what about 
Java code for build tools in make/**/*.java? Or perhaps Java snippets 
need to be excluded too?
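
To make the shape of such a rule concrete, here is a small Python 
sketch. Everything in it is hypothetical: the suffix list comes from 
the suggestion quoted above, the exemption for test/ is just the idea 
floated here, and the example paths are invented. It is not an actual 
or proposed jcheck rule.

```python
from pathlib import PurePosixPath

# Hypothetical policy: these suffixes are "clearly source code" ...
ASCII_ONLY_SUFFIXES = {".java", ".cpp", ".hpp"}
# ... except under these top-level directories, where tests may need
# non-ASCII characters on purpose.
EXEMPT_TOP_DIRS = {"test"}


def requires_ascii(path: str) -> bool:
    """Return True if the repo-relative `path` must be pure ASCII."""
    p = PurePosixPath(path)
    if p.suffix not in ASCII_ONLY_SUFFIXES:
        return False
    # The first path component decides whether the file is exempt.
    return p.parts[0] not in EXEMPT_TOP_DIRS


# Invented paths, for illustration only.
print(requires_ascii("src/some.module/Foo.java"))   # True
print(requires_ascii("test/some/dir/FooTest.java")) # False
print(requires_ascii("make/some/tool/Gen.java"))    # True
```

Note that make/ ends up ASCII-only here simply because nobody listed 
it as exempt, which is exactly the kind of question the rule does not 
answer by itself.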

Let me get back with some hard figures of how things actually look. 
That does not fully determine our policy (we must still have a reasonable 
way to grow and change the code base), but it can indicate what kind of 
rules seem viable.

> I guess we should be clear on whether this affects "all files in the 
> repository" or "source code". I probably haven't been as precise as I 
> should have been. It's one thing to declare to javac that source code 
> is in UTF-8; this clearly affects only Java source files. I don't know 
> the impact of making such a declaration to git, though, as that would 
> apply to all files in the repo.
>
I think we can tell git about different files, if we want. There is a 
`.gitattributes` file, which I believe has a syntax similar to 
`.gitignore`. Currently it says:

*       -text

which means that git does not treat any file as text, so in effect 
everything is handled like binary data, with no line-ending conversion. 
(This is not really ideal, and is a combination of the problem with 
encodings, and a way to ensure we get unix linefeeds even on Windows.)

But, I think we should make sure all text files are properly UTF-8. I 
mean, there is really no reason *not* to do this -- it is just that you 
(and I!) are unsure about the consequences of this. But any potential 
for misunderstanding the encoding of a file is the potential of a build 
failure or a bug, so I think it's better to grab the bull by the horns 
and just fix it.
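
For what it is worth, checking whether a file is already valid UTF-8, 
and converting it if not, is mechanically simple. The Python sketch 
below shows the idea; the function names are made up, and the default 
guess of latin-1 is only an assumption that has to be confirmed per 
file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (ASCII always does)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, assumed_encoding: str = "latin-1") -> bytes:
    """Re-encode `data` as UTF-8, guessing `assumed_encoding` if needed."""
    if is_valid_utf8(data):
        return data  # Already fine, leave untouched.
    # Assumption: the file really is in `assumed_encoding`.
    return data.decode(assumed_encoding).encode("utf-8")


legacy = b"caf\xe9"  # "café" encoded as latin-1
print(is_valid_utf8(legacy))                         # False
print(to_utf8(legacy) == "café".encode("utf-8"))     # True
```

The hard part is the guess: decoding as latin-1 never fails, so a file 
that is actually in some other legacy encoding would be converted 
without any error and silently end up with the wrong characters.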

> As for a JEP, there is precedent for JEPs to cover things about the 
> code base itself, such as the modular source layout (JEP 201) and 
> Git/GitHub migration (JEP 357 and JEP 369). Once things settle out, it 
> would be good to document a policy for what files must be in what 
> encodings, and how this is enforced. This could be an informational 
> JEP, if the overhead of publishing it isn't too high. (An alternative 
> might be a README in the repo itself.)
>
So, what you are saying is, you can do a JEP but you can also just write 
a README? :-) I think this means the question of a JEP or not is still 
open. Or did you mean that the precedents implied that it would indeed 
be good to have a JEP about source code changes, and not just a 
post-factum informational JEP?

/Magnus

> s'marks
>
>
> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
>
>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>
>>> +1 to make the code build regardless of the user's environment / locale.
>>>
>>> Would it be possible to enforce ASCII by default, and allow UTF-8 in 
>>> exceptional cases? This would give us one extra layer of protection 
>>> against trojan sources [1]
>>
>> ASCII-only certainly has its advantages, yes, including protecting 
>> from that kind of attack.
>>
>> I think we need to treat the entire code base as UTF-8, e.g. in terms 
>> of what arguments we send to compilers.
>>
>> With that said, we could extend jcheck to separately check if a file 
>> contains non-ASCII characters, and block such changes from being pushed.
>>
>> The question then becomes: how do we handle exceptions? By having a 
>> global "allow-list" containing filenames for files that are 
>> acceptable to have non-ASCII characters? By requiring them to have a 
>> certain name pattern? By inserting some kind of meta-data character 
>> sequence in them that marks them as non-ASCII?
>>
>> These are the only options I can think of, and none of them sound 
>> attractive to me.
>>
>> A better approach, I think, is to have some kind of jcheck "warning" 
>> (not a blocker for integration) that warns you that you have 
>> non-ASCII characters in the code you are about to check in. That 
>> will, hopefully, be enough to fix unintentional introduction of e.g. 
>> typographic quotes, or malicious attacks (given that reviewers are 
>> alert for such warnings).
>>
>> /Magnus
>>
>>
>>
>>>
>>> Regards,
>>> Daniel
>>>
>>> [1] https://trojansource.codes/
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie 
>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>
>>>     Currently, the source code in the JDK is in an ill-defined encoding.
>>>     There is no official declaration of the encoding used. It is "mostly
>>>     ASCII", but the relatively few non-ASCII characters used are not
>>>     well-defined. In many cases, it is latin-1, but I am pretty certain
>>>     other encodings are used for e.g. Asian translations.
>>>
>>>     This is creating unnecessary problems when working with the JDK code
>>>     base, while providing no benefit. We ended up here not by choice, but
>>>     by historical accident. Most recently, this issue has surfaced in
>>>     JDK-8301853, JDK-8301854 and JDK-8301855, but issues relating to this
>>>     have popped up from time to time, e.g. JDK-8263028.
>>>
>>>     As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up
>>>     on this by converting our code base to UTF-8.
>>>
>>>     I have created JDK-8301971[2] with the intention of converting all
>>>     files to UTF-8, and updating all infrastructure to recognize this fact.
>>>
>>>     Even though 99.9% of all text in the JDK repository is ASCII only,
>>>     with a code base the size of the JDK there are of course many, many
>>>     instances that need to be checked and/or converted. I can take care
>>>     of the overarching issues, like updating compiler flags and
>>>     developing tooling to detect, and try to convert, non-ASCII files
>>>     based on my best guesses, but in the end, there are likely to be many
>>>     files which need to be verified by their respective teams, to make
>>>     sure that I did not assume the incorrect source encoding.
>>>
>>>     So, before I go ahead and start doing this, I want to check:
>>>
>>>     * Is everyone on board with this idea? I do assume that in 2023,
>>>     having UTF-8 encoding for text files is (or should be) a no-brainer,
>>>     but I want to verify that there is no-one opposing this.
>>>
>>>     * Should I open a JEP for this? On the one hand, it is likely to
>>>     require a non-trivial amount of work, but on the other hand, there is
>>>     no change visible for the end user, so it will be kind of pointless
>>>     to announce. For my part, I could go either way, so I'm interested in
>>>     hearing opinions, preferably with good rationales, for one way or the
>>>     other.
>>>
>>>     /Magnus
>>>
>>>     [1] https://openjdk.org/jeps/400
>>>     [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>
>>>