Making the source code utf-8

Wed Feb 8 01:32:10 UTC 2023

On 8/02/2023 4:00 am, Stuart Marks wrote:
> Hi, working toward UTF-8 generally sounds like a good idea.
> 
> In particular, declaring to git and to javac that files are in UTF-8 
> makes good sense, in order to reduce undefined behavior.
> 
> I think the issue raised here is that we don't want to allow 
> uncontrolled use of non-ASCII UTF-8, so for the most part source files 
> should be ASCII. What files aren't ASCII? Certainly, localized 
> properties files are not. Are there others? Figuring out why files 
> aren't pure ASCII might help determine how to check and enforce some 
> policy. Also, what files aren't in UTF-8? That would be interesting to 
> know as well.
> 
> Using name patterns to enforce ASCII on a subset of files might not be 
> too bad. Files with suffixes like .java, .cpp, .hpp are clearly source 
> code and should be ASCII. (I'd be interested in hearing if there are any 
> exceptions to this rule.) Files with other extensions, such as 
> .properties or other various metadata, could be non-ASCII.

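On the question of which files aren't in UTF-8: a strict decode is one
way to find out. This is a rough sketch only; the names Utf8Check,
isAscii, isValidUtf8 and classify are made up for illustration and are
not part of jcheck or the build. It classifies each file named on the
command line as ASCII, UTF-8, or neither:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not an existing JDK or Skara tool.
final class Utf8Check {

    /** True if no byte has its high bit set, i.e. the content is 7-bit ASCII. */
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    /** True if the bytes decode as UTF-8 with no malformed sequences. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Coarse classification; pure ASCII is reported separately from other UTF-8. */
    static String classify(byte[] bytes) {
        if (isAscii(bytes)) {
            return "ASCII";
        }
        return isValidUtf8(bytes) ? "UTF-8" : "not UTF-8";
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] content = Files.readAllBytes(Paths.get(name));
            System.out.println(classify(content) + "\t" + name);
        }
    }
}
```

A file that is "not UTF-8" still needs a human to decide what encoding
it actually is; the check can only say that it is not valid UTF-8.
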
Just to expand on the recent issues here. Limiting source files to ASCII 
precludes the use of non-ASCII characters in comments, specifically 
symbols like smart quotes, mathematical symbols, and line-drawing 
symbols for "ASCII art" diagrams. I don't have an issue with that, as 
I'm more concerned about such symbols not being displayable in basic 
editing environments.
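
To make the kind of check being discussed concrete, here is a minimal 
sketch that reports where non-ASCII characters appear in a piece of 
text. It is not how jcheck works; the class name NonAsciiScan, the scan 
method and the "line:column U+XXXX" report format are assumptions for 
illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of jcheck.
final class NonAsciiScan {

    /**
     * Returns one entry per code point above U+007F, formatted as
     * "line:column U+XXXX" with 1-based line and column numbers.
     * Columns count code points, not UTF-16 units.
     */
    static List<String> scan(String text) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int column = 1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp == '\n') {
                line++;
                column = 1;
            } else {
                if (cp > 0x7F) {
                    hits.add(String.format("%d:%d U+%04X", line, column, cp));
                }
                column++;
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Typographic quotes (U+201C, U+201D) are written as escapes
        // so that this example itself stays pure ASCII.
        String sample = "int x = 1; // \u201Csmart\u201D quotes\nint y = 2;\n";
        for (String hit : scan(sample)) {
            System.out.println(hit);
        }
    }
}
```

Whether a hit should block a push or only produce a warning is a policy 
question; the scan itself is the same either way.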

Cheers,
David
-----

> I guess we should be clear on whether this affects "all files in the 
> repository" or "source code". I probably haven't been as precise as I 
> should have been. It's one thing to declare to javac that source code is 
> in UTF-8; this clearly affects only Java source files. I don't know the 
> impact of making such a declaration to git, though, as that would apply 
> to all files in the repo.
> 
> As for a JEP, there is precedent for JEPs to cover things about the code 
> base itself, such as the modular source layout (JEP 201) and Git/GitHub 
> migration (JEP 357 and JEP 369). Once things settle out, it would be 
> good to document a policy for what files must be in what encodings, and 
> how this is enforced. This could be an informational JEP, if the 
> overhead of publishing it isn't too high. (An alternative might be a 
> README in the repo itself.)
> 
> s'marks
> 
> 
> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
> 
>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>
>>> +1 to make the code build regardless of the user's environment / locale.
>>>
>>> Would it be possible to enforce ASCII by default, and allow UTF-8 in 
>>> exceptional cases? This would give us one extra layer of protection 
>>> against trojan sources [1]
>>
>> ASCII-only certainly has its advantages, yes, including protecting 
>> from that kind of attack.
>>
>> I think we need to treat the entire code base as UTF-8, e.g. in terms 
>> of what arguments we send to compilers.
>>
>> With that said, we could extend jcheck to separately check if a file 
>> contains non-ASCII characters, and block such changes from being 
>> pushed.
>>
>> The question then becomes: how do we handle exceptions? By having a 
>> global "allow-list" containing filenames for files that are acceptable 
>> to have non-ASCII characters? By requiring them to have a certain name 
>> pattern? By inserting some kind of meta-data character sequence in 
>> them that marks them as non-ASCII?
>>
>> These are the only options I can think of, and none of them sound 
>> attractive to me.
>>
>> A better approach, I think, is to have some kind of jcheck "warning" 
>> (not a blocker for integration) that warns you that you have non-ASCII 
>> characters in the code you are about to check in. That will, 
>> hopefully, be enough to catch unintentional introduction of e.g. 
>> typographic quotes, or malicious attacks (given that reviewers are 
>> alert to such warnings).
>>
>> /Magnus
>>
>>
>>
>>>
>>> Regards,
>>> Daniel
>>>
>>> [1] https://trojansource.codes/
>>>
>>>
>>>
>>>
>>>
>>>
>>> Tue, 7 Feb 2023 at 13:28 Magnus Ihse Bursie 
>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>
>>>     Currently, the source code in the JDK is in an ill-defined encoding.
>>>     There is no official declaration of the encoding used. It is "mostly
>>>     ASCII", but the relatively few non-ASCII characters used are not
>>>     well-defined. In many cases, it is latin-1, but I am pretty certain
>>>     other encodings are used for e.g. Asian translations.
>>>
>>>     This is creating unnecessary problems when working with the JDK
>>>     code base, while providing no benefit. We ended up here not by
>>>     choice, but by historical accident. Most recently, this issue has
>>>     surfaced in JDK-8301853, JDK-8301854 and JDK-8301855, but issues
>>>     relating to this have popped up from time to time, e.g.
>>>     JDK-8263028.
>>>
>>>     As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up
>>>     on this by converting our code base to UTF-8.
>>>
>>>     I have created JDK-8301971[2] with the intention of converting all
>>>     files to UTF-8, and updating all infrastructure to recognize this
>>>     fact.
>>>
>>>     Even though 99.9% of all text in the JDK repository is ASCII only,
>>>     with a code base the size of the JDK there are of course many, many
>>>     instances that need to be checked and/or converted. I can take care
>>>     of the overarching issues, like updating compiler flags and
>>>     developing tooling to detect, and try to convert, non-ASCII files
>>>     based on my best guesses, but in the end, there are likely to be
>>>     many files which need to be verified by their respective teams, to
>>>     make sure that I did not assume the incorrect source encoding.
>>>
>>>     So, before I go ahead and start doing this, I want to check:
>>>
>>>     * Is everyone on board with this idea? I do assume that in 2023,
>>>     having UTF-8 encoding for text files is (or should be) a
>>>     no-brainer, but I want to verify that there is no one opposing
>>>     this.
>>>
>>>     * Should I open a JEP for this? On the one hand, it is likely to
>>>     require a non-trivial amount of work, but on the other hand, there
>>>     is no change visible for the end user, so it will be kind of
>>>     pointless to announce. For my part, I could go either way, so I'm
>>>     interested in hearing opinions, preferably with good rationales,
>>>     for one way or the other.
>>>
>>>     /Magnus
>>>
>>>     [1] https://openjdk.org/jeps/400
>>>     [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>
>>>