Making the source code utf-8

Magnus Ihse Bursie magnus.ihse.bursie at oracle.com
Wed Feb 8 19:48:38 UTC 2023


On 2023-02-08 02:32, David Holmes wrote:

> Just to expand on the recent issues here. Limiting source files to 
> ASCII precludes using non-ASCII characters in comments, 
> specifically symbols like smart quotes, mathematical symbols, line 
> symbols for "ASCII art" diagrams etc. I don't have an issue with that 
> as I'm more concerned about such symbols not being displayable in 
> basic editing environments.

Am I correct in hearing "my emacs setup" when you say "basic editing 
environments"? ;-)

I understand your concern, but I frankly believe most editing 
environments fully support UTF-8 these days, even a trivial emacs or vi 
installation in a terminal setup.

I do prefer to keep source code to plain ASCII unless necessary, even 
though I don't share your concerns about environments failing to display 
the code properly. So maybe we can agree on the goal here, even if not 
on the reasons. :-)

But if you have specific development setups that do not handle UTF-8 
correctly, I think we should try to work out a solution for those, and 
document it.

/Magnus
>
> Cheers,
> David
> -----
>
>> I guess we should be clear on whether this affects "all files in the 
>> repository" or "source code". I probably haven't been as precise as I 
>> should have been. It's one thing to declare to javac that source code 
>> is in UTF-8; this clearly affects only Java source files. I don't 
>> know the impact of making such a declaration to git, though, as that 
>> would apply to all files in the repo.
>>
>> As for a JEP, there is precedent for JEPs to cover things about the 
>> code base itself, such as the modular source layout (JEP 201) and 
>> Git/GitHub migration (JEP 357 and JEP 369). Once things settle out, 
>> it would be good to document a policy for what files must be in what 
>> encodings, and how this is enforced. This could be an informational 
>> JEP, if the overhead of publishing it isn't too high. (An alternative 
>> might be a README in the repo itself.)
>>
>> s'marks
>>
>>
>> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
>>
>>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>>
>>>> +1 to make the code build regardless of the user's environment / 
>>>> locale.
>>>>
>>>> Would it be possible to enforce ASCII by default, and allow UTF-8 
>>>> in exceptional cases? This would give us one extra layer of 
>>>> protection against Trojan Source attacks [1].
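
To make the "trojan source" concern concrete, here is a minimal sketch of a scan for the Unicode bidirectional control characters that such attacks rely on. The class and method names (BidiControlScan, findBidiControls) are hypothetical and are not part of jcheck or any existing JDK tooling; the character ranges are the standard embedding, override and isolate controls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not an existing JDK or jcheck API.
class BidiControlScan {

    /** True for the Unicode bidirectional embedding, override and isolate controls. */
    static boolean isBidiControl(int codePoint) {
        return (codePoint >= 0x202A && codePoint <= 0x202E)   // LRE, RLE, PDF, LRO, RLO
            || (codePoint >= 0x2066 && codePoint <= 0x2069);  // LRI, RLI, FSI, PDI
    }

    /** Returns the char offsets in the text at which a bidi control character occurs. */
    static List<Integer> findBidiControls(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isBidiControl(codePoint)) {
                offsets.add(i);
            }
            i += Character.charCount(codePoint);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A right-to-left override hidden inside an otherwise ordinary line.
        String suspicious = "int level = 1; /* \u202E admin */";
        System.out.println("bidi controls at offsets: " + findBidiControls(suspicious));
    }
}
```

An ASCII-only rule would reject these characters as a side effect; a targeted scan like this one would still be useful for files that are allowed to contain UTF-8.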
>>>
>>> ASCII-only certainly has its advantages, yes, including protecting 
>>> against that kind of attack.
>>>
>>> I think we need to treat the entire code base as UTF-8, e.g. in 
>>> terms of what arguments we send to compilers.
>>>
>>> With that said, we could extend jcheck to separately check whether a 
>>> file contains non-ASCII characters, and block such changes from 
>>> being pushed.
>>>
>>> The question then becomes: how do we handle exceptions? By having a 
>>> global "allow-list" containing filenames for files that are 
>>> acceptable to have non-ASCII characters? By requiring them to have a 
>>> certain name pattern? By inserting some kind of meta-data character 
>>> sequence in them that marks them as non-ASCII?
>>>
>>> These are the only options I can think of, and none of them sound 
>>> attractive to me.
>>>
>>> A better approach, I think, is to have some kind of jcheck "warning" 
>>> (not a blocker for integration) that warns you that you have 
>>> non-ASCII characters in the code you are about to check in. That 
>>> will, hopefully, be enough to catch the unintentional introduction 
>>> of e.g. typographic quotes, or malicious attacks (provided that 
>>> reviewers are alert to such warnings).
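
As a rough illustration of the warning idea above, the sketch below reports each line of a file that contains a byte outside the ASCII range. It is not how jcheck is implemented; the class name NonAsciiWarning, the scan method and the message format are all assumptions made for this example. It works on raw bytes, so it does not need to know the file's encoding.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical check, for illustration only; not part of jcheck.
class NonAsciiWarning {

    /** Returns one warning per line that contains at least one byte above 0x7F. */
    static List<String> scan(String fileName, byte[] content) {
        List<String> warnings = new ArrayList<>();
        int line = 1;
        boolean lineAlreadyFlagged = false;
        for (byte b : content) {
            if (b == '\n') {
                line++;
                lineAlreadyFlagged = false;
            } else if ((b & 0x80) != 0 && !lineAlreadyFlagged) {
                warnings.add(fileName + ":" + line + ": warning: non-ASCII byte");
                lineAlreadyFlagged = true;
            }
        }
        return warnings;
    }

    public static void main(String[] args) throws IOException {
        for (String fileName : args) {
            byte[] content = Files.readAllBytes(Path.of(fileName));
            scan(fileName, content).forEach(System.err::println);
        }
        // Deliberately no failing exit code: this is a warning, not a blocker.
    }
}
```

Because the check only warns, it needs no allow-list or naming convention; the decision about whether the non-ASCII content is intentional stays with the reviewer.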
>>>
>>> /Magnus
>>>
>>>
>>>
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>> [1] https://trojansource.codes/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie 
>>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>>
>>>>     Currently, the source code in the JDK is in an ill-defined
>>>>     encoding. There is no official declaration of the encoding
>>>>     used. It is "mostly ASCII", but the relatively few non-ASCII
>>>>     characters used are not well-defined. In many cases, it is
>>>>     latin-1, but I am pretty certain other encodings are used for
>>>>     e.g. Asian translations.
>>>>
>>>>     This is creating unnecessary problems when working with the JDK
>>>>     code base, while providing no benefit. We ended up here not by
>>>>     choice, but by historical accident. Most recently, this issue
>>>>     has surfaced in JDK-8301853, JDK-8301854 and JDK-8301855, but
>>>>     issues relating to this have popped up from time to time, e.g.
>>>>     JDK-8263028.
>>>>
>>>>     As JEP 400[1] confirms, UTF-8 is the way to go. We should follow
>>>>     up on this by converting our code base to UTF-8.
>>>>
>>>>     I have created JDK-8301971[2] with the intention of converting
>>>>     all files to UTF-8, and updating all infrastructure to recognize
>>>>     this fact.
>>>>
>>>>     Even though 99.9% of all text in the JDK repository is ASCII
>>>>     only, with a code base the size of the JDK there are of course
>>>>     many, many instances that need to be checked and/or converted.
>>>>     I can take care of the overarching issues, like updating
>>>>     compiler flags and developing tooling to detect, and try to
>>>>     convert, non-ASCII files based on my best guesses, but in the
>>>>     end, there are likely to be many files which need to be
>>>>     verified by their respective teams, to confirm that I did not
>>>>     assume the incorrect source encoding.
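
The "detect, then convert on a best guess" step could look roughly like the sketch below. It assumes, purely for illustration, that any file which is not valid UTF-8 is ISO-8859-1 (latin-1); the class and method names are hypothetical and do not describe the actual tooling for JDK-8301971.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical conversion helper, for illustration only.
class GuessAndConvert {

    /** Strict check: true only if the bytes decode as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Returns the content as UTF-8. Bytes that are already valid UTF-8 (which
     * includes pure ASCII) are returned unchanged; anything else is ASSUMED to
     * be ISO-8859-1 and re-encoded. That assumption is only a guess.
     */
    static byte[] toUtf8(byte[] bytes) {
        if (isValidUtf8(bytes)) {
            return bytes;
        }
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };  // "café" in ISO-8859-1
        System.out.println("valid UTF-8 before: " + isValidUtf8(latin1));
        System.out.println("valid UTF-8 after:  " + isValidUtf8(toUtf8(latin1)));
    }
}
```

Decoding as ISO-8859-1 never fails, since every byte value maps to a character, so a wrong guess produces readable-looking but incorrect text rather than an error. That is why the converted files would still need to be verified by the teams that own them.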
>>>>
>>>>     So, before I go ahead and start doing this, I want to check:
>>>>
>>>>     * Is everyone onboard with this idea? I do assume that in 2023,
>>>>     having UTF-8 encoding for text files is (or should be) a
>>>>     no-brainer, but I want to verify that there is no-one opposing
>>>>     this.
>>>>
>>>>     * Should I open a JEP for this? On the one hand, it is likely to
>>>>     require a non-trivial amount of work, but on the other hand,
>>>>     there is no change visible for the end user, so it will be kind
>>>>     of pointless to announce. For my part, I could go either way,
>>>>     so I'm interested in hearing opinions, preferably with good
>>>>     rationales, for one way or the other.
>>>>
>>>>     /Magnus
>>>>
>>>>     [1] https://openjdk.org/jeps/400
>>>>     [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>>
>>>>

