Making the source code utf-8

David Holmes david.holmes at oracle.com
Thu Feb 9 02:27:14 UTC 2023


On 9/02/2023 5:48 am, Magnus Ihse Bursie wrote:
> On 2023-02-08 02:32, David Holmes wrote:
> 
>> Just to expand on the recent issues here. Limiting source files to
>> ASCII precludes using non-ASCII characters in comments -
>> specifically symbols like smart quotes, mathematical symbols, line
>> symbols for "ascii art" diagrams etc. I don't have an issue with that
>> as I'm more concerned about such symbols not being displayable in
>> basic editing environments.
> 
> Am I correct in hearing "my emacs setup" when you say "basic editing 
> environments"? ;-)

Yep, emacs or vi/vim in a terminal aren't displaying these characters.
I've googled the problem for vim and nothing I set seems to solve it. 
Cat'ing to the terminal also won't display them correctly.

I've no doubt there is some solution to this but I have better things to 
do with my time than try and find it. Source code files are not word 
processing documents; we don't need fancy font symbols being rendered.

Cheers,
David

> I understand your concern, but I frankly believe most editing 
> environments fully support UTF-8 these days, even a trivial emacs or vi 
> installation in a terminal setup.
> 
> I do prefer to keep source code to plain ASCII unless necessary, even 
> though I don't share your concerns over environments failing to display
> the code properly. So maybe we can agree on the goal here, even if not 
> on the reasons. :-)
> 
> But if you have specific development setups that do not handle UTF-8
> correctly, I think we should try to work out a solution for those, and 
> document it.
> 
> /Magnus
>>
>> Cheers,
>> David
>> -----
>>
>>> I guess we should be clear on whether this affects "all files in the 
>>> repository" or "source code". I probably haven't been as precise as I 
>>> should have been. It's one thing to declare to javac that source code 
>>> is in UTF-8; this clearly affects only Java source files. I don't 
>>> know the impact of making such a declaration to git, though, as that 
>>> would apply to all files in the repo.
>>>
>>> As for a JEP, there is precedent for JEPs to cover things about the 
>>> code base itself, such as the modular source layout (JEP 201) and 
>>> Git/GitHub migration (JEP 357 and JEP 369). Once things settle out, 
>>> it would be good to document a policy for what files must be in what 
>>> encodings, and how this is enforced. This could be an informational 
>>> JEP, if the overhead of publishing it isn't too high. (An alternative 
>>> might be a README in the repo itself.)
>>>
>>> s'marks
>>>
>>>
>>> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
>>>
>>>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>>>
>>>>> +1 to make the code build regardless of the user's environment / 
>>>>> locale.
>>>>>
>>>>> Would it be possible to enforce ASCII by default, and allow UTF-8 
>>>>> in exceptional cases? This would give us one extra layer of 
>>>>> protection against trojan sources [1]
>>>>
>>>> ASCII-only certainly has its advantages, yes, including protecting
>>>> against that kind of attack.
>>>>
>>>> I think we need to treat the entire code base as UTF-8, e.g. in 
>>>> terms of what arguments we send to compilers.
>>>>
>>>> With that said, we could extend jcheck to separately check if a file
>>>> contains non-ASCII characters, and block such changes from being pushed.
>>>>
>>>> The question then becomes: how do we handle exceptions? By having a 
>>>> global "allow-list" containing filenames for files that are 
>>>> acceptable to have non-ASCII characters? By requiring them to have a 
>>>> certain name pattern? By inserting some kind of meta-data character 
>>>> sequence in them that marks them as non-ASCII?
>>>>
>>>> These are the only options I can think of, and none of them sound 
>>>> attractive to me.
>>>>
>>>> A better approach, I think, is to have some kind of jcheck "warning" 
>>>> (not a blocker for integration) that warns you that you have 
>>>> non-ASCII characters in the code you are about to check in. That 
>>>> will, hopefully, be enough to fix unintentional introduction of e.g. 
>>>> typographic quotes, or malicious attacks (given that reviewers are 
>>>> alert for such warnings).
>>>>
>>>> /Magnus
>>>>
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Daniel
>>>>>
>>>>> [1] https://trojansource.codes/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie
>>>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>>>
>>>>>     Currently, the source code in the JDK is in an ill-defined
>>>>>     encoding. There is no official declaration of the encoding
>>>>>     used. It is "mostly ASCII", but the relatively few non-ASCII
>>>>>     characters used are not well-defined. In many cases, it is
>>>>>     latin-1, but I am pretty certain other encodings are used for
>>>>>     e.g. Asian translations.
>>>>>
>>>>>     This is creating unnecessary problems when working with the
>>>>>     JDK code base, while providing no benefit. We ended up here
>>>>>     not by choice, but by historical accident. Most recently, this
>>>>>     issue has surfaced in JDK-8301853, JDK-8301854 and
>>>>>     JDK-8301855, but issues relating to this have popped up from
>>>>>     time to time, e.g. JDK-8263028.
>>>>>
>>>>>     As JEP 400[1] confirms, UTF-8 is the way to go. We should
>>>>>     follow up on this by converting our code base to UTF-8.
>>>>>
>>>>>     I have created JDK-8301971[2] with the intention of converting
>>>>>     all files to UTF-8, and updating all infrastructure to
>>>>>     recognize this fact.
>>>>>
>>>>>     Even though 99.9% of all text in the JDK repository is ASCII
>>>>>     only, with a code base the size of the JDK there are of course
>>>>>     many, many instances that need to be checked and/or converted.
>>>>>     I can take care of the overarching issues, like updating
>>>>>     compiler flags and developing tooling to detect and try to
>>>>>     convert non-ASCII files based on my best guesses, but in the
>>>>>     end, there are likely to be many files which need to be
>>>>>     verified by their respective teams, to make sure that I did
>>>>>     not assume the incorrect source encoding.
>>>>>
>>>>>     So, before I go ahead and start doing this, I want to check:
>>>>>
>>>>>     * Is everyone on board with this idea? I do assume that in
>>>>>     2023, having UTF-8 encoding for text files is (or should be) a
>>>>>     no-brainer, but I want to verify that there is no-one opposing
>>>>>     this.
>>>>>
>>>>>     * Should I open a JEP for this? On the one hand, it is likely
>>>>>     to require a non-trivial amount of work, but on the other
>>>>>     hand, there is no change visible for the end user, so it will
>>>>>     be kind of pointless to announce. For my part, I could go
>>>>>     either way, so I'm interested in hearing opinions, preferably
>>>>>     with good rationales, for one way or the other.
>>>>>
>>>>>     /Magnus
>>>>>
>>>>>     [1] https://openjdk.org/jeps/400
>>>>>     [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>>>
>>>>>
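
For illustration only: the non-blocking jcheck "warning" suggested earlier
in the thread could work roughly like the sketch below. This is not jcheck
code and not anything proposed in the thread; the function names
find_non_ascii and warn_non_ascii are hypothetical, and flagging Unicode
bidirectional control characters separately is an added assumption, taken
from the kind of attack described at trojansource.codes. The file is read
as UTF-8, and undecodable bytes are replaced by U+FFFD so that they are
reported rather than aborting the scan.

```python
from typing import List, Tuple

# Unicode bidirectional control characters (U+202A..U+202E, U+2066..U+2069).
# Treating these as a separate category is an assumption of this sketch.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def find_non_ascii(text: str) -> List[Tuple[int, int, str, bool]]:
    """Return (line, column, character, is_bidi_control) per non-ASCII character."""
    findings = []
    # Split on "\n" only, so that Unicode line separators such as U+2028
    # are reported as findings instead of being treated as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                findings.append((line_no, col_no, ch, ord(ch) in BIDI_CONTROLS))
    return findings


def warn_non_ascii(path: str) -> int:
    """Print one warning per non-ASCII character and return the count.

    The return value is informational; nothing here blocks a push.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    findings = find_non_ascii(text)
    for line_no, col_no, ch, is_bidi in findings:
        kind = "bidi control" if is_bidi else "non-ASCII"
        print(f"warning: {path}:{line_no}:{col_no}: "
              f"{kind} character U+{ord(ch):04X}")
    return len(findings)
```

A reviewer-facing tool would call warn_non_ascii on each changed file and
show the warnings next to the change, which covers both accidental
typographic quotes and deliberately hidden control characters.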

