Making the source code utf-8

David Holmes david.holmes at oracle.com
Thu Feb 9 02:36:40 UTC 2023


Hit send too soon :)

On 9/02/2023 12:27 pm, David Holmes wrote:
> On 9/02/2023 5:48 am, Magnus Ihse Bursie wrote:
>> On 2023-02-08 02:32, David Holmes wrote:
>>
>>> Just to expand on the recent issues here. Limiting source files to 
>>> ASCII precludes using non-ASCII characters in comments - 
>>> specifically symbols like smart quotes, mathematical symbols, line 
>>> symbols for "ASCII art" diagrams etc. I don't have an issue with that 
>>> as I'm more concerned about such symbols not being displayable in 
>>> basic editing environments.
>>
>> Am I correct in hearing "my emacs setup" when you say "basic editing 
>> environments"? ;-)
> 
> Yep, emacs or vi/vim in a terminal aren't displaying these characters. 
> I've googled the problem for vim and nothing I set seems to solve it. 
> Cat'ing to the terminal also won't display them correctly.
> 
> I've no doubt there is some solution to this but I have better things to 
> do with my time than try and find it. Source code files are not word 
> processing documents, we don't need fancy font symbols being rendered.

Found the problem - my terminal was in ISO-8859-1:1998 (Latin-1, West 
Europe) mode, not UTF-8.

David
----
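
For anyone hitting the same symptom, here is a minimal sketch in Python of what a Latin-1 terminal does to UTF-8 text. It is only an illustration: the helper name misread_as_latin1 is made up for this example, and a real terminal may show the C1 control bytes differently (or not at all).

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes the way a Latin-1 terminal would."""
    return text.encode("utf-8").decode("latin-1")


# Escapes keep this example itself pure ASCII.
original = "\u201cquoted\u201d"  # the word "quoted" inside typographic quotes
garbled = misread_as_latin1(original)

# Each typographic quote is three bytes in UTF-8, so it shows up as
# three unrelated Latin-1 characters instead of one quote.
print(ascii(original))
print(ascii(garbled))

# No data is lost: undoing the wrong decode recovers the original text.
assert garbled.encode("latin-1").decode("utf-8") == original
```

Pure ASCII text is unaffected, which is consistent with the problem only showing up on the handful of lines that contain non-ASCII symbols.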

> Cheers,
> David
> 
>> I understand your concern, but I frankly believe most editing 
>> environments fully support UTF-8 these days, even a trivial emacs or 
>> vi installation in a terminal setup.
>>
>> I do prefer to keep source code to plain ASCII unless necessary, even 
>> though I don't share your concerns over environments failing to display 
>> the code properly. So maybe we can agree on the goal here, even if not 
>> on the reasons. :-)
>>
>> But if you have specific development setups that do not handle UTF-8 
>> correctly, I think we should try to work out a solution for those, and 
>> document it.
>>
>> /Magnus
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>>> I guess we should be clear on whether this affects "all files in the 
>>>> repository" or "source code". I probably haven't been as precise as 
>>>> I should have been. It's one thing to declare to javac that source 
>>>> code is in UTF-8; this clearly affects only Java source files. I 
>>>> don't know the impact of making such a declaration to git, though, 
>>>> as that would apply to all files in the repo.
>>>>
>>>> As for a JEP, there is precedent for JEPs to cover things about the 
>>>> code base itself, such as the modular source layout (JEP 201) and 
>>>> Git/GitHub migration (JEP 357 and JEP 369). Once things settle out, 
>>>> it would be good to document a policy for what files must be in what 
>>>> encodings, and how this is enforced. This could be an informational 
>>>> JEP, if the overhead of publishing it isn't too high. (An 
>>>> alternative might be a README in the repo itself.)
>>>>
>>>> s'marks
>>>>
>>>>
>>>> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
>>>>
>>>>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>>>>
>>>>>> +1 to make the code build regardless of the user's environment / 
>>>>>> locale.
>>>>>>
>>>>>> Would it be possible to enforce ASCII by default, and allow UTF-8 
>>>>>> in exceptional cases? This would give us one extra layer of 
>>>>>> protection against trojan sources [1]
>>>>>
>>>>> ASCII-only certainly has its advantages, yes, including protecting 
>>>>> against that kind of attack.
>>>>>
>>>>> I think we need to treat the entire code base as UTF-8, e.g. in 
>>>>> terms of what arguments we send to compilers.
>>>>>
>>>>> With that said, we could extend jcheck to separately check if a 
>>>>> file contains non-ASCII characters, and prevent such changes from 
>>>>> being pushed.
>>>>>
>>>>> The question then becomes: how do we handle exceptions? By having a 
>>>>> global "allow-list" containing filenames for files that are 
>>>>> acceptable to have non-ASCII characters? By requiring them to have 
>>>>> a certain name pattern? By inserting some kind of meta-data 
>>>>> character sequence in them that marks them as non-ASCII?
>>>>>
>>>>> These are the only options I can think of, and none of them sound 
>>>>> attractive to me.
>>>>>
>>>>> A better approach, I think, is to have some kind of jcheck 
>>>>> "warning" (not a blocker for integration) that warns you that you 
>>>>> have non-ASCII characters in the code you are about to check in. 
>>>>> That will, hopefully, be enough to catch the unintentional 
>>>>> introduction of e.g. typographic quotes, or malicious attacks 
>>>>> (provided that reviewers are alert to such warnings).
>>>>>
>>>>> /Magnus
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Daniel
>>>>>>
>>>>>> [1] https://trojansource.codes/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie 
>>>>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>>>>
>>>>>>     Currently, the source code in the JDK is in an ill-defined
>>>>>>     encoding. There is no official declaration of the encoding
>>>>>>     used. It is "mostly ASCII", but the relatively few non-ASCII
>>>>>>     characters used are not well-defined. In many cases, it is
>>>>>>     latin-1, but I am pretty certain other encodings are used for
>>>>>>     e.g. Asian translations.
>>>>>>
>>>>>>     This is creating unnecessary problems when working with the
>>>>>>     JDK code base, while providing no benefit. We ended up here
>>>>>>     not by choice, but by historical accident. Most recently, this
>>>>>>     issue has surfaced in JDK-8301853, JDK-8301854 and JDK-8301855,
>>>>>>     but issues relating to this have popped up from time to time,
>>>>>>     e.g. JDK-8263028.
>>>>>>
>>>>>>     As JEP 400[1] confirms, UTF-8 is the way to go. We should
>>>>>>     follow up on this by converting our code base to UTF-8.
>>>>>>
>>>>>>     I have created JDK-8301971[2] with the intention of converting
>>>>>>     all files to UTF-8, and updating all infrastructure to
>>>>>>     recognize this fact.
>>>>>>
>>>>>>     Even though 99.9% of all text in the JDK repository is ASCII
>>>>>>     only, with a code base the size of the JDK there are of course
>>>>>>     many, many instances that need to be checked and/or converted.
>>>>>>     I can take care of the overarching issues, like updating
>>>>>>     compiler flags and developing tooling to detect and try to
>>>>>>     convert non-ASCII files based on my best guesses, but in the
>>>>>>     end, there are likely to be many files which need to be
>>>>>>     verified by their respective teams, to make sure I did not
>>>>>>     assume the incorrect source encoding.
>>>>>>
>>>>>>     So, before I go ahead and start doing this, I want to check:
>>>>>>
>>>>>>     * Is everyone on board with this idea? I do assume that in
>>>>>>     2023, having UTF-8 encoding for text files is (or should be) a
>>>>>>     no-brainer, but I want to verify that there is no one opposing
>>>>>>     this.
>>>>>>
>>>>>>     * Should I open a JEP for this? On the one hand, it is likely
>>>>>>     to require a non-trivial amount of work, but on the other
>>>>>>     hand, there is no change visible for the end user, so it will
>>>>>>     be kind of pointless to announce. For my part, I could go
>>>>>>     either way, so I'm interested in hearing opinions, preferably
>>>>>>     with good rationales, for one way or the other.
>>>>>>
>>>>>>     /Magnus
>>>>>>
>>>>>>     [1] https://openjdk.org/jeps/400
>>>>>>     [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>>>>
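
The "detect, and try to convert ... based on my best guesses" step from the original message could look roughly like the Python sketch below. This is not the tooling proposed for JDK-8301971, just an assumed approach: accept bytes that already decode as strict UTF-8, otherwise fall back to a guessed legacy encoding. The function name to_utf8 and the Latin-1 default are hypothetical.

```python
from typing import Tuple


def to_utf8(data: bytes, fallback: str = "latin-1") -> Tuple[bytes, str]:
    """Return (utf8_bytes, assumed_source_encoding) for a file's raw bytes."""
    try:
        # Pure ASCII and valid UTF-8 both pass a strict decode unchanged.
        data.decode("utf-8")
        return data, "utf-8"
    except UnicodeDecodeError:
        # Best guess only: Latin-1 accepts any byte sequence, so a wrong
        # guess is silent and still needs a human to verify the result.
        return data.decode(fallback).encode("utf-8"), fallback


converted, guessed = to_utf8(b"caf\xe9")  # 0xE9 is e-acute in Latin-1
print(guessed, ascii(converted.decode("utf-8")))
```

Because the fallback never fails, the guessed encoding is worth recording per file, which matches the point that the respective teams would have to verify the outcome.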
>>>>>>
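
As a rough illustration of the non-blocking "warning" idea discussed in this thread, the Python sketch below lists every non-ASCII character in a file's bytes and marks the bidirectional control characters associated with Trojan Source attacks. It is not jcheck code and does not reflect how jcheck is implemented; find_non_ascii and BIDI_CONTROLS are names invented for this example.

```python
import unicodedata
from typing import List, Tuple

# Explicit bidirectional formatting characters: U+202A..U+202E and U+2066..U+2069.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def find_non_ascii(data: bytes) -> List[Tuple[int, int, str]]:
    """Return (line, column, description) for each non-ASCII character."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Line 0 signals a whole-file problem; the column is the byte offset.
        return [(0, err.start, "not valid UTF-8")]
    findings = []
    # Split on "\n" only, so unusual Unicode line separators are reported too.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                desc = "U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNKNOWN"))
                if ord(ch) in BIDI_CONTROLS:
                    desc += " (bidi control)"
                findings.append((line_no, col, desc))
    return findings


sample = 'String s = "ok";\n// \u201cquoted\u201d\n'.encode("utf-8")
for line_no, col, desc in find_non_ascii(sample):
    print("warning: line %d, column %d: %s" % (line_no, col, desc))
```

Reporting rather than rejecting avoids the need for an allow-list of exempt files, at the cost of relying on reviewers to read the warnings.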

