Making the source code utf-8
David Holmes
david.holmes at oracle.com
Thu Feb 9 02:36:40 UTC 2023
Hit send too soon :)
On 9/02/2023 12:27 pm, David Holmes wrote:
> On 9/02/2023 5:48 am, Magnus Ihse Bursie wrote:
>> On 2023-02-08 02:32, David Holmes wrote:
>>
>>> Just to expand on the recent issues here. Limiting source files to
>>> ascii precludes the ability of using non-ascii in comments -
>>> specifically symbols like smart-quotes, mathematical symbols, line
>>> symbols for "ascii art" diagrams etc. I don't have an issue with that
>>> as I'm more concerned about such symbols not being displayable in
>>> basic editing environments.
>>
>> Am I correct in hearing "my emacs setup" when you say "basic editing
>> environments"? ;-)
>
> Yep emacs or vi/vim in a terminal aren't displaying these characters.
> I've googled the problem for vim and nothing I set seems to solve it.
> Cat'ing to the terminal also won't display them correctly.
>
> I've no doubt there is some solution to this but I have better things to
> do with my time than try and find it. Source code files are not word
> processing documents, we don't need fancy font symbols being rendered.
Found the problem - my terminal was in ISO-8859-1:1998 (Latin-1, West
Europe) mode, not UTF-8.
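A minimal sketch of that failure mode, in Python (not something posted in this thread): it decodes the UTF-8 bytes of a few non-ASCII characters as Latin-1, which is roughly what a Latin-1 terminal does when shown a UTF-8 file. The function name `as_latin1_terminal` is made up for illustration.

```python
def as_latin1_terminal(text: str) -> str:
    """Return what UTF-8 encoded text looks like if each byte is read as Latin-1."""
    # Latin-1 maps every byte value 0-255 to a code point, so this never fails;
    # it just turns each multi-byte UTF-8 sequence into several wrong characters.
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    for sample in ["é", "“quoted”", "plain ascii"]:
        print(repr(sample), "->", repr(as_latin1_terminal(sample)))
```

Pure ASCII text comes through unchanged, which is why the problem only shows up on the rare lines that contain smart quotes or similar symbols.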
David
----
> Cheers,
> David
>
>> I understand your concern, but I frankly believe most editing
>> environments fully support UTF-8 these days, even a trivial emacs or
>> vi installation in a terminal setup.
>>
>> I do prefer to keep source code to plain ASCII unless necessary, even
>> though I don't share your concerns over environments failing to
>> display the code properly. So maybe we can agree on the goal here,
>> even if not on the reasons. :-)
>>
>> But if you have specific development setups that do not handle UTF-8
>> correctly, I think we should try to work out a solution for those, and
>> document it.
>>
>> /Magnus
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>>> I guess we should be clear on whether this affects "all files in the
>>>> repository" or "source code". I probably haven't been as precise as
>>>> I should have been. It's one thing to declare to javac that source
>>>> code is in UTF-8; this clearly affects only Java source files. I
>>>> don't know the impact of making such a declaration to git, though,
>>>> as that would apply to all files in the repo.
>>>>
>>>> As for a JEP, there is precedent for JEPs to cover things about the
>>>> code base itself, such as the modular source layout (JEP 201) and
>>>> Git/GitHub migration (JEP 357 and JEP 369). Once things settle out,
>>>> it would be good to document a policy for what files must be in what
>>>> encodings, and how this is enforced. This could be an informational
>>>> JEP, if the overhead of publishing it isn't too high. (An
>>>> alternative might be a README in the repo itself.)
>>>>
>>>> s'marks
>>>>
>>>>
>>>> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
>>>>
>>>>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>>>>
>>>>>> +1 to make the code build regardless of the user's environment /
>>>>>> locale.
>>>>>>
>>>>>> Would it be possible to enforce ASCII by default, and allow UTF-8
>>>>>> in exceptional cases? This would give us one extra layer of
>>>>>> protection against trojan sources [1]
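As an illustration of what a targeted check for this class of attack could look like, here is a hedged Python sketch that reports Unicode bidirectional control characters (the embedding, override and isolate characters U+202A to U+202E and U+2066 to U+2069). The names `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical; nothing here is part of jcheck or any existing JDK tooling.

```python
# Unicode bidirectional control characters: embeddings/overrides (U+202A-U+202E)
# and isolates (U+2066-U+2069).
BIDI_CONTROLS = frozenset(
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
)


def find_bidi_controls(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, 'U+XXXX') for each bidi control character, 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col_no, f"U+{ord(ch):04X}"))
    return hits
```

An ASCII-only rule catches these characters as a side effect; a check like this one would only matter for files that are allowed to contain non-ASCII text.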
>>>>>
>>>>> ASCII-only certainly has its advantages, yes, including protecting
>>>>> against that kind of attack.
>>>>>
>>>>> I think we need to treat the entire code base as UTF-8, e.g. in
>>>>> terms of what arguments we send to compilers.
>>>>>
>>>>> With that said, we could extend jcheck to separately check if a
>>>>> file contains non-ASCII characters, and block such changes from
>>>>> being pushed.
>>>>>
>>>>> The question then becomes: how do we handle exceptions? By having a
>>>>> global "allow-list" containing filenames for files that are
>>>>> acceptable to have non-ASCII characters? By requiring them to have
>>>>> a certain name pattern? By inserting some kind of meta-data
>>>>> character sequence in them that marks them as non-ASCII?
>>>>>
>>>>> These are the only options I can think of, and none of them sound
>>>>> attractive to me.
>>>>>
>>>>> A better approach, I think, is to have some kind of jcheck
>>>>> "warning" (not a blocker for integration) that warns you that you
>>>>> have non-ASCII characters in the code you are about to check in.
>>>>> That will, hopefully, be enough to fix unintentional introduction
>>>>> of e.g. typographic quotes, or malicious attacks (given that
>>>>> reviewers are alert for such warnings).
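The non-blocking warning described above could be sketched roughly as follows. This is a standalone Python illustration under stated assumptions, not jcheck code: the function names are invented, and it works on raw bytes so that it does not need to know the file's encoding.

```python
import sys
from pathlib import Path


def non_ascii_lines(data: bytes) -> list[int]:
    """Return 1-based numbers of lines containing any byte outside 0x00-0x7F."""
    return [
        line_no
        for line_no, line in enumerate(data.split(b"\n"), start=1)
        if not line.isascii()
    ]


def warn_non_ascii(paths: list[str]) -> int:
    """Print one warning per offending line; return the number of warnings."""
    warnings = 0
    for path in paths:
        for line_no in non_ascii_lines(Path(path).read_bytes()):
            print(f"warning: {path}:{line_no}: non-ASCII bytes", file=sys.stderr)
            warnings += 1
    return warnings


if __name__ == "__main__":
    warn_non_ascii(sys.argv[1:])
    # Exit status stays 0 on purpose: this is a warning, not a blocker.
```

Run against the files touched by a change, it prints warnings to stderr and still exits successfully, leaving the decision to the reviewer.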
>>>>>
>>>>> /Magnus
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Daniel
>>>>>>
>>>>>> [1] https://trojansource.codes/
>>>>>>
>>>>>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie
>>>>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>>>>
>>>>>> Currently, the source code in the JDK is in an ill-defined
>>>>>> encoding. There is no official declaration of the encoding used.
>>>>>> It is "mostly ASCII", but the relatively few non-ASCII characters
>>>>>> used are not well-defined. In many cases, it is latin-1, but I am
>>>>>> pretty certain other encodings are used for e.g. Asian
>>>>>> translations.
>>>>>>
>>>>>> This is creating unnecessary problems when working with the JDK
>>>>>> code base, while providing no benefit. We ended up here not by
>>>>>> choice, but by historical accident. Most recently, this issue has
>>>>>> surfaced in JDK-8301853, JDK-8301854 and JDK-8301855, but issues
>>>>>> relating to this have popped up from time to time, e.g.
>>>>>> JDK-8263028.
>>>>>>
>>>>>> As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up
>>>>>> on this by converting our code base to UTF-8.
>>>>>>
>>>>>> I have created JDK-8301971[2] with the intention of converting all
>>>>>> files to UTF-8, and updating all infrastructure to recognize this
>>>>>> fact.
>>>>>>
>>>>>> Even though 99.9% of all text in the JDK repository is ASCII only,
>>>>>> with a code base the size of the JDK there are of course many, many
>>>>>> instances that need to be checked and/or converted. I can take care
>>>>>> of the overarching issues, like updating compiler flags and
>>>>>> developing tooling to detect, and try to convert, non-ASCII files
>>>>>> based on my best guesses, but in the end, there are likely to be
>>>>>> many files which need to be verified by their respective teams, to
>>>>>> confirm that I did not assume the wrong source encoding.
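One way the "detect, then convert on a best guess" step could work is sketched below in Python. This is an illustration only, not the tooling proposed in the message: `to_utf8` and its status strings are hypothetical, and the Latin-1 fallback is just the most common case mentioned above. Files in other legacy encodings would need a different `fallback`.

```python
def to_utf8(data: bytes, fallback: str = "latin-1") -> tuple[bytes, str]:
    """Return (utf8_bytes, status); status is 'ascii', 'utf-8' or 'guessed:<fallback>'."""
    if data.isascii():
        return data, "ascii"
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
        return data, "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8. Re-encode from the guessed legacy encoding; the
        # guess can be wrong, so callers should queue the file for human review.
        return data.decode(fallback).encode("utf-8"), f"guessed:{fallback}"
```

Only files with a `guessed:` status would need to go to the owning team for verification. Note that a Latin-1 byte sequence can occasionally also be valid UTF-8, so a `utf-8` status is strong evidence rather than proof.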
>>>>>>
>>>>>> So, before I go ahead and start doing this, I want to check:
>>>>>>
>>>>>> * Is everyone on board with this idea? I do assume that in 2023,
>>>>>> having UTF-8 encoding for text files is (or should be) a no-brainer,
>>>>>> but I want to verify that no one opposes this.
>>>>>>
>>>>>> * Should I open a JEP for this? On the one hand, it is likely to
>>>>>> require a non-trivial amount of work, but on the other hand, there
>>>>>> is no change visible for the end user, so it will be kind of
>>>>>> pointless to announce. For my part, I could go either way, so I'm
>>>>>> interested in hearing opinions, preferably with good rationales,
>>>>>> for one way or the other.
>>>>>>
>>>>>> /Magnus
>>>>>>
>>>>>> [1] https://openjdk.org/jeps/400
>>>>>> [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>>>>
>>>>>>