Making the source code utf-8

Wed Feb 8 19:50:17 UTC 2023

On 2023-02-08 19:54, Jonathan Gibbons wrote:

> There are places in doc comments where entities need to be used
> for non-ASCII characters, such as accented letters.

That's good to know. However, I don't think HTML entities will even come
into play in this case, since they are pure ASCII. They would only be
affected if someone tried to search and replace HTML entities with Unicode
characters, and I frankly don't see any reason to do that as part of a
conversion of the code page.
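
As a rough illustration of why entities are unaffected, here is a minimal
Java sketch. The class name, method name and sample strings are made up for
this example and are not taken from the JDK sources. It only compares the
bytes a string produces under ISO-8859-1 and under UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EntityEncodingDemo {
    // Returns true if the text encodes to identical bytes in
    // ISO-8859-1 (Latin-1) and in UTF-8.
    static boolean sameBytesInBothEncodings(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Hypothetical doc comment using an HTML entity: pure ASCII.
        String withEntity = "/** Caf&eacute; */";
        // The same comment with the raw accented letter (e with acute).
        String withRawChar = "/** Caf\u00e9 */";

        System.out.println(sameBytesInBothEncodings(withEntity));
        System.out.println(sameBytesInBothEncodings(withRawChar));
    }
}
```

The entity version yields the same bytes either way, so a re-encoding would
leave it alone; only the raw accented letter changes its byte
representation.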

/Magnus

>
> -- Jon
>
> On 2/7/23 7:49 PM, Yasumasa Suenaga wrote:
>> I give big +1 to this idea, thanks Magnus!
>>
>>
>> On 2023-02-07 21:28, Magnus Ihse Bursie wrote:
>>> Currently, the source code in the JDK is in an ill-defined encoding.
>>> There is no official declaration of the encoding used. It is "mostly
>>> ASCII", but the relatively few non-ASCII characters used are not
>>> well-defined. In many cases, it is latin-1, but I am pretty certain
>>> other encodings are used for e.g. Asian translations.
>>>
>>> This is creating unnecessary problems when working with the JDK
>>> code base, while providing no benefit. We ended up here not by choice,
>>> but by historical accident. Most recently, this issue has surfaced in
>>> JDK-8301853, JDK-8301854 and JDK-8301855, but related issues have
>>> popped up from time to time, e.g. JDK-8263028.
>>>
>>> As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up on
>>> this by converting our code base to UTF-8.
>>>
>>> I have created JDK-8301971[2] with the intention of converting all
>>> files to UTF-8, and updating all infrastructure to recognize this
>>> fact.
>>>
>>> Even though 99.9% of all text in the JDK repository is ASCII only,
>>> with a code base the size of the JDK there are of course many, many
>>> instances that need to be checked and/or converted. I can take care
>>> of the overarching issues, like updating compiler flags and developing
>>> tooling to detect and try to convert non-ASCII files based on my best
>>> guesses, but in the end, there are likely to be many files which need
>>> to be verified by their respective teams, to confirm that I did not
>>> assume an incorrect source encoding.
>>>
>>> So, before I go ahead and start doing this, I want to check:
>>>
>>> * Is everyone on board with this idea? I do assume that in 2023, having
>>> UTF-8 encoding for text files is (or should be) a no-brainer, but I
>>> want to verify that no one opposes this.
>>>
>>> * Should I open a JEP for this? On the one hand, it is likely to
>>> require a non-trivial amount of work, but on the other hand, there is
>>> no change visible for the end user, so it will be kind of pointless to
>>> announce. For my part, I could go either way, so I'm interested in
>>> hearing opinions, preferably with good rationales, for one way or the
>>> other.
>>>
>>> /Magnus
>>>
>>> [1] https://openjdk.org/jeps/400
>>> [2] https://bugs.openjdk.org/browse/JDK-8301971