Making the source code utf-8
David Holmes
david.holmes at oracle.com
Wed Feb 8 01:32:10 UTC 2023
On 8/02/2023 4:00 am, Stuart Marks wrote:
> Hi, working toward UTF-8 generally sounds like a good idea.
>
> In particular, declaring to git and to javac that files are in UTF-8
> makes good sense, in order to reduce undefined behavior.
>
> I think the issue raised here is that we don't want to allow
> uncontrolled use of non-ASCII UTF-8, so for the most part source files
> should be ASCII. What files aren't ASCII? Certainly, localized
> properties files are not. Are there others? Figuring out why files
> aren't pure ASCII might help determine how to check and enforce some
> policy. Also, what files aren't in UTF-8? That would be interesting to
> know as well.
>
> Using name patterns to enforce ASCII on a subset of files might not be
> too bad. Files with suffixes like .java, .cpp, .hpp are clearly source
> code and should be ASCII. (I'd be interested in hearing if there are any
> exceptions to this rule.) Files with other extensions, such as
> .properties or other various metadata, could be non-ASCII.
Just to expand on the recent issues here. Limiting source files to ASCII
precludes using non-ASCII characters in comments - specifically
symbols like smart quotes, mathematical symbols, line-drawing symbols for
"ASCII art" diagrams etc. I don't have an issue with that, as I'm more
concerned about such symbols not being displayable in basic editing
environments.
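
For illustration only, the suffix-based ASCII check suggested above could be sketched roughly as follows. This is not how jcheck works; the class name AsciiCheck, the suffix list, and the warning format are all assumptions made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class AsciiCheck {
    // Hypothetical policy: files with these suffixes must be pure ASCII.
    static final List<String> ASCII_ONLY_SUFFIXES = List.of(".java", ".cpp", ".hpp");

    /** Returns the offset of the first byte outside 0x00-0x7F, or -1 if every byte is ASCII. */
    public static int firstNonAscii(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            // Java bytes are signed, so any value >= 0x80 shows up as negative.
            if (bytes[i] < 0) {
                return i;
            }
        }
        return -1;
    }

    /** Returns true if the file name falls under the (assumed) ASCII-only policy. */
    public static boolean mustBeAscii(String fileName) {
        return ASCII_ONLY_SUFFIXES.stream().anyMatch(fileName::endsWith);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            if (!mustBeAscii(arg)) {
                continue; // e.g. localized .properties files are not checked
            }
            int offset = firstNonAscii(Files.readAllBytes(Path.of(arg)));
            if (offset >= 0) {
                // Printed as a warning rather than failing, in the spirit of a non-blocking check.
                System.out.println("warning: " + arg + ": non-ASCII byte at offset " + offset);
            }
        }
    }
}
```

Working on raw bytes means the check does not depend on how the file is decoded, so it gives the same answer whether the file is latin-1 or UTF-8.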
Cheers,
David
-----
> I guess we should be clear on whether this affects "all files in the
> repository" or "source code". I probably haven't been as precise as I
> should have been. It's one thing to declare to javac that source code is
> in UTF-8; this clearly affects only Java source files. I don't know the
> impact of making such a declaration to git, though, as that would apply
> to all files in the repo.
>
> As for a JEP, there is precedent for JEPs to cover things about the code
> base itself, such as the modular source layout (JEP 201) and Git/GitHub
> migration (JEP 357 and JEP 369). Once things settle out, it would be
> good to document a policy for what files must be in what encodings, and
> how this is enforced. This could be an informational JEP, if the
> overhead of publishing it isn't too high. (An alternative might be a
> README in the repo itself.)
>
> s'marks
>
>
> On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
>
>> On 2023-02-07 14:07, Daniel Jeliński wrote:
>>
>>> +1 to make the code build regardless of the user's environment / locale.
>>>
>>> Would it be possible to enforce ASCII by default, and allow UTF-8 in
>>> exceptional cases? This would give us one extra layer of protection
>>> against trojan sources [1]
>>
>> ASCII-only certainly has its advantages, yes, including protecting
>> against that kind of attack.
>>
>> I think we need to treat the entire code base as UTF-8, e.g. in terms
>> of what arguments we send to compilers.
>>
>> With that said, we could extend jcheck to separately check if a file
>> contains non-ASCII characters, and block such changes from being pushed.
>>
>> The question then becomes: how do we handle exceptions? By having a
>> global "allow-list" containing filenames for files that are acceptable
>> to have non-ASCII characters? By requiring them to have a certain name
>> pattern? By inserting some kind of meta-data character sequence in
>> them that marks them as non-ASCII?
>>
>> These are the only options I can think of, and none of them sound
>> attractive to me.
>>
>> A better approach, I think, is to have some kind of jcheck "warning"
>> (not a blocker for integration) that warns you that you have non-ASCII
>> characters in the code you are about to check in. That will,
>> hopefully, be enough to catch unintentional introduction of e.g.
>> typographic quotes, or malicious attacks (provided that reviewers are
>> alert to such warnings).
>>
>> /Magnus
>>
>>
>>
>>>
>>> Regards,
>>> Daniel
>>>
>>> [1] https://trojansource.codes/
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie
>>> <magnus.ihse.bursie at oracle.com> wrote:
>>>
>>> Currently, the source code in the JDK is in an ill-defined encoding.
>>> There is no official declaration of the encoding used. It is "mostly
>>> ASCII", but the relatively few non-ASCII characters used are not
>>> well-defined. In many cases, it is latin-1, but I am pretty certain
>>> other encodings are used for e.g. Asian translations.
>>>
>>> This is creating unnecessary problems when working with the JDK code
>>> base, while providing no benefit. We ended up here not by choice, but
>>> by historical accident. Most recently, this issue has surfaced in
>>> JDK-8301853, JDK-8301854 and JDK-8301855, but issues relating to this
>>> have popped up from time to time, e.g. JDK-8263028.
>>>
>>> As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up on
>>> this by converting our code base to UTF-8.
>>>
>>> I have created JDK-8301971[2] with the intention of converting all
>>> files to UTF-8, and updating all infrastructure to recognize this fact.
>>>
>>> Even though 99.9% of all text in the JDK repository is ASCII only, with
>>> a code base the size of the JDK there are of course many, many
>>> instances that need to be checked and/or converted. I can take care of
>>> the overarching issues, like updating compiler flags and developing
>>> tooling to detect, and try to convert, non-ASCII files based on my best
>>> guesses, but in the end, there are likely to be many files which need
>>> to be verified by their respective teams, to confirm that I did not
>>> assume the incorrect source encoding.
>>>
>>> So, before I go ahead and start doing this, I want to check:
>>>
>>> * Is everyone on board with this idea? I do assume that in 2023, having
>>> UTF-8 encoding for text files is (or should be) a no-brainer, but I
>>> want to verify that no one opposes this.
>>>
>>> * Should I open a JEP for this? On the one hand, it is likely to
>>> require a non-trivial amount of work, but on the other hand, there is
>>> no change visible to the end user, so it would be kind of pointless to
>>> announce. For my part, I could go either way, so I'm interested in
>>> hearing opinions, preferably with good rationales, for one way or the
>>> other.
>>>
>>> /Magnus
>>>
>>> [1] https://openjdk.org/jeps/400
>>> [2] https://bugs.openjdk.org/browse/JDK-8301971
>>>
>>>
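
On the detect-and-convert tooling mentioned in the original proposal: a minimal sketch, assuming a file has already been identified as latin-1, might look like the following. The class name Utf8Check and both method names are hypothetical, and the latin-1 assumption is exactly the kind of guess that the owning team would still need to verify.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes decode as UTF-8 with no malformed or unmappable sequences. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Re-encodes bytes as UTF-8, ASSUMING the input is ISO-8859-1 (latin-1). */
    public static byte[] latin1ToUtf8(byte[] bytes) {
        // Every byte sequence is valid latin-1, so this never fails, even when the guess is wrong.
        return new String(bytes, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }
}
```

A file that is pure ASCII is already valid UTF-8, so only files that fail isValidUtf8 would need conversion. Because latin1ToUtf8 cannot detect a wrong guess, its output would still need a human check.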
More information about the jdk-dev
mailing list