Making the source code utf-8
Stuart Marks
stuart.marks at oracle.com
Tue Feb 7 18:00:34 UTC 2023
Hi, working toward UTF-8 generally sounds like a good idea.
In particular, declaring to git and to javac that files are in UTF-8 makes good
sense, in order to reduce undefined behavior.
I think the issue raised here is that we don't want to allow uncontrolled use of
non-ASCII UTF-8, so for the most part source files should be ASCII. What files
aren't ASCII? Certainly, localized properties files are not. Are there others?
Figuring out why files aren't pure ASCII might help determine how to check and
enforce some policy. Also, what files aren't in UTF-8? That would be interesting to
know as well.
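To get a first answer to "which files aren't ASCII, and which aren't even UTF-8", a strict decode is enough. The following is only a minimal sketch, not existing JDK tooling: the class name Utf8Check and the three-way Kind classification are made up for illustration. It relies on the standard CharsetDecoder with CodingErrorAction.REPORT, so malformed input is rejected rather than silently replaced.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for surveying a source tree; not part of jcheck or the JDK build.
class Utf8Check {
    enum Kind { ASCII, UTF8, OTHER }

    // True if every byte is in the 7-bit ASCII range (0x00..0x7F).
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {   // bytes >= 0x80 are negative as a signed Java byte
                return false;
            }
        }
        return true;
    }

    // True if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // ASCII is a subset of UTF-8, so check the stricter property first.
    static Kind classify(byte[] bytes) {
        if (isAscii(bytes)) {
            return Kind.ASCII;
        }
        return isValidUtf8(bytes) ? Kind.UTF8 : Kind.OTHER;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + classify(Files.readAllBytes(Path.of(arg))));
        }
    }
}
```

A file classified as OTHER is in some legacy encoding (latin-1 or similar) and would need a human to confirm which one; the check cannot tell that by itself.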
Using name patterns to enforce ASCII on a subset of files might not be too bad.
Files with suffixes like .java, .cpp, .hpp are clearly source code and should be
ASCII. (I'd be interested in hearing if there are any exceptions to this rule.)
Files with other extensions, such as .properties or other various metadata, could be
non-ASCII.
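As a rough illustration of the name-pattern idea, here is a sketch of such a check. The class AsciiPolicy, its method names, and the exact suffix set are assumptions for the sake of the example; this is not how jcheck is implemented.

```java
import java.util.Set;

// Hypothetical policy check: files with "source code" suffixes must be pure ASCII.
class AsciiPolicy {
    // Assumed suffix list, taken from the examples in this mail; a real policy
    // would need to be agreed on and would likely be longer.
    static final Set<String> ASCII_ONLY_SUFFIXES = Set.of(".java", ".cpp", ".hpp");

    // True if the file name ends in one of the ASCII-only suffixes.
    static boolean mustBeAscii(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && ASCII_ONLY_SUFFIXES.contains(fileName.substring(dot));
    }

    // Returns the byte offset of the first non-ASCII byte in a file that must be
    // ASCII, or -1 if the file is exempt from the policy or is clean.
    static int firstViolation(String fileName, byte[] content) {
        if (!mustBeAscii(fileName)) {
            return -1;
        }
        for (int i = 0; i < content.length; i++) {
            if (content[i] < 0) {   // byte value >= 0x80
                return i;
            }
        }
        return -1;
    }
}
```

Files such as .properties fall through untouched here, which matches the idea that only clearly identifiable source files are held to ASCII.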
I guess we should be clear on whether this affects "all files in the repository" or
"source code". I probably haven't been as precise as I should have been. It's one
thing to declare to javac that source code is in UTF-8; this clearly affects only
Java source files. I don't know the impact of making such a declaration to git,
though, as that would apply to all files in the repo.
As for a JEP, there is precedent for JEPs to cover things about the code base
itself, such as the modular source layout (JEP 201) and Git/GitHub migration (JEP
357 and JEP 369). Once things settle out, it would be good to document a policy for
what files must be in what encodings, and how this is enforced. This could be an
informational JEP, if the overhead of publishing it isn't too high. (An alternative
might be a README in the repo itself.)
s'marks
On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:
> On 2023-02-07 14:07, Daniel Jeliński wrote:
>
>> +1 to make the code build regardless of the user's environment / locale.
>>
>> Would it be possible to enforce ASCII by default, and allow UTF-8 in
>> exceptional cases? This would give us one extra layer of protection against
>> trojan sources [1]
>
> ASCII-only certainly has its advantages, yes, including protecting from that kind
> of attack.
>
> I think we need to treat the entire code base as UTF-8, e.g. in terms of what
> arguments we send to compilers.
>
> With that said, we could extend jcheck to separately check if a file contains
> non-ASCII characters, and prevent such changes from being pushed.
>
> The question then becomes: how do we handle exceptions? By having a global
> "allow-list" containing filenames for files that are acceptable to have non-ASCII
> characters? By requiring them to have a certain name pattern? By inserting some
> kind of meta-data character sequence in them that marks them as non-ASCII?
>
> These are the only options I can think of, and none of them sound attractive to me.
>
> A better approach, I think, is to have some kind of jcheck "warning" (not a
> blocker for integration) that warns you that you have non-ASCII characters in the
> code you are about to check in. That will, hopefully, be enough to catch the
> unintentional introduction of e.g. typographic quotes, or malicious attacks
> (provided that reviewers are alert to such warnings).
>
> /Magnus
>
>
>
>>
>> Regards,
>> Daniel
>>
>> [1] https://trojansource.codes/
>>
>> On Tue, 7 Feb 2023 at 13:28, Magnus Ihse Bursie <magnus.ihse.bursie at oracle.com>
>> wrote:
>>
>> Currently, the source code in the JDK is in an ill-defined encoding.
>> There is no official declaration of the encoding used. It is "mostly
>> ASCII", but the relatively few non-ASCII characters used are not
>> well-defined. In many cases, it is latin-1, but I am pretty certain
>> other encodings are used for e.g. Asian translations.
>>
>> This is creating unnecessary problems when working with the JDK code
>> base, while providing no benefit. We ended up here not by choice, but by
>> historical accident. Most recently, this issue has surfaced in
>> JDK-8301853, JDK-8301854 and JDK-8301855, but issues relating to this
>> have popped up from time to time, e.g. JDK-8263028.
>>
>> As JEP 400[1] confirms, UTF-8 is the way to go. We should follow up on
>> this by converting our code base to UTF-8.
>>
>> I have created JDK-8301971[2] with the intention of converting all files
>> to UTF-8, and updating all infrastructure to recognize this fact.
>>
>> Even though 99.9% of all text in the JDK repository is ASCII only, with
>> a code base the size of the JDK there are of course many, many instances
>> that need to be checked and/or converted. I can take care of the
>> overarching issues, like updating compiler flags and developing tooling
>> to detect and try to convert non-ASCII files based on my best guesses,
>> but in the end, there are likely to be many files that need to be
>> verified by their respective teams, to confirm that I did not assume the
>> wrong source encoding.
>>
>> So, before I go ahead and start doing this, I want to check:
>>
>> * Is everyone on board with this idea? I do assume that in 2023, having
>> UTF-8 encoding for text files is (or should be) a no-brainer, but I want
>> to verify that no one opposes this.
>>
>> * Should I open a JEP for this? On the one hand, it is likely to require
>> a non-trivial amount of work, but on the other hand, there is no change
>> visible for the end user, so it will be kind of pointless to announce.
>> For my part, I could go either way, so I'm interested in hearing
>> opinions, preferably with good rationales, for one way or the other.
>>
>> /Magnus
>>
>> [1] https://openjdk.org/jeps/400
>> [2] https://bugs.openjdk.org/browse/JDK-8301971
>>
>>