<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Hi, working toward UTF-8 generally sounds like a good idea.</p>

    <p>In particular, declaring to git and to javac that files are in

      UTF-8 makes good sense, in order to reduce undefined behavior.</p>

    <p>I think the issue raised here is that we don't want to allow

      uncontrolled use of non-ASCII UTF-8, so for the most part source

      files should be ASCII. What files aren't ASCII? Certainly,

      localized properties files are not. Are there others? Figuring out

      why files aren't pure ASCII might help determine how to check and

      enforce some policy. Also, what files aren't in UTF-8? That would

      be interesting to know as well.</p>
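
    <p>As a rough sketch of how one might answer those two questions
      for a given file, strict decoding can sort file contents into pure
      ASCII, valid non-ASCII UTF-8, or something else. The class and
      method names here are made up for illustration and are not part of
      any existing JDK tooling.</p>

    <pre>
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
final class EncodingProbe {
    enum Kind { ASCII, UTF8, OTHER }

    // True if the bytes decode under the given charset with no malformed
    // or unmappable input (errors are reported, never silently replaced).
    static boolean decodes(byte[] content, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // ASCII is checked first because every ASCII file is also valid UTF-8.
    static Kind classify(byte[] content) {
        if (decodes(content, StandardCharsets.US_ASCII)) return Kind.ASCII;
        if (decodes(content, StandardCharsets.UTF_8)) return Kind.UTF8;
        return Kind.OTHER;
    }

    // Prints one classification per file path given on the command line.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] content = Files.readAllBytes(Paths.get(arg));
            System.out.println(classify(content) + "  " + arg);
        }
    }
}
</pre>

    <p>Anything reported as OTHER would need a human to determine the
      actual encoding. A file reported as UTF8 is very likely, though
      not provably, UTF-8, since some byte sequences in other encodings
      also happen to be valid UTF-8.</p>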

    <p>Using name patterns to enforce ASCII on a subset of files might

      not be too bad. Files with suffixes like .java, .cpp, .hpp are

      clearly source code and should be ASCII. (I'd be interested in

      hearing if there are any exceptions to this rule.) Files with

      other extensions, such as .properties or various other metadata files,

      could be non-ASCII.</p>
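
    <p>To make the name-pattern idea concrete, here is a minimal sketch
      of such a check. The suffix list is just the three examples above,
      and the class and method names are hypothetical; a real policy
      would presumably be longer and live in whatever tool enforces it.</p>

    <pre>
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical policy check, for illustration only.
final class AsciiPolicy {
    // Assumed list, taken from the examples in the text.
    private static final String[] ASCII_ONLY_SUFFIXES = { ".java", ".cpp", ".hpp" };

    // True if the file name matches one of the ASCII-only suffixes.
    static boolean mustBeAscii(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String suffix : ASCII_ONLY_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Latin-1 maps every byte to the char with the same value, so the
    // ASCII encoder accepts the text only if every byte is 0x7F or below.
    static boolean isAscii(byte[] content) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // A violation is a file that must be ASCII but is not.
    static boolean violates(String fileName, byte[] content) {
        return mustBeAscii(fileName) ? !isAscii(content) : false;
    }
}
</pre>

    <p>With this sketch, a .java file containing an accented character
      would be flagged, while a .properties file with the same content
      would pass.</p>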

    <p>I guess we should be clear on whether this affects "all files in

      the repository" or "source code". I probably haven't been as

      precise as I should have been. It's one thing to declare to javac

      that source code is in UTF-8; this clearly affects only Java

      source files. I don't know the impact of making such a declaration

      to git, though, as that would apply to all files in the repo.<br>

    </p>

    <p>As for a JEP, there is precedent for JEPs to cover things about

      the code base itself, such as the modular source layout (JEP 201)

      and Git/GitHub migration (JEP 357 and JEP 369). Once things settle

      out, it would be good to document a policy for what files must be

      in what encodings, and how this is enforced. This could be an

      informational JEP, if the overhead of publishing it isn't too

      high. (An alternative might be a README in the repo itself.)</p>

    <p>s'marks<br>

    </p>

    <br>

    <p>On 2/7/23 6:15 AM, Magnus Ihse Bursie wrote:</p>

    <blockquote type="cite" cite="mid:4cce809f-6ffc-d0e1-0fc0-46c113dd598c@oracle.com">

      

      <p>On 2023-02-07 14:07, Daniel Jeliński wrote:<br>

      </p>

      <blockquote type="cite" cite="mid:CAMrH03KDwJ+9vbhQ8EQG6nkuYcuCyK0BqB7AOmtmAaCAQL716Q@mail.gmail.com">

        <div dir="ltr">+1 to make the code build regardless of the

          user's environment / locale.<br>

          <div><br>

          </div>

          <div>Would it be possible to enforce ASCII by default, and

            allow UTF-8 in exceptional cases? This would give us one

            extra layer of protection against trojan sources [1]</div>

        </div>

      </blockquote>

      <p>ASCII-only certainly has its advantages, yes, including

        protecting against that kind of attack. <br>

      </p>

      <p>I think we need to treat the entire code base as UTF-8, e.g. in

        terms of what arguments we send to compilers. <br>

      </p>

      <p>With that said, we could extend jcheck to separately check if a

        file contains non-ASCII characters, and prevent such changes from

        being pushed. <br>

      </p>

      <p>The question then becomes: how do we handle exceptions? By

        having a global "allow-list" containing filenames for files that

        are acceptable to have non-ASCII characters? By requiring them

        to have a certain name pattern? By inserting some kind of

        meta-data character sequence in them that marks them as

        non-ASCII?</p>

      <p>These are the only options I can think of, and none of them

        sound attractive to me.<br>

      </p>

      <p>A better approach, I think, is to have some kind of jcheck

        "warning" (not a blocker for integration) that warns you that

        you have non-ASCII characters in the code you are about to check

        in. That will, hopefully, be enough to catch the unintentional

        introduction of e.g. typographic quotes, or malicious attacks

        (provided that reviewers are alert to such warnings).</p>
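
      <p>A minimal sketch of what such a non-blocking warning could
        look like, assuming the file has already been decoded to text.
        This is not jcheck's actual API; the class name, method name and
        message format are invented for illustration.</p>

      <pre>
// Hypothetical reporter, for illustration only.
final class NonAsciiWarning {
    // Returns one warning line per non-ASCII code point, with a 1-based
    // line and column. An empty string means the text is pure ASCII.
    static String report(String fileName, String text) {
        StringBuilder out = new StringBuilder();
        int lineNo = 0;
        for (String line : text.split("\n", -1)) {
            lineNo++;
            int col = 0;
            for (int cp : line.codePoints().toArray()) {
                col++;
                // BASIC_LATIN is exactly the ASCII range U+0000 to U+007F.
                if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                    out.append(String.format(
                            "warning: %s:%d:%d: non-ASCII character U+%04X%n",
                            fileName, lineNo, col, cp));
                }
            }
        }
        return out.toString();
    }
}
</pre>

      <p>Because the report names the code point, a reviewer can tell a
        stray typographic quote (U+201C) apart from something more
        suspicious, such as a bidirectional override (U+202E).</p>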

      <p>/Magnus<br>

      </p>

      <p><br>

      </p>

      <p><br>

      </p>

      <blockquote type="cite" cite="mid:CAMrH03KDwJ+9vbhQ8EQG6nkuYcuCyK0BqB7AOmtmAaCAQL716Q@mail.gmail.com">

        <div dir="ltr">

          <div><br>

          </div>

          <div>Regards,</div>

          <div>Daniel</div>

          <div><br>

          </div>

          <div>[1] <a href="https://trojansource.codes/" moz-do-not-send="true" class="moz-txt-link-freetext">https://trojansource.codes/</a></div>

          <div><br>

          </div>


        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Tue, 7 Feb 2023 at

            13:28 Magnus Ihse Bursie &lt;<a href="mailto:magnus.ihse.bursie@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">magnus.ihse.bursie@oracle.com</a>&gt;

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">Currently, the source

            code in the JDK is in an ill-defined encoding. <br>

            There is no official declaration of the encoding used. It is

            "mostly <br>

            ASCII", but the relatively few non-ASCII characters used are

            not <br>

            well-defined. In many cases, it is latin-1, but I am pretty

            certain <br>

            other encodings are used for e.g. Asian translations.<br>

            <br>

            This is creating unnecessary problems when working with

            the JDK code <br>

            base, while providing no benefit. We ended up here not by

            choice, but by <br>

            historical accident. Most recently, this issue has surfaced

            in <br>

            JDK-8301853, JDK-8301854 and JDK-8301855, but issues

            relating to this have popped up <br>

            from time to time, e.g. JDK-8263028.<br>

            <br>

            As JEP 400[1] confirms, UTF-8 is the way to go. We should

            follow up on <br>

            this by converting our code base to UTF-8.<br>

            <br>

            I have created JDK-8301971[2] with the intention of

            converting all files <br>

            to UTF-8, and updating all infrastructure to recognize this

            fact.<br>

            <br>

            Even though 99.9% of all text in the JDK repository is ASCII

            only, with <br>

            a code base the size of the JDK there are of course many,

            many instances <br>

            that need to be checked and/or converted. I can take care

            of the <br>

            overarching issues, like updating compiler flags and developing

            tooling to <br>

            detect, and try to convert non-ASCII files based on my best

            guesses, but <br>

            in the end, there are likely to be many files which need to

            be verified <br>

            by their respective teams, to confirm that I did not assume an

            incorrect source <br>

            encoding.<br>

            <br>

            So, before I go ahead and start doing this, I want to check:<br>

            <br>

            * Is everyone on board with this idea? I do assume that in

            2023, having <br>

            UTF-8 encoding for text files is (or should be) a

            no-brainer, but I want <br>

            to verify that there is no-one opposing this.<br>

            <br>

            * Should I open a JEP for this? On the one hand, it is

            likely to require <br>

            a non-trivial amount of work, but on the other hand, there

            is no change <br>

            visible for the end user, so it will be kind of pointless to

            announce. <br>

            For my part, I could go either way, so I'm interested in

            hearing <br>

            opinions, preferably with good rationales, for one way or

            the other.<br>

            <br>

            /Magnus<br>

            <br>

            [1] <a href="https://openjdk.org/jeps/400" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://openjdk.org/jeps/400</a><br>

            [2] <a href="https://bugs.openjdk.org/browse/JDK-8301971" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8301971</a><br>

            <br>

            <br>

          </blockquote>

        </div>

      </blockquote>

    </blockquote>

  </body>

</html>