<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>See also this:<br>

      <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Unicode_equivalence">https://en.wikipedia.org/wiki/Unicode_equivalence</a></p>

    <p>-- Jon</p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 3/5/23 3:12 PM, Archie Cobbs wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CANSoFxv+uqaMuZUCKjnFzZ_vWw5XaUAKjcT9NrjR3KFw8-_4xA@mail.gmail.com">

      

      <div dir="ltr">

        <div>Hi Jon,</div>

        <div><br>

        </div>

        <div>Thanks for taking a look at the patch.<br>

        </div>

        <div dir="ltr"><br>

        </div>

        On Fri, Mar 3, 2023 at 5:07 PM Jonathan Gibbons &lt;<a href="mailto:jonathan.gibbons@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">jonathan.gibbons@oracle.com</a>&gt;

        wrote:

        <div class="gmail_quote">

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            <div>I would give you inline code comments, except that it's

              not a PR yet.  I note that I generally distrust the

              `getMessage` for any exception for which the message is

              not formally specified in some way ... in other words,

              don't assume that `e.getMessage()` by itself is

              interesting. </div>

          </blockquote>

          <div><br>

          </div>

          <div> That makes sense, and is easy to fix - thanks for the

            suggestion.<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            <div>

              <p>Is it possible to write a test for the bug fix in

                PoolReader?   What is an example of a name encoded in

                two different ways?</p>

            </div>

          </blockquote>

          <div>In any multi-byte UTF-8 sequence, every byte after the
            first is supposed to have the binary form <span style="font-family:monospace">10xxxxxx</span>. But the
            code is not checking that, so, e.g., you could have <span style="font-family:monospace">11xxxxxx</span> instead
            and it would decode to the same character but not match
            byte-for-byte. For example, è = <span style="font-family:monospace">c3 a8</span>, but <span style="font-family:monospace">Convert.java</span> would
            also accept <span style="font-family:monospace">c3 e8</span>
            or <span style="font-family:monospace">c3 28</span> for
            "è".</div>
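          <div>A minimal sketch of the behavior described above. The class
            and the methods <span style="font-family:monospace">lenientDecode2</span> and
            <span style="font-family:monospace">strictDecode2</span> are hypothetical
            illustrations, not the actual <span style="font-family:monospace">Convert.java</span>
            code; the sketch only assumes that a lenient decoder keeps the low
            6 bits of the second byte without checking its top two bits.</div>
<pre>
```java
public class LenientUtf8Demo {
    // Hypothetical lenient decoder for a 2-byte sequence: it keeps only the
    // low 6 bits of the second byte and never checks that its top bits are 10.
    static char lenientDecode2(int b1, int b2) {
        return (char) (((b1 & 0x1f) << 6) | (b2 & 0x3f));
    }

    // Strict variant: rejects a second byte that is not of the form 10xxxxxx.
    static char strictDecode2(int b1, int b2) {
        if ((b2 & 0xc0) != 0x80) {
            throw new IllegalArgumentException("bad continuation byte");
        }
        return lenientDecode2(b1, b2);
    }

    public static void main(String[] args) {
        // The three second bytes from the example: 10101000, 11101000, 00101000.
        int[] seconds = {0xa8, 0xe8, 0x28};
        for (int b2 : seconds) {
            System.out.printf("c3 %02x -> U+%04X%n", b2, (int) lenientDecode2(0xc3, b2));
        }
    }
}
```
</pre>
          <div>All three sequences come out as U+00E8 under the lenient
            decoder, while the strict variant accepts only
            <span style="font-family:monospace">c3 a8</span>.</div>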

          <div><br>

          </div>

          <div>Because the Name hash tables store UTF-8 byte sequences,

            if the same Name were encoded two different ways, it would

            get added to the hash table twice.</div>
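          <div>To illustrate why the duplicate entry appears, here is a
            small sketch that compares the two byte sequences the way a table
            keyed on raw bytes would. <span style="font-family:monospace">NameKeyDemo</span>
            and <span style="font-family:monospace">sameKey</span> are hypothetical
            stand-ins, not the real Name table implementation.</div>
<pre>
```java
import java.util.Arrays;

public class NameKeyDemo {
    // Two byte sequences that a lenient decoder would both read as "è".
    static final byte[] CANONICAL = {(byte) 0xc3, (byte) 0xa8};
    static final byte[] MALFORMED = {(byte) 0xc3, (byte) 0xe8};

    // Stand-in for a lookup keyed on raw bytes: keys match only if every
    // byte matches, regardless of which character the bytes decode to.
    static boolean sameKey(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        System.out.println("same key:  " + sameKey(CANONICAL, MALFORMED));
        System.out.println("same hash: "
                + (Arrays.hashCode(CANONICAL) == Arrays.hashCode(MALFORMED)));
    }
}
```
</pre>
          <div>Both comparisons report a mismatch, so a byte-keyed table
            would treat the two spellings of the same name as distinct
            entries.</div>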

          <div><br>

          </div>

          <div>Another way this can happen is e.g. encoding a character

            as a 3-byte sequence when the character is actually small

            enough to fit in a 2-byte sequence. For example, <span style="font-family:monospace">e0 84 80</span> encodes

            character <span style="font-family:monospace">0x0100</span>,

            but it should really be encoded as <span style="font-family:monospace">c4 80</span>.<br>

          </div>
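          <div>The arithmetic behind this example can be sketched as
            follows. The decoders below are hypothetical and deliberately
            skip validation; <span style="font-family:monospace">isOverlong3</span>
            shows one possible check, assuming the rule that a 3-byte sequence
            must encode a value of at least
            <span style="font-family:monospace">0x0800</span>.</div>
<pre>
```java
public class OverlongUtf8Demo {
    // Hypothetical lenient 3-byte decoder (1110xxxx 10xxxxxx 10xxxxxx):
    // it concatenates the payload bits and applies no minimum-value check.
    static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f);
    }

    // Hypothetical lenient 2-byte decoder (110xxxxx 10xxxxxx).
    static int decode2(int b1, int b2) {
        return ((b1 & 0x1f) << 6) | (b2 & 0x3f);
    }

    // A 3-byte sequence is overlong when its value would fit in 2 bytes,
    // i.e. when the decoded value is below 0x0800.
    static boolean isOverlong3(int b1, int b2, int b3) {
        return decode3(b1, b2, b3) < 0x0800;
    }

    public static void main(String[] args) {
        System.out.printf("e0 84 80 -> U+%04X%n", decode3(0xe0, 0x84, 0x80));
        System.out.printf("c4 80    -> U+%04X%n", decode2(0xc4, 0x80));
        System.out.println("overlong: " + isOverlong3(0xe0, 0x84, 0x80));
    }
}
```
</pre>
          <div>Both sequences decode to U+0100, and the check flags the
            3-byte form as overlong.</div>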

          <div><br>

          </div>

          <div>Thinking more about this, I think I should create a

            separate bug and patch for this particular problem. So,

            expect a digression on that next...<br>

          </div>

          <div><br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            <div>

              <p>Although conceptually simple, this is a significant

                change for a very low level data type. It would be worth

                doing more testing than just the usual langtools tests. 

                For example, if you build JDK before and after this

                change, are the generated class files the same?</p>

            </div>

          </blockquote>

          <div>Definitely a test worth doing.<br>

          </div>

          <div><br>

          </div>

          <div>-Archie<br>

          </div>

        </div>

        <br>

        -- <br>

        <div dir="ltr">Archie L. Cobbs<br>

        </div>

      </div>

    </blockquote>

  </body>

</html>