<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>See also this:<br>
<a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Unicode_equivalence">https://en.wikipedia.org/wiki/Unicode_equivalence</a></p>
<p>-- Jon</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 3/5/23 3:12 PM, Archie Cobbs wrote:<br>
</div>
<blockquote type="cite" cite="mid:CANSoFxv+uqaMuZUCKjnFzZ_vWw5XaUAKjcT9NrjR3KFw8-_4xA@mail.gmail.com">
<div dir="ltr">
<div>Hi Jon,</div>
<div><br>
</div>
<div>Thanks for taking a look at the patch.<br>
</div>
<div dir="ltr"><br>
</div>
On Fri, Mar 3, 2023 at 5:07 PM Jonathan Gibbons <<a href="mailto:jonathan.gibbons@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">jonathan.gibbons@oracle.com</a>>
wrote:
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>I would give you inline code comments, except that it's
not a PR yet. I note that I generally distrust the
`getMessage` for any exception for which the message is
not formally specified in some way ... in other words,
don't assume that `e.getMessage()` by itself is
interesting. </div>
</blockquote>
<div><br>
</div>
<div> That makes sense, and is easy to fix - thanks for the
suggestion.<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Is it possible to write a test for the bug fix in
PoolReader? What is an example of a name encoded in
two different ways?</p>
</div>
</blockquote>
<div>In any multi-byte UTF-8 sequence, the bytes after the
first are supposed to all look like <span style="font-family:monospace">0x10xxxxxx</span>. But the
code is not checking that, so e.g., you could have <span style="font-family:monospace">0x11xxxxxx</span> instead
and it would encode the same character but not match
byte-for-byte. For example, è = <span style="font-family:monospace">c3 a8</span>, but <span style="font-family:monospace">Convert.java</span> would
also accept <span style="font-family:monospace">c3 e8</span>
or <span style="font-family:monospace">c3 28</span> for
"è".</div>
<div><br>
</div>
<div>Because the Name hash tables store UTF-8 byte sequences,
if the same Name were encoded two different ways, it would
get added to the hash table twice.</div>
<div><br>
</div>
<div>Another way this can happen is e.g. encoding a character
as a 3-byte sequence when the character is actually small
enough to fit in a 2-byte sequence. For example, <span style="font-family:monospace">e0 84 80</span> encodes
character <span style="font-family:monospace">0x0100</span>,
but it should really be encoded as <span style="font-family:monospace">c4 80</span>.<br>
</div>
<div><br>
</div>
<div>Thinking more about this, I think I should create a
separate bug and patch for this particular problem. So,
expect a digression on that next...<br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Although conceptually simple, this is a significant
change for a very low level data type. It would be worth
doing more testing than just the usual langtools tests.
For example, if you build JDK before and after this
change, are the generated class files the same?</p>
</div>
</blockquote>
<div>Definitely a test worth doing.<br>
</div>
<div><br>
</div>
<div>-Archie<br>
</div>
</div>
<br>
-- <br>
<div dir="ltr">Archie L. Cobbs<br>
</div>
</div>
</blockquote>
</body>
</html>