<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Brett, you are right.</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
In the scenario Roger described, the current code would already be unsafe because the hasNegatives check and the clone copy are two distinct reads to each array element, between which a mutation can happen.</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Roger is mainly concerned because byte[] with LATIN1 or UTF16 data exists not only in String, but also in StringBuilder, which has the problems Roger described.</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Currently, these methods in the call hierarchy do not make it clear that they are not tolerant of mutable arrays.</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
It may be reasonable to mark the encoding methods as so with comments, or more simply, by converting them to private instance methods that always operate on String instances, making sure no invalid combinations of value-coder can ever pass through.</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: "Calibri Light", "Helvetica Light", sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Chen</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> core-libs-dev <core-libs-dev-retn@openjdk.org> on behalf of Brett Okken <brett.okken.os@gmail.com><br>
<b>Sent:</b> Monday, July 28, 2025 4:59 PM<br>
<b>To:</b> Roger Riggs <roger.riggs@oracle.com><br>
<b>Cc:</b> core-libs-dev@openjdk.org <core-libs-dev@openjdk.org><br>
<b>Subject:</b> Re: String encodeUTF8 latin1 with negatives</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Roger,</div>
<div><br>
</div>
<div>For a String, the byte[] val is immutable, right?</div>
<div>And even the current behavior of checking for negatives and then cloning would not be safe in the face of a concurrent modification, right?</div>
<div>Is there something else going on here which I am missing?</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Brett</div>
</div>
<br>
<div class="x_gmail_quote x_gmail_quote_container">
<div dir="ltr" class="x_gmail_attr">On Mon, Jul 28, 2025 at 3:19 PM Roger Riggs <<a href="mailto:roger.riggs@oracle.com">roger.riggs@oracle.com</a>> wrote:<br>
</div>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px 0.8ex; border-left:1px solid rgb(204,204,204); padding-left:1ex">
<u></u>
<div>Hi Brett,<br>
<br>
Extra care is needed if the input array might be modified concurrently with the method execution.<br>
When control flow decisions are made based on array contents, the integrity of the result depends on reading each byte of the array exactly once.<br>
<br>
Regards, Roger<br>
<br>
<br>
<br>
<div>On 7/27/25 4:45 PM, Brett Okken wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>In String.encodeUTF8, when the coder is latin1, there is a call to StringCoding.hasNegatives to determine if any special handling is needed. If not, a clone of the val is returned.</div>
<div>If there are negative values, it then loops, from the beginning, through all the values to handle any individual negative values.</div>
<div><br>
</div>
<div>Would it be better to call StringCoding.countPositives? If the result equals the length, the clone can still be returned. But if it does not, all the values which are positive can be simply copied to the target byte[] and only values beyond that point
need to be checked again.</div>
<div><br>
</div>
<div><a href="https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300" target="_blank">https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300</a></div>
<div><br>
</div>
<div> if (!StringCoding.hasNegatives(val, 0, val.length)) {<br>
return val.clone();<br>
}<br>
<br>
int dp = 0;<br>
byte[] dst = StringUTF16.newBytesFor(val.length);<br>
for (byte c : val) {<br>
if (c < 0) {<br>
dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));<br>
dst[dp++] = (byte) (0x80 | (c & 0x3f));<br>
} else {<br>
dst[dp++] = c;<br>
}<br>
}</div>
<div><br>
</div>
<div><br>
</div>
<div>Can be changed to look like:<br>
</div>
<div><br>
</div>
<div> int positives = StringCoding.countPositives(val, 0, val.length);<br>
if (positives == val.length) {<br>
return val.clone();<br>
}</div>
<div><br>
</div>
<div> <span><span>int</span> <span>dp</span> = <span>positives</span>;</span>
<br>
byte[] dst = StringUTF16.newBytesFor(val.length);</div>
<div> if (positives > 0) {<br>
System.arraycopy(val, 0, dst, 0, positives);<br>
}<br>
for (int i=dp; i<val.length; ++i) {<br>
byte c = val[i];
<div> if (c < 0) {<br>
dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));<br>
dst[dp++] = (byte) (0x80 | (c & 0x3f));<br>
} else {<br>
dst[dp++] = c;<br>
}<br>
}</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>I have done a bit of testing with the StringEncode jmh benchmark on my local windows device.</div>
<div><br>
</div>
encodeLatin1LongEnd speeds up significantly (~70%)</div>
<div>encodeLatin1LongStart slows down (~20%)</div>
<div>encodeLatin1Mixed speeds up by ~30%</div>
<div><br>
</div>
<div>The remaining tests do not show much difference either way.</div>
<div><br>
</div>
<div>Brett</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</div>
</body>
</html>