String encodeUTF8 latin1 with negatives

Mon Jul 28 23:47:12 UTC 2025

Chen,

Thanks for the clarification. The StringCoding methods are a bit light on
documentation, but presumably because of the package scope.

The specific method I am proposing a change to is a private method in
String.

Thanks,
Brett

On Mon, Jul 28, 2025 at 5:26 PM Chen Liang <chen.l.liang at oracle.com> wrote:

> Hi Brett, you are right.
>
> In the scenario Roger described, the current code would already be unsafe
> because the hasNegatives check and the clone copy are two distinct reads to
> each array element, between which a mutation can happen.
>
> Roger is mainly concerned because byte[] with LATIN1 or UTF16 data exists
> not only in String, but also in StringBuilder, which has the problems Roger
> described.
>
> Currently, these methods in the call hierarchy do not make it clear that
> they are not tolerant of mutable arrays.
>
> It may be reasonable to mark the encoding methods as so with comments, or
> more simply, by converting them to private instance methods that always
> operate on String instances, making sure no invalid combinations of
> value-coder can ever pass through.
>
> Chen
> ------------------------------
> *From:* core-libs-dev <core-libs-dev-retn at openjdk.org> on behalf of Brett
> Okken <brett.okken.os at gmail.com>
> *Sent:* Monday, July 28, 2025 4:59 PM
> *To:* Roger Riggs <roger.riggs at oracle.com>
> *Cc:* core-libs-dev at openjdk.org <core-libs-dev at openjdk.org>
> *Subject:* Re: String encodeUTF8 latin1 with negatives
>
> Roger,
>
> For a String, the byte[] val is immutable, right?
> And even the current behavior of checking for negatives and then cloning
> would not be safe in the face of a concurrent modification, right?
> Is there something else going on here which I am missing?
>
> Thanks,
> Brett
>
> On Mon, Jul 28, 2025 at 3:19 PM Roger Riggs <roger.riggs at oracle.com>
> wrote:
>
> Hi Brett,
>
> Extra care is needed if the input array might be modified concurrently
> with the method execution.
> When control flow decisions are made based on array contents, the
> integrity of the result depends on reading each byte of the array exactly
> once.
>
> Regards, Roger
>
>
>
> On 7/27/25 4:45 PM, Brett Okken wrote:
>
> In String.encodeUTF8, when the coder is latin1, there is a call to
> StringCoding.hasNegatives to determine if any special handling is needed.
> If not, a clone of the val is returned.
> If there are negative values, it then loops, from the beginning, through
> all the values to handle any individual negative values.
>
> Would it be better to call StringCoding.countPositives? If the result
> equals the length, the clone can still be returned. But if it does not, all
> the values which are positive can be simply copied to the target byte[] and
> only values beyond that point need to be checked again.
>
>
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300
>
>         if (!StringCoding.hasNegatives(val, 0, val.length)) {
>             return val.clone();
>         }
>
>         int dp = 0;
>         byte[] dst = StringUTF16.newBytesFor(val.length);
>         for (byte c : val) {
>             if (c < 0) {
>                 dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
>                 dst[dp++] = (byte) (0x80 | (c & 0x3f));
>             } else {
>                 dst[dp++] = c;
>             }
>         }
>
>
> Can be changed to look like:
>
>         int positives = StringCoding.countPositives(val, 0, val.length);
>         if (positives == val.length) {
>             return val.clone();
>         }
>
>         int dp = positives;
>         byte[] dst = StringUTF16.newBytesFor(val.length);
>         if (positives > 0) {
>             System.arraycopy(val, 0, dst, 0, positives);
>         }
>         for (int i=dp; i<val.length; ++i) {
>             byte c = val[i];
>             if (c < 0) {
>                 dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
>                 dst[dp++] = (byte) (0x80 | (c & 0x3f));
>             } else {
>                 dst[dp++] = c;
>             }
>         }
>
>
>
> I have done a bit of testing with the StringEncode jmh benchmark on my
> local windows device.
>
> encodeLatin1LongEnd speeds up significantly (~70%)
> encodeLatin1LongStart slows down (~20%)
> encodeLatin1Mixed speeds up by ~30%
>
> The remaining tests do not show much difference either way.
>
> Brett
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20250728/bfde51d0/attachment.htm>