large longs to string

Wed Apr 24 23:42:44 UTC 2024

Claes,

> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 int divs

Exactly. I think there is also some benefit from unrolling the 4 int
digit pair operations.

> which might not be true on all platforms, nor stay true in the future

Agreed, but I am not sure how to reason about that.

> Long values will in practice likely be biased towards lower values, so it’s important that any optimization to .. longer values doesn’t regress inputs in the int range. Since it’s all guarded by a test that is already there there shouldn’t be much room for a difference, but adding code can cause interesting issues so it’s always worth measuring to make sure. Have you run any benchmark for inputs smaller than the threshold? And for a healthy mix of values?

I had been focused on "longer" values, as I have uses which are
effectively guaranteed to have some bits from 53-63 set.
I added some tests for "int" values as well as a mix with 2 "longs"
and 3 "ints". This showed a pretty small regression for the int and
mixed case. That regression went away by changing back to a while loop
comparing to Integer.MIN_VALUE, and that did not give up much of the
gain.

private static final long[] ints = new long[] {Integer.MAX_VALUE,
12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L,
235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE,
654321, 9876543210L, 543};

Benchmark                         (type)  Mode  Cnt   Score    Error  Units
LongToStringBenchmark.baseline       int  avgt    3  24.105 ┬▒ 11.751  ns/op
LongToStringBenchmark.baseline      long  avgt    3  51.223 ┬▒ 21.069  ns/op
LongToStringBenchmark.baseline       mix  avgt    3  34.849 ┬▒  7.270  ns/op
LongToStringBenchmark.change         int  avgt    3  25.846 ┬▒  2.012  ns/op
LongToStringBenchmark.change        long  avgt    3  43.053 ┬▒ 13.886  ns/op
LongToStringBenchmark.change         mix  avgt    3  35.826 ┬▒  2.901  ns/op
LongToStringBenchmark.changeLoop     int  avgt    3  24.261 ┬▒  7.325  ns/op
LongToStringBenchmark.changeLoop    long  avgt    3  44.254 ┬▒ 22.921  ns/op
LongToStringBenchmark.changeLoop     mix  avgt    3  29.501 ┬▒  8.693  ns/op

"change" is what I had proposed originally and "changeLoop" is:

while (i <= Integer.MIN_VALUE) {
  long q = i / 100_000_000;
  charPos -= 8;
  write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
  v = q;
}

Brett

On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad
<claes.redestad at oracle.com> wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 int divs, which might not be true on all platforms, nor stay true in the future. Long values will in practice likely be biased towards lower values, so it’s important that any optimization to .. longer values doesn’t regress inputs in the int range. Since it’s all guarded by a test that is already there there shouldn’t be much room for a difference, but adding code can cause interesting issues so it’s always worth measuring to make sure. Have you run any benchmark for inputs smaller than the threshold? And for a healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl. 21:08 skrev Brett Okken <brett.okken.os at gmail.com>:
>
> Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for large (larger than int) long values[1]?
> We can change this to work with 8 digits at a time, which reduces the amount of 64 bit arithmetic required.
>
> if (i <= -1_000_000_000) {
> long q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> if (i <= -1_000_000_000) {
> q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> }
> }
>
>
> A simple implementation of write4DigitPairs would just call the existing writeDigitPair method 4 times:
>
>
> private static void write4DigitPairs(byte[] buf, int idx, int value) {
> int v = value;
> int v2 = v / 100;
> writeDigitPair(buf, idx + 6, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 4, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 2, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx, v - (v2 * 100));
> }
>
> There is the option to OR the 4 short values together into a long and leverage a ByteArrayLittleEndian.setLong call, but I see that the previous usage of ByteArrayLittleEndian.setShort was removed[2].
>
> A small benchmark of longs which would qualify shows up to 20% improvement.
>
> Presumably a similar change could make sense for StringUTF16, but I have not spent any time benchmarking it.
>
> Brett
>
> [1] - https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
> [2] - https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
>
>