large longs to string
Brett Okken
brett.okken.os at gmail.com
Wed Apr 24 23:42:44 UTC 2024
Claes,
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 int divs
Exactly. I think there is also some benefit from unrolling the 4 int
digit pair operations.
> which might not be true on all platforms, nor stay true in the future
Agreed, but I am not sure how to reason about that.
> Long values will in practice likely be biased towards lower values, so it’s important that any optimization to .. longer values doesn’t regress inputs in the int range. Since it’s all guarded by a test that is already there there shouldn’t be much room for a difference, but adding code can cause interesting issues so it’s always worth measuring to make sure. Have you run any benchmark for inputs smaller than the threshold? And for a healthy mix of values?
I had been focused on "longer" values, as I have uses which are
effectively guaranteed to have some bits from 53-63 set.
I added some tests for "int" values as well as a mix with 2 "longs"
and 3 "ints". This showed a pretty small regression for the int and
mixed case. That regression went away by changing back to a while loop
comparing to Integer.MIN_VALUE, and that did not give up much of the
gain.
private static final long[] ints = new long[] {Integer.MAX_VALUE,
12345, -5432, 654321, 5};
private static final long[] longs = new long[] {235789987235L,
235789987235326L, -98975433216803632L, Long.MAX_VALUE};
private static final long[] mixed = new long[] {5, Long.MIN_VALUE,
654321, 9876543210L, 543};
Benchmark (type) Mode Cnt Score Error Units
LongToStringBenchmark.baseline int avgt 3 24.105 ┬▒ 11.751 ns/op
LongToStringBenchmark.baseline long avgt 3 51.223 ┬▒ 21.069 ns/op
LongToStringBenchmark.baseline mix avgt 3 34.849 ┬▒ 7.270 ns/op
LongToStringBenchmark.change int avgt 3 25.846 ┬▒ 2.012 ns/op
LongToStringBenchmark.change long avgt 3 43.053 ┬▒ 13.886 ns/op
LongToStringBenchmark.change mix avgt 3 35.826 ┬▒ 2.901 ns/op
LongToStringBenchmark.changeLoop int avgt 3 24.261 ┬▒ 7.325 ns/op
LongToStringBenchmark.changeLoop long avgt 3 44.254 ┬▒ 22.921 ns/op
LongToStringBenchmark.changeLoop mix avgt 3 29.501 ┬▒ 8.693 ns/op
"change" is what I had proposed originally and "changeLoop" is:
while (i <= Integer.MIN_VALUE) {
long q = i / 100_000_000;
charPos -= 8;
write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
v = q;
}
Brett
On Wed, Apr 24, 2024 at 2:39 PM Claes Redestad
<claes.redestad at oracle.com> wrote:
>
> Hi,
>
> IIUC this optimization leans on 4 long divs being slower than 1 long div + 4 int divs, which might not be true on all platforms, nor stay true in the future. Long values will in practice likely be biased towards lower values, so it’s important that any optimization to .. longer values doesn’t regress inputs in the int range. Since it’s all guarded by a test that is already there there shouldn’t be much room for a difference, but adding code can cause interesting issues so it’s always worth measuring to make sure. Have you run any benchmark for inputs smaller than the threshold? And for a healthy mix of values?
>
> Thanks!
> /Claes
>
> 24 apr. 2024 kl. 21:08 skrev Brett Okken <brett.okken.os at gmail.com>:
>
> Is there interest in optimizing StringLatin1.getChars(long, int, byte[]) for large (larger than int) long values[1]?
> We can change this to work with 8 digits at a time, which reduces the amount of 64 bit arithmetic required.
>
> if (i <= -1_000_000_000) {
> long q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> if (i <= -1_000_000_000) {
> q = i / 100_000_000;
> charPos -= 8;
> write4DigitPairs(buf, charPos, (int) ((q * 100_000_000) - i));
> i = q;
> }
> }
>
>
> A simple implementation of write4DigitPairs would just call the existing writeDigitPair method 4 times:
>
>
> private static void write4DigitPairs(byte[] buf, int idx, int value) {
> int v = value;
> int v2 = v / 100;
> writeDigitPair(buf, idx + 6, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 4, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx + 2, v - (v2 * 100));
> v = v2;
>
> v2 = v / 100;
> writeDigitPair(buf, idx, v - (v2 * 100));
> }
>
> There is the option to OR the 4 short values together into a long and leverage a ByteArrayLittleEndian.setLong call, but I see that the previous usage of ByteArrayLittleEndian.setShort was removed[2].
>
> A small benchmark of longs which would qualify shows up to 20% improvement.
>
> Presumably a similar change could make sense for StringUTF16, but I have not spent any time benchmarking it.
>
> Brett
>
> [1] - https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/StringLatin1.java#L163-L168
> [2] - https://github.com/openjdk/jdk/commit/913e43fea995b746fb9e1b25587d254396c7c3c9
>
>
More information about the core-libs-dev
mailing list