RFR: 8335480: Only deoptimize threads if needed when closing shared arena [v3]
Uwe Schindler
uschindler at openjdk.org
Mon Jul 15 16:37:53 UTC 2024
On Mon, 15 Jul 2024 16:30:11 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
>>> > > Even with `arrayElementVarHandle` it's about the same
>>> >
>>> >
>>> > This is very odd, and I don't have a good explanation as to why that is the case. What does the baseline (confined arena) look like for `arrayElementVarHandle` ?
>>>
>>> Pretty much exactly the same
>>
>> So, that means that `arrayElementVarHandle` is ~4x faster than memory segment? Isn't that a bit odd?
>
>> So, that means that `arrayElementVarHandle` is ~4x faster than memory segment? Isn't that a bit odd?
>
> I did some more analysis of the benchmark. I first eliminated the closing thread, and started with two simple benchmarks:
>
>
> @Benchmark
> public int memorySegmentAccess() {
>     int sum = 0;
>     for (int i = 0; i < segment.byteSize(); i++) {
>         sum += segment.get(JAVA_BYTE, i);
>     }
>     return sum;
> }
>
>
> and
>
>
> @Benchmark
> public int otherAccess() {
>     int sum = 0;
>     for (int i = 0; i < array.length; i++) {
>         sum += (byte)BYTE_HANDLE.get(array, i);
>     }
>     return sum;
> }
>
>
> where the setup code is as follows:
>
>
> static final int SIZE = 10_000;
>
> MemorySegment segment;
> byte[] array;
>
> static final VarHandle BYTE_HANDLE = MethodHandles.arrayElementVarHandle(byte[].class);
>
> @Setup
> public void setup() {
>     array = new byte[SIZE];
>     segment = MemorySegment.ofArray(array);
> }
>
>
> With this, I obtained the following results:
>
>
> Benchmark                            Mode  Cnt   Score   Error  Units
> ConcurrentClose.memorySegmentAccess  avgt   10  13.879 ± 0.478  us/op
> ConcurrentClose.otherAccess          avgt   10   2.256 ± 0.017  us/op
>
>
> Ugh. It seems like C2 "blows up" at the third iteration:
>
>
> # Run progress: 0.00% complete, ETA 00:05:00
> # Fork: 1 of 1
> # Warmup Iteration 1: 6.712 us/op
> # Warmup Iteration 2: 5.756 us/op
> # Warmup Iteration 3: 13.267 us/op
> # Warmup Iteration 4: 13.267 us/op
> # Warmup Iteration 5: 13.274 us/op
>
>
> This might be a bug/regression. But, let's move on. I then tweaked the induction variable of the memory segment loop to be `long`, not `int`, and I got:
>
>
> Benchmark                            Mode  Cnt  Score   Error  Units
> ConcurrentClose.memorySegmentAccess  avgt   10  2.764 ± 0.016  us/op
> ConcurrentClose.otherAccess          avgt   10  2.240 ± 0.016  us/op
>
>
> Far more respectable! And now we have a good baseline: since both workloads take about the same time, we can use them to draw interesting comparisons. So, let's add back a thread that does a shared arena close:
>
>
> Benchmark                                        Mode  Cnt   Score   Error  Units
> ConcurrentClose.sharedClose                      avgt   10  12.001 ± 0.061  us/op
> ConcurrentClose.sharedClose:closing              avgt   10  19.281 ± 0.323  us/op
> ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   9.802 ± 0.314  us/op
> ConcurrentClose.sharedClose:otherAccess          avgt   1...
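For reference, the `long` induction variable tweak described above presumably amounts to something like the sketch below (the method name is illustrative, not taken from the actual benchmark); the only difference to `memorySegmentAccess` is that the loop index matches the `long` returned by `segment.byteSize()`:

@Benchmark
public int memorySegmentAccessLongIndex() {
    int sum = 0;
    // long induction variable: no int/long mixing with segment.byteSize()
    for (long i = 0; i < segment.byteSize(); i++) {
        sum += segment.get(JAVA_BYTE, i);
    }
    return sum;
}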
Thanks @mcimadamore, this sounds great! I am so happy that we at least reduced the overhead for threads that do not access memory segments. This will also be the case for Lucene/Solr, because we do not read from segments all the time; other code is often executed between reads from memory segments :-)
So +1 to merge this, and hopefully backport it at least to 21? A backport would be great, but as this is not a bug fix, it is not strictly necessary.
We should open an issue for the `int` induction variable problem and work on JDK-8290892.
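For anyone who wants to reproduce numbers in the spirit of the quoted results, a hypothetical, self-contained sketch of such a JMH group benchmark could look as follows. This is an assumption about the benchmark's shape, not the actual ConcurrentClose source from the PR; the class name, thread counts and the allocate/close loop are illustrative:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import static java.lang.foreign.ValueLayout.JAVA_BYTE;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Group)
public class ConcurrentCloseSketch {

    static final int SIZE = 10_000;

    MemorySegment segment;

    @Setup
    public void setup() {
        // heap segment that is not owned by the shared arenas being closed
        segment = MemorySegment.ofArray(new byte[SIZE]);
    }

    // One thread repeatedly creates and closes shared arenas. Before this PR,
    // each close could affect all Java threads, not only those actually
    // accessing segments of the closed arena.
    @Benchmark
    @Group("sharedClose")
    @GroupThreads(1)
    public void closing() {
        Arena arena = Arena.ofShared();
        arena.allocate(SIZE);
        arena.close();
    }

    // Another thread accesses an unrelated heap segment; the interesting
    // number is how much the concurrent closes slow this thread down.
    @Benchmark
    @Group("sharedClose")
    @GroupThreads(1)
    public int memorySegmentAccess() {
        int sum = 0;
        for (long i = 0; i < segment.byteSize(); i++) {
            sum += segment.get(JAVA_BYTE, i);
        }
        return sum;
    }
}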
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228926251