RFR: 8335480: Only deoptimize threads if needed when closing shared arena [v3]
Uwe Schindler
uschindler at openjdk.org
Mon Jul 15 16:37:53 UTC 2024
On Mon, 15 Jul 2024 16:30:11 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
>>> > > Even with `arrayElementVarHandle` it's about the same
>>> >
>>> >
>>> > This is very odd, and I don't have a good explanation as to why that is the case. What does the baseline (confined arena) look like for `arrayElementVarHandle` ?
>>>
>>> Pretty much exactly the same
>>
>> So, that means that `arrayElementVarHandle` is ~4x faster than memory segment? Isn't that a bit odd?
>
>> So, that means that `arrayElementVarHandle` is ~4x faster than memory segment? Isn't that a bit odd?
>
> I did some more analysis of the benchmark. I first eliminated the closing thread, and started with two simple benchmarks:
>
>
> @Benchmark
> public int memorySegmentAccess() {
>     int sum = 0;
>     for (int i = 0; i < segment.byteSize(); i++) {
>         sum += segment.get(JAVA_BYTE, i);
>     }
>     return sum;
> }
>
>
> and
>
>
> @Benchmark
> public int otherAccess() {
>     int sum = 0;
>     for (int i = 0; i < array.length; i++) {
>         sum += (byte)BYTE_HANDLE.get(array, i);
>     }
>     return sum;
> }
>
>
> where the setup code is as follows:
>
>
> static final int SIZE = 10_000;
>
> MemorySegment segment;
> byte[] array;
>
> static final VarHandle BYTE_HANDLE = MethodHandles.arrayElementVarHandle(byte[].class);
>
> @Setup
> public void setup() {
>     array = new byte[SIZE];
>     segment = MemorySegment.ofArray(array);
> }
>
>
> With this, I obtained the following results:
>
>
> Benchmark                            Mode  Cnt   Score   Error  Units
> ConcurrentClose.memorySegmentAccess  avgt   10  13.879 ± 0.478  us/op
> ConcurrentClose.otherAccess          avgt   10   2.256 ± 0.017  us/op
>
>
> Ugh. It seems like C2 "blows up" at the third iteration:
>
>
> # Run progress: 0.00% complete, ETA 00:05:00
> # Fork: 1 of 1
> # Warmup Iteration 1: 6.712 us/op
> # Warmup Iteration 2: 5.756 us/op
> # Warmup Iteration 3: 13.267 us/op
> # Warmup Iteration 4: 13.267 us/op
> # Warmup Iteration 5: 13.274 us/op
>
>
> This might be a bug/regression. But, let's move on. I then tweaked the induction variable of the memory segment loop to be `long`, not `int`, and I got:
>
>
> Benchmark                            Mode  Cnt  Score   Error  Units
> ConcurrentClose.memorySegmentAccess  avgt   10  2.764 ± 0.016  us/op
> ConcurrentClose.otherAccess          avgt   10  2.240 ± 0.016  us/op
>
>
> Far more respectable! And now we have a good baseline: since both workloads take about the same time, we can use them to draw interesting comparisons. So, let's add back a thread that does a shared arena close:
>
>
> Benchmark                                        Mode  Cnt   Score   Error  Units
> ConcurrentClose.sharedClose                      avgt   10  12.001 ± 0.061  us/op
> ConcurrentClose.sharedClose:closing              avgt   10  19.281 ± 0.323  us/op
> ConcurrentClose.sharedClose:memorySegmentAccess  avgt   10   9.802 ± 0.314  us/op
> ConcurrentClose.sharedClose:otherAccess          avgt   1...
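For reference, the `long` induction variable tweak described above presumably amounts to something like the sketch below (the method name is illustrative, not taken from the actual benchmark); the only difference to `memorySegmentAccess` is that the loop index matches the `long` returned by `segment.byteSize()`:

@Benchmark
public int memorySegmentAccessLongIndex() {
    int sum = 0;
    // long induction variable: no int/long mixing with segment.byteSize()
    for (long i = 0; i < segment.byteSize(); i++) {
        sum += segment.get(JAVA_BYTE, i);
    }
    return sum;
}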
Thanks @mcimadamore, this sounds great! I am so happy that we at least reduced the overhead for threads that do not access memory segments. This will also be the case for Lucene/Solr, because we do not read from segments all the time; other code is often executed between reads from memory segments :-)
So +1 to merge this, and hopefully backport it at least to 21? A backport would be great, but as this is not a bug fix, it is not strictly necessary.
We should open an issue for the `int` induction variable problem and work on JDK-8290892.
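For anyone who wants to reproduce numbers in the spirit of the quoted results, a hypothetical, self-contained sketch of such a JMH group benchmark could look as follows. This is an assumption about the benchmark's shape, not the actual ConcurrentClose source from the PR; the class name, thread counts and the allocate/close loop are illustrative:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import static java.lang.foreign.ValueLayout.JAVA_BYTE;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Group)
public class ConcurrentCloseSketch {

    static final int SIZE = 10_000;

    MemorySegment segment;

    @Setup
    public void setup() {
        // heap segment that is not owned by the shared arenas being closed
        segment = MemorySegment.ofArray(new byte[SIZE]);
    }

    // One thread repeatedly creates and closes shared arenas. Before this PR,
    // each close could affect all Java threads, not only those actually
    // accessing segments of the closed arena.
    @Benchmark
    @Group("sharedClose")
    @GroupThreads(1)
    public void closing() {
        Arena arena = Arena.ofShared();
        arena.allocate(SIZE);
        arena.close();
    }

    // Another thread accesses an unrelated heap segment; the interesting
    // number is how much the concurrent closes slow this thread down.
    @Benchmark
    @Group("sharedClose")
    @GroupThreads(1)
    public int memorySegmentAccess() {
        int sum = 0;
        for (long i = 0; i < segment.byteSize(); i++) {
            sum += segment.get(JAVA_BYTE, i);
        }
        return sum;
    }
}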
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228926251