[External] : Re: Strange problem with deoptimization on highly concurrent thread-local handshake reported to Apache Lucene
Jorn Vernee
jorn.vernee at oracle.com
Wed Jul 10 10:53:26 UTC 2024
I've been looking into this. It turns out that we actually need to
deoptimize in cases where we are /not/ inside an @Scoped method, due to
the JIT creating the following code shape:
    liveness check (from @Scoped method)
    for (...) {
        for (...) { // strip-mining inner loop
            memory access (from @Scoped method)
        }
        safepoint   <-- STOPPED HERE
    }
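
For context, a simple read loop like the one below typically compiles to
that shape: the liveness check from the @Scoped accessor is hoisted out
of the loop, and C2 strip-mines the body around the outer loop's
safepoint poll. (This is a sketch of my own, not code taken from Lucene:)

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class SumSegment {
        // getAtIndex bottoms out in an @Scoped method that checks the
        // scope's liveness before the raw memory access; C2 hoists that
        // check out of this counted loop and strip-mines the body.
        static long sum(MemorySegment segment) {
            long sum = 0;
            long n = segment.byteSize() / Long.BYTES;
            for (long i = 0; i < n; i++) {
                sum += segment.getAtIndex(ValueLayout.JAVA_LONG, i);
            }
            return sum;
        }

        public static void main(String[] args) {
            try (Arena arena = Arena.ofShared()) {
                MemorySegment segment = arena.allocate(ValueLayout.JAVA_LONG, 1 << 16);
                System.out.println(sum(segment));
            }
        }
    }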
In a shape like this, we might do a memory access after the safepoint
without doing a liveness check first. This is problematic because the
memory might belong to a scope that was closed while we were stopped at
the safepoint. So, in a case like that, deoptimizing gets us back to a
code shape like this:
    for (...) {
        call to ScopedMemoryAccess
        safepoint   <-- STOPPED HERE
    }
And now we will do a liveness check before the next memory access. If
the scope is closed at that point, we get an exception.
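
That exception is what a racing reader observes at the Java level. Here
is a minimal sketch of a reader racing with a close (my own code, not
from Lucene or the JDK tests):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class RacingClose {
        public static void main(String[] args) throws InterruptedException {
            Arena arena = Arena.ofShared();
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_LONG, 1024);

            Thread reader = new Thread(() -> {
                long sum = 0;
                try {
                    for (long i = 0; ; i = (i + 1) % 1024) { // access until the scope dies
                        sum += segment.getAtIndex(ValueLayout.JAVA_LONG, i);
                    }
                } catch (IllegalStateException e) {
                    // the liveness check failed: the scope was closed under us
                    System.out.println("access stopped: " + e.getMessage());
                }
            });
            reader.start();
            Thread.sleep(100);
            arena.close(); // handshakes with the reader, then frees the memory
            reader.join();
        }
    }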
Now, after [1] is fixed, we could make the deoptimization conditional on
whether the scope of the segment is still alive at this safepoint,
but... Unfortunately, even if I disable the deoptimizations altogether,
it doesn't really have any effect on the benchmark I'm testing with [2].
So, it looks like the main performance cost of closing a shared arena
does not come from deoptimizing.
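
For reference, the shape of that benchmark is roughly the following:
reader threads hammer a long-lived segment while another thread
repeatedly opens and closes unrelated shared arenas. (A simplified,
plain-Java sketch of mine, not the actual JMH code in [2]:)

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.util.concurrent.atomic.AtomicBoolean;

    class ConcurrentCloseSketch {
        public static void main(String[] args) throws InterruptedException {
            AtomicBoolean done = new AtomicBoolean();
            try (Arena readArena = Arena.ofShared()) {
                MemorySegment segment = readArena.allocate(ValueLayout.JAVA_LONG, 1024);

                // Reader threads: the "victims" of the handshakes.
                Thread[] readers = new Thread[4];
                for (int t = 0; t < readers.length; t++) {
                    readers[t] = new Thread(() -> {
                        long sum = 0;
                        while (!done.get()) {
                            for (long i = 0; i < 1024; i++) {
                                sum += segment.getAtIndex(ValueLayout.JAVA_LONG, i);
                            }
                        }
                        System.out.println(sum); // keep the loop live
                    });
                    readers[t].start();
                }

                // Closing thread (here: main). Each close handshakes all threads.
                long start = System.nanoTime();
                for (int i = 0; i < 10_000; i++) {
                    Arena.ofShared().close();
                }
                System.out.printf("10000 closes took %d ms%n",
                        (System.nanoTime() - start) / 1_000_000);

                done.set(true);
                for (Thread r : readers) {
                    r.join();
                }
            }
        }
    }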
I'll keep looking at this, but as far as optimizing this use case goes,
deoptimization looks like a dead end.
Jorn
[1]: https://bugs.openjdk.org/browse/JDK-8290892
[2]:
https://github.com/JornVernee/jdk/blob/FasterSharedClose/test/micro/org/openjdk/bench/java/lang/foreign/ConcurrentClose.java
On 2-7-2024 14:19, Jorn Vernee wrote:
>
> > I'd really like to help in solving them, maybe by benchmarking changes.
>
> If you could come up with a benchmark (preferably one that doesn't
> require any dependencies), that would be very useful.
>
> > Is it correct that once the handshake is done, it enables the
> > optimized top frame again, so the deoptimization does not actually
> > require a reanalysis of the code (so it's reversible)?
>
> Yes, the deoptimization that happens doesn't throw away the compiled code.
>
> > Last question: Do you have an idea why putting a global lock around
> > Arena#close() helps for throughput?
>
> I think this may simply be because it slows down the rate of
> handshakes/deoptimizations, so it allows other threads to get more
> work done in the meantime. But it also seems like something worth
> investigating.
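>
> To be concrete, my understanding of that workaround is essentially the
> following (a sketch; the class and lock object are my own invention):
>
>     import java.lang.foreign.Arena;
>
>     class ArenaCloser {
>         private static final Object CLOSE_LOCK = new Object();
>
>         // Serialize all shared-arena closes process-wide, so that at
>         // most one handshake with all threads is in flight at a time.
>         static void close(Arena arena) {
>             synchronized (CLOSE_LOCK) {
>                 arena.close();
>             }
>         }
>     }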
>
> Jorn
>
> On 2-7-2024 11:48, Uwe Schindler wrote:
>>
>> Hi Jorn,
>>
>> many thanks for the analysis! This really looks like the cause of some
>> of the issues, and it's great that you have ideas for improvement. I'd
>> really like to help in solving them, maybe by benchmarking changes.
>>
>> I am aware that the current code in Lucene is sub-optimal, but as this
>> mainly happens on the transition to a new major version of Lucene, we
>> may rethink our I/O layer a bit.
>>
>> Actually, your suggestions already give us options to implement:
>>
>> * Some files are read using an IOContext instance named READ_ONCE
>>   (it's basically an enum). These are mostly metadata index files,
>>   which are loaded and decoded onto the heap and normally closed
>>   soon after. We have two options to handle those: (a) use
>>   conventional I/O for those (but this does not allow us to use
>>   madvise, and fadvise is not available for FileChannel:
>>   https://bugs.openjdk.org/browse/JDK-8329256), or (b) use a
>>   confined arena for Lucene's READ_ONCE IOContext. This would
>>   dramatically reduce the number of short-lived arenas.
>> * Grouping several files into the same Arena is also a good idea,
>>   but hard to implement. An idea that came to my mind last night
>>   would be to keep all files with the same segment number in the
>>   same Arena. This could be implemented by refcounting (keep a
>>   refcount per filename prefix): whenever a file is opened, the
>>   counter is incremented, and the arena is closed when it goes
>>   back to 0 (see the sketch after this list). This would be
>>   backwards compatible and would leave the grouping of Arenas to
>>   Lucene's I/O layer, which unfortunately needs to have knowledge
>>   about the file naming policy. I think we can live with that!
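>>
>> A rough sketch of that refcounting idea (all names are my own, not
>> actual Lucene code):
>>
>>     import java.lang.foreign.Arena;
>>     import java.util.HashMap;
>>     import java.util.Map;
>>
>>     // One shared Arena per filename prefix (e.g. per Lucene segment
>>     // number); the arena is only closed when the last file using it
>>     // is closed.
>>     class ArenaGroups {
>>         private static final class Group {
>>             final Arena arena = Arena.ofShared();
>>             int refCount;
>>         }
>>
>>         private final Map<String, Group> groups = new HashMap<>();
>>
>>         synchronized Arena acquire(String prefix) {
>>             Group g = groups.computeIfAbsent(prefix, p -> new Group());
>>             g.refCount++;
>>             return g.arena;
>>         }
>>
>>         synchronized void release(String prefix) {
>>             Group g = groups.get(prefix);
>>             if (--g.refCount == 0) {
>>                 groups.remove(prefix);
>>                 g.arena.close(); // last file of this group: really close
>>             }
>>         }
>>     }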
>>
>> So I will work on this!
>>
>> The small fixes you could add would be great, as they should hopefully
>> increase the throughput. I was not aware of how the deoptimization
>> works; thanks for the explanation. But I still see a problem when
>> the top frame already has a large amount of inlined code. Is it
>> correct that once the handshake is done, it enables the optimized
>> top frame again, so the deoptimization does not actually require a
>> reanalysis of the code (so it's reversible)? If that's the case it
>> would be fine. Otherwise it's a lot of work to recover from the
>> deopt.
>>
>> It would really be good to only deoptimize threads that use
>> MemorySegment, although this would still cause slowdowns for running
>> searches, because most of the Lucene query execution code works on
>> MemorySegments, or at least involves calls to them.
>>
>> Normal non-metadata index files are usually long-lived, and they are
>> accessed by many threads, so basically this is all fine. It works
>> well unless you have some huge Apache Solr instances with very many
>> indexes (this was David Smiley's issue). Although closing index
>> files is rare for any given index, with many of them the closes add
>> up to a fair amount of load, especially when they are distributed
>> across several threads (which is what Apache Solr does). One
>> improvement here would be to do things like reloading/refreshing
>> indexes in one single thread for a whole Solr instance. I think
>> Elasticsearch does this, and therefore they don't see that issue. I
>> have to confirm this with them.
>>
>> Last question: Do you have an idea why putting a global lock around
>> Arena#close() helps for throughput?
>>
>> Uwe
>>
>> On 01.07.2024 at 23:29, Jorn Vernee wrote:
>>>
>>> Hello Uwe,
>>>
>>> I've read the various github threads, and I think I have a good idea
>>> of what's going on. Just to recap how shared arena closure works:
>>>
>>> - We set the arena's isAlive = false (more or less)
>>> - We submit a handshake from the closing thread to all other threads
>>> - During that handshake, we check whether each thread is accessing
>>> the arena we are trying to close.
>>> - Unfortunately, this requires deoptimizing the top-most frame of
>>> a thread, due to JDK-8290892 [1]. But note that this is a 'one off'
>>> unpacking of the top-most frame, after which the method finishes
>>> running in the interpreter and then goes back to executing compiled
>>> code.
>>> - If we find a thread accessing the same arena, we make it throw an
>>> exception, to bail out of the access.
>>> - After the handshake finishes, we know that either: 1) no other
>>> thread was accessing the arena, and they will see the up-to-date
>>> isAlive = false, because the handshake also works as a means of
>>> synchronization; or 2) if they were accessing the arena, they will
>>> get an exception.
>>> - After that it's safe to free the memory.
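>>>
>>> Schematically, the closing thread does something like this (pseudocode
>>> with hypothetical helper names, not the actual VM/JDK code):
>>>
>>>     void closeSharedArena(arena) {
>>>         arena.isAlive = false;            // 1. mark the scope dead (more or less)
>>>         handshakeAllThreads(thread -> {   // 2. stop each thread at a safepoint
>>>             deoptimizeTopFrame(thread);   //    needed due to JDK-8290892
>>>             if (isAccessing(thread, arena)) {
>>>                 throwInThread(thread, new IllegalStateException());
>>>             }
>>>         });
>>>         freeMemory(arena);                // 3. safe: no thread can access it anymore
>>>     }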
>>>
>>> So, if shared arenas are closed very frequently, as seems to be the
>>> case in the problematic scenarios, we're likely overwhelming all
>>> other threads with handshakes and deoptimizations.
>>>
>>> Addressing JDK-8290892 would mean that we only need to deoptimize
>>> threads that are actually accessing the arena that is being closed.
>>> Another possible interim improvement could be to only deoptimize
>>> threads that are actually in the middle of accessing a memory
>>> segment, not just all threads. That's a relatively low-hanging fruit
>>> I noticed recently, but haven't had time to look into yet. I've
>>> filed: https://bugs.openjdk.org/browse/JDK-8335480
>>>
>>> This would of course still leave the handshakes, which also require
>>> a bunch of VM-internal synchronization between threads. Shared
>>> arenas are really only meant for long lifetimes. So, either
>>> way, you may want to consider ways of grouping resources into the
>>> same shared arena, to reduce the total number of closures needed, or
>>> giving users the option of indicating that they only need to access
>>> a resource from a single thread, and then switching to a confined
>>> arena behind the scenes (which is much cheaper to close, as you've
>>> found).
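>>>
>>> For that last option, the switch could be as small as this sketch (the
>>> class, method, and flag are hypothetical):
>>>
>>>     import java.lang.foreign.Arena;
>>>
>>>     class Arenas {
>>>         // Confined arenas are much cheaper to close (no handshake with
>>>         // other threads), so prefer one whenever the caller promises
>>>         // single-threaded access.
>>>         static Arena arenaFor(boolean singleThreadedAccess) {
>>>             return singleThreadedAccess ? Arena.ofConfined() : Arena.ofShared();
>>>         }
>>>     }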
>>>
>>> Jorn
>>>
>>> [1]: https://bugs.openjdk.org/browse/JDK-8290892
>>>
>>> On 1-7-2024 13:52, Uwe Schindler wrote:
>>>>
>>>> Hi Panama people, hello Maurizio,
>>>>
>>>> I'm sending this again to the mailing list; so far we only had a
>>>> private discussion with Maurizio. Maybe someone else has an idea or
>>>> might figure out what the problem is. We are not yet ready to open an
>>>> issue against Java 19 through 22.
>>>>
>>>> There were several issues reported by users of Apache Lucene (a
>>>> wrongly-written benchmark and also Solr users) about bad
>>>> performance in highly concurrent environments. What was found out
>>>> is that when you have many threads closing shared arenas, under
>>>> some circumstances all "reader threads" (those accessing a
>>>> MemorySegment, no matter which arena they use) suddenly
>>>> deoptimize. This causes immense slowdowns during Lucene searches.
>>>>
>>>> Luckily, one of our committers found a workaround, and we are
>>>> looking into writing a benchmark showing the issue. But first let
>>>> me explain what happens:
>>>>
>>>> * Lucene opens MemorySegments with a shared Arena (one per file)
>>>>   and accesses them from multiple threads. Basically, for each
>>>>   index file we have a shared arena, which is closed when the
>>>>   file is closed.
>>>> * There are many shared arenas (one per index file)!!!
>>>> * If you close a shared arena, you normally see no large delay on
>>>>   the thread calling the close, and also no real effect on any
>>>>   thread that reads from other MemorySegments.
>>>> * But under certain circumstances ALL reading threads accessing
>>>>   any MemorySegment slow down dramatically! So once you close one
>>>>   of our arenas, all other MemorySegments using a different
>>>>   shared arena suddenly get deoptimized (we have HotSpot logs
>>>>   showing this). Of course, the MemorySegment belonging to the
>>>>   closed arena is no longer used, and in reality it should be
>>>>   the only affected one (throwing an IllegalStateException).
>>>> * The problem seems to occur mainly when multiple arenas are
>>>>   closed in highly concurrent environments. This is why we did
>>>>   not see the issue before.
>>>> * If we put a global lock around all calls to Arena#close(), the
>>>>   issues seem to go away.
>>>>
>>>> I plan to write some benchmark showing this issue. Do you have an
>>>> idea what could go wrong? To me it looks like a race in the
>>>> thread-local handshakes, which may cause some crazy HotSpot
>>>> behaviour: deoptimization of all threads concurrently
>>>> accessing MemorySegments once Arena#close() is called in highly
>>>> concurrent environments.
>>>>
>>>> This is the main issue where the observation is tracked:
>>>> https://github.com/apache/lucene/issues/13325
>>>>
>>>> These are the issues that were opened:
>>>>
>>>> * https://github.com/dacapobench/dacapobench/issues/264 (the
>>>>   issue in this benchmark was of course that they were
>>>>   opening/closing too often, but it nicely exposed the
>>>>   problem, so it was very helpful). Funny detail: Alexey Shipilev
>>>>   opened the issue!
>>>> * This was the comment showing the issue at huge installations of
>>>>   Apache Solr 9.7:
>>>>   https://github.com/apache/lucene/pull/13146#pullrequestreview-2089347714
>>>>   (David Smiley also talked to me at Berlin Buzzwords). They had
>>>>   to disable MemorySegment usage in Apache Solr/Lucene in their
>>>>   environment. They have machines with thousands of indexes (and
>>>>   therefore tens of thousands of arenas) open at the same time,
>>>>   and the close rate is very, very high!
>>>>
>>>> Uwe
>>>>
>>>> --
>>>> Uwe Schindler
>>>> uschindler at apache.org
>>>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
>>>> Bremen, Germany
>>>> https://lucene.apache.org/
>>>> https://solr.apache.org/
>> --
>> Uwe Schindler
>> uschindler at apache.org
>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
>> Bremen, Germany
>> https://lucene.apache.org/
>> https://solr.apache.org/