Strange problem with deoptimization on highly concurrent thread-local handshake reported to Apache Lucene

Uwe Schindler uschindler at apache.org
Tue Jul 2 09:48:05 UTC 2024


Hi Jorn,

many thanks for the analysis! This really looks like the cause of some 
of the issues, and it's great that you have ideas for improvement. I'd 
really like to help in solving them, maybe by benchmarking changes.

I am aware that the current code in Lucene is sub-optimal, but as we are 
transitioning to a new major version of Lucene, we may rethink our I/O 
layer a bit.

Your suggestions already give us some options to implement:

  * Some files are read using an IOContext instance named READ_ONCE
    (it's basically an enum). These are mostly metadata index files
    which are loaded and decoded to heap, and they are normally closed
    soon after. We have two options to handle those: (a) use
    conventional I/O for those (but this does not allow us to use
    madvise - and fadvise is not available for FileChannel:
    https://bugs.openjdk.org/browse/JDK-8329256) or (b) use a confined
    arena for Lucene's READ_ONCE IOContext. This would dramatically
    reduce the number of short-lived shared arenas.
  * Grouping several files into the same Arena is also a good idea, but
    harder to implement. An idea that came to my mind last night would
    be to keep all files with the same segment number in the same
    Arena. This could be implemented by refcounting (keep a refcount
    per filename prefix): whenever a file is opened, increment the
    counter, and release the arena when it goes back to 0 (see the
    rough sketch after this list). This would be backwards compatible
    and would leave the grouping of Arenas to the I/O layer of Lucene,
    which unfortunately needs to have knowledge about the file naming
    policy. I think we can live with that!
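
To make the refcounting idea concrete, here is a rough sketch of what I 
have in mind (class and method names like SegmentArenas, acquire and 
release are made up for illustration; this is not actual Lucene code):

import java.lang.foreign.Arena;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch (not Lucene code): one shared Arena per segment prefix,
// refcounted per open file, closed when the last file of that segment
// is closed.
final class SegmentArenas {

    private static final class RefCountedArena {
        final Arena arena = Arena.ofShared();
        int refCount = 0; // only mutated inside compute(), which locks per key
    }

    private final ConcurrentHashMap<String, RefCountedArena> arenas =
            new ConcurrentHashMap<>();

    // Acquire the shared arena for a segment prefix (e.g. "_7f").
    Arena acquire(String segmentPrefix) {
        return arenas.compute(segmentPrefix, (key, rc) -> {
            if (rc == null) {
                rc = new RefCountedArena();
            }
            rc.refCount++;
            return rc;
        }).arena;
    }

    // Release one reference; close the arena once the count drops to 0.
    void release(String segmentPrefix) {
        RefCountedArena[] last = new RefCountedArena[1];
        arenas.compute(segmentPrefix, (key, rc) -> {
            if (rc == null) {
                throw new IllegalStateException("not acquired: " + key);
            }
            if (--rc.refCount == 0) {
                last[0] = rc;
                return null; // remove the mapping
            }
            return rc;
        });
        if (last[0] != null) {
            last[0].arena.close(); // the expensive shared close, now once per segment
        }
    }
}

On file open, the I/O layer would call acquire(prefix) for the file's 
segment prefix, and release(prefix) on close, so the expensive shared 
Arena#close() runs only once per segment instead of once per file.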

So I will work on this!

The small fixes you could add would be great, as they should hopefully 
increase the throughput. I was not aware of how the deoptimization 
works, thanks for the explanation. But still, I see a problem when the 
top frame already has a large amount of inlined code. Is it correct 
that once the handshake is done, the optimized top frame is enabled 
again, so the deoptimization does not actually require a reanalysis of 
the code (i.e., it is revertible)? If that's the case it would be fine. 
Otherwise it's a lot of work to recover from the deopt.

It would really be good to only deoptimize threads that use 
MemorySegment, although this would still cause slowdowns for running 
searches, because most of the Lucene query execution code works on 
MemorySegments or at least involves calls to them.

Normal non-metadata index files are normally long-lived and are 
accessed by many threads, so basically that is all fine. This works 
well unless you have some huge Apache Solr instances with many, many 
indexes (this was the issue of David Smiley). Although closing index 
files is seldom for any given index, with many of them the load adds 
up, especially if the closes are distributed across several threads 
(which is what Apache Solr does). One improvement here would be to do 
things like reloading/refreshing indexes in a single thread for the 
whole Solr instance. I think Elasticsearch does this and therefore they 
don't see that issue. I have to confirm this with them.

Last question: do you have an idea why putting a global lock around 
all calls to Arena#close() helps throughput?
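
For reference, the global lock workaround mentioned below boils down to 
something like this minimal sketch (the ArenaCloser helper and lock name 
are made up):

import java.lang.foreign.Arena;

// Minimal sketch of the workaround: serialize all shared-arena closes
// through one global lock so no two close() handshakes run at the same
// time.
final class ArenaCloser {
    private static final Object GLOBAL_CLOSE_LOCK = new Object();

    static void close(Arena arena) {
        synchronized (GLOBAL_CLOSE_LOCK) {
            arena.close();
        }
    }
}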

Uwe

Am 01.07.2024 um 23:29 schrieb Jorn Vernee:
>
> Hello Uwe,
>
> I've read the various github threads, and I think I have a good idea 
> of what's going on. Just to recap how shared arena closure works:
>
> - We set the arena's isAlive = false (more or less)
> - We submit a handshake from the closing thread to all other threads
> - During that handshake, we check whether each thread is accessing the 
> arena we are trying to close.
>   - Unfortunately, this requires deoptimizing the top-most frame of a 
> thread, due to: JDK-8290892 [1]. But note that this is a 'one off' 
> unpacking of the top-most frame. After which the method finishes 
> running in the interpreter, and then goes back to executing compiled code.
> - If we find a thread accessing the same arena, we make it throw an 
> exception, to bail out of the access.
> - After the handshake finishes, we know that either: 1) no other 
> thread was accessing the arena, and they will see the up-to-date 
> isAlive = false because the handshake also works as a means of 
> synchronization. Or 2) if they were accessing the arena, they will get 
> an exception.
> - After that it's safe to free the memory.
>
> So, if shared arenas are closed very frequently, as seems to be the 
> case in the problematic scenarios, we're likely overwhelming all other 
> threads with handshakes and deoptimizations.
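>
> To make the user-visible outcome concrete, here is a minimal sketch 
> (illustration only, class name made up; it shows the API contract, 
> not the VM internals): one thread loops over reads of a segment while 
> another thread closes the segment's shared arena, and the reader 
> either completes each access or bails out with an 
> IllegalStateException.
>
> import java.lang.foreign.Arena;
> import java.lang.foreign.MemorySegment;
> import java.lang.foreign.ValueLayout;
>
> // Sketch: a read racing with a shared-arena close either succeeds or
> // throws IllegalStateException once the close becomes visible.
> public class RacyCloseSketch {
>     public static void main(String[] args) throws Exception {
>         Arena shared = Arena.ofShared();
>         MemorySegment seg = shared.allocate(1024);
>
>         Thread reader = new Thread(() -> {
>             try {
>                 while (true) {
>                     seg.get(ValueLayout.JAVA_BYTE, 0); // liveness-checked access
>                 }
>             } catch (IllegalStateException e) {
>                 System.out.println("bailed out of access: " + e);
>             }
>         });
>         reader.start();
>
>         Thread.sleep(100);
>         shared.close(); // submits the handshake described above
>         reader.join();
>     }
> }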
>
> Addressing JDK-8290892 would mean that we only need to deoptimize 
> threads that are actually accessing the arena that is being closed. 
> Another possible interim improvement could be to only deoptimize 
> threads that are actually in the middle of accessing a memory segment, 
> not just all threads. That's a relatively low-hanging fruit I noticed 
> recently, but haven't had time to look into yet. I've filed: 
> https://bugs.openjdk.org/browse/JDK-8335480
>
> This would of course still leave the handshakes, which also require a 
> bunch of VM-internal synchronization between threads. Shared arenas 
> are really only meant for long-lived lifetimes. So, either way, you 
> may want to consider ways of grouping resources into the same shared 
> arena, to reduce the total number of closures needed, or giving users 
> the option of indicating that they only need to access a resource from 
> a single thread, and then switching to a confined arena behind the 
> scenes (which is much cheaper to close, as you've found).
>
> Jorn
>
> [1]: https://bugs.openjdk.org/browse/JDK-8290892
>
> On 1-7-2024 13:52, Uwe Schindler wrote:
>>
>> Hi Panama people, hello Maurizio,
>>
>> sending this again to the mailing list; we previously had a private 
>> discussion with Maurizio. Maybe somebody else has an idea or might 
>> figure out what the problem is. We are not yet ready to open an issue 
>> against Java 19 through 22.
>>
>> There were several issues reported by users of Apache Lucene (a 
>> wrongly-written benchmark and also Solr users) about bad performance 
>> in highly concurrent environments. What was found out is that when 
>> you have many threads closing shared arenas, under some circumstances 
>> all "reader threads" (those accessing a MemorySegment, no matter 
>> which arena they use) suddenly deoptimize. This causes immense 
>> slowdowns during Lucene searches.
>>
>> Luckily one of our committers found a workaround, and we are working 
>> on a benchmark showing the issue. But first let me explain what 
>> happens:
>>
>>   * Lucene opens MemorySegments with a shared Arena (one per file)
>>     and accesses them from multiple threads. Basically, for each
>>     index file we have a shared arena which is closed when the file
>>     is closed.
>>   * There are many shared arenas (one per index file)!!!
>>   * If you close a shared arena, normally you see no large delay on
>>     the thread calling close and also no real effect on any thread
>>     that reads from other MemorySegments.
>>   * But under certain circumstances ALL reading threads accessing any
>>     MemorySegment slow down dramatically! So once you close one of
>>     our arenas, threads using MemorySegments from different shared
>>     arenas suddenly get deoptimized (we have HotSpot logs showing
>>     this). Of course, the MemorySegment belonging to the closed arena
>>     is no longer used, so in reality it should be the only affected
>>     one (throwing an IllegalStateException).
>>   * The problem seems to occur mainly when multiple arenas are closed
>>     in highly concurrent environments. This is why we did not see the
>>     issue before.
>>   * If we put a global lock around all calls to Arena#close(), the
>>     issues seem to go away.
>>
>> I plan to write a benchmark showing this issue. Do you have an idea 
>> what could go wrong? To me it looks like a race in the thread-local 
>> handshakes which may cause some crazy HotSpot behaviour, leading to 
>> deoptimization of all threads concurrently accessing MemorySegments 
>> once Arena#close() is called in highly concurrent environments.
>>
>> This is the main issue where the observation is tracked: 
>> https://github.com/apache/lucene/issues/13325
>>
>> These are issues opened:
>>
>>   * https://github.com/dacapobench/dacapobench/issues/264 (the issue
>>     with this benchmark was of course that they were opening/closing
>>     too often, but it still exposed the problem, so it was very
>>     helpful). Funny detail: Aleksey Shipilev opened the issue!
>>   * This was the comment showing the issue at huge installations of
>>     Apache Solr 9.7:
>>     https://github.com/apache/lucene/pull/13146#pullrequestreview-2089347714
>>     (David Smiley also talked to me at Berlin Buzzwords). They had to
>>     disable MemorySegment usage in Apache Solr/Lucene in their
>>     environment. They have machines with thousands of indexes (and
>>     therefore tens of thousands of arenas) open at the same time, and
>>     the close rate is very, very high!
>>
>> Uwe
>>
>> -- 
>> Uwe Schindler
>> uschindler at apache.org  
>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
>> Bremen, Germany
>> https://lucene.apache.org/
>> https://solr.apache.org/

-- 
Uwe Schindler
uschindler at apache.org  
ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
Bremen, Germany
https://lucene.apache.org/
https://solr.apache.org/