[External] : Re: Strange problem with deoptimization on highly concurrent thread-local handshake reported to Apache Lucene

Jorn Vernee jorn.vernee at oracle.com
Tue Jul 2 12:19:12 UTC 2024


 > I'd really like to help in solving them, maybe by benchmarking changes.

If you could come up with a benchmark (preferably one that doesn't 
require any dependencies), that would be very useful.
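
Something along these lines would already be a good starting point: a 
minimal, dependency-free sketch (all class names, sizes and thread counts 
below are made up) in which reader threads scan segments owned by 
long-lived shared arenas while one thread rapidly opens and closes 
unrelated shared arenas. Comparing reader iterations with the closer 
thread enabled vs. disabled should show the effect.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.concurrent.atomic.LongAdder;

// Sketch only: reader threads scan segments owned by long-lived shared
// arenas while one thread rapidly opens and closes unrelated shared arenas.
public class SharedArenaCloseBench {
    static final int READERS = 8;            // made-up parameters
    static final long SEG_SIZE = 1 << 20;
    static volatile boolean stop = false;
    static volatile long sink;               // keeps the read loop from being eliminated
    static final LongAdder iterations = new LongAdder();

    public static void main(String[] args) throws Exception {
        Thread[] readers = new Thread[READERS];
        for (int i = 0; i < READERS; i++) {
            Arena arena = Arena.ofShared();              // long-lived, never closed here
            MemorySegment seg = arena.allocate(SEG_SIZE);
            readers[i] = Thread.ofPlatform().start(() -> {
                long sum = 0;
                while (!stop) {
                    for (long off = 0; off < SEG_SIZE; off += Long.BYTES) {
                        sum += seg.get(ValueLayout.JAVA_LONG, off);
                    }
                    iterations.increment();
                }
                sink = sum;
            });
        }
        Thread closer = Thread.ofPlatform().start(() -> {
            while (!stop) {
                Arena shortLived = Arena.ofShared();     // unrelated arena
                shortLived.allocate(4096);
                shortLived.close();                      // handshakes all threads
            }
        });
        Thread.sleep(10_000);
        stop = true;
        for (Thread t : readers) t.join();
        closer.join();
        System.out.println("reader iterations: " + iterations.sum());
    }
}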

 > Is it correct that once the handshake is done, it enables the 
optimized top frame again, so the deoptimization does not actually 
require a reanalysis of the code (i.e. it's revertible)?

Yes, the deoptimization that happens doesn't throw away the compiled code.

 > Last question: Do you have an idea why putting a global lock around 
Arena#close() helps for throughput?

I think this may simply be because it slows down the rate of 
handshakes/deoptimizations, so it allows other threads to get more work 
done in the meantime. But it also seems like something worth 
investigating.
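
For reference, my understanding of that workaround from the Lucene 
discussion is that it boils down to serializing all closes through one 
global lock, roughly like the hypothetical helper below:

import java.lang.foreign.Arena;

// Hypothetical helper mirroring the described workaround: every close goes
// through one global lock, so the handshakes triggered by Arena#close()
// never run concurrently.
final class SerializedArenaCloser {
    private static final Object GLOBAL_CLOSE_LOCK = new Object();

    static void close(Arena arena) {
        synchronized (GLOBAL_CLOSE_LOCK) {
            arena.close();
        }
    }
}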

Jorn

On 2-7-2024 11:48, Uwe Schindler wrote:
>
> Hi Jorn,
>
> many thanks for the analysis! This really looks like the cause of some 
> of the issues, and it's great that you have ideas for improvement. I'd 
> really like to help in solving them, maybe by benchmarking changes.
>
> I am aware that the current code in Lucene is sub-optimal, but as we 
> are transitioning to a new major version of Lucene anyway, we may 
> rethink our I/O layer a bit.
>
> Actually, your suggestions already give us some options to implement:
>
>   * Some files are read using an IOContext instance named READ_ONCE
>     (it's basically an enum). These are mostly metadata index
>     files which are loaded and decoded onto the heap. They are normally
>     closed soon after. We have two options to handle those: (a) use
>     conventional I/O for them (but this does not allow us to use
>     madvise - and fadvise is not available for FileChannel:
>     https://bugs.openjdk.org/browse/JDK-8329256) or (b) use a confined
>     arena for Lucene's READ_ONCE IOContext. This would dramatically
>     reduce the number of short-lived arenas.
>   * Grouping several files into the same Arena is also a good idea,
>     but very hard to implement. An idea that came to my
>     mind last night would be to keep all files with the same segment
>     number in the same Arena. This could be implemented by refcounting
>     (keep a refcount per filename prefix): increment the counter
>     whenever a file is opened and release the arena when the count goes
>     back to 0. This would be backwards compatible and would leave the
>     grouping of Arenas to the I/O layer of Lucene, which unfortunately
>     needs knowledge of the file naming policy. I think we can
>     live with that! (A rough sketch of this follows after this list.)
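>
> A rough, hypothetical sketch of that refcounting scheme (class and 
> method names are made up, the prefix parsing is simplified, and this is 
> not actual Lucene code):
>
> import java.lang.foreign.Arena;
> import java.util.concurrent.ConcurrentHashMap;
>
> // All files sharing a segment prefix (e.g. "_42") borrow the same shared
> // Arena; the arena is closed when the last file of that segment is closed.
> final class SegmentArenas {
>     private static final class RefCountedArena {
>         final Arena arena = Arena.ofShared();
>         int refCount = 0;                    // -1 means already closed
>     }
>
>     private final ConcurrentHashMap<String, RefCountedArena> arenas =
>             new ConcurrentHashMap<>();
>
>     Arena acquire(String fileName) {
>         String prefix = segmentPrefix(fileName);
>         for (;;) {
>             RefCountedArena rc =
>                     arenas.computeIfAbsent(prefix, k -> new RefCountedArena());
>             synchronized (rc) {
>                 if (rc.refCount < 0) continue;   // lost a race with a close; retry
>                 rc.refCount++;
>                 return rc.arena;
>             }
>         }
>     }
>
>     void release(String fileName) {
>         String prefix = segmentPrefix(fileName);
>         RefCountedArena rc = arenas.get(prefix);
>         synchronized (rc) {
>             if (--rc.refCount == 0) {
>                 rc.refCount = -1;                // mark closed so acquires retry
>                 arenas.remove(prefix, rc);
>                 rc.arena.close();
>             }
>         }
>     }
>
>     private static String segmentPrefix(String fileName) {
>         int dot = fileName.indexOf('.');     // "_42.cfs" -> "_42"; real parsing would be smarter
>         return dot < 0 ? fileName : fileName.substring(0, dot);
>     }
> }
>
> Files opened with the READ_ONCE IOContext would bypass this entirely and 
> just use Arena.ofConfined(), as in option (b) above.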
>
> So I will work on this!
>
> The small fixes you could add would be great, as this should hopefully 
> increase the throughput. I was not aware of how the deoptimization works, 
> thanks for the explanation. But I still see a problem when the top 
> frame already has a large amount of inlined code. Is it correct that 
> once the handshake is done, it enables the optimized top frame again, 
> so the deoptimization does not actually require a reanalysis of the 
> code (i.e. it's revertible)? If that's the case it would be 
> fine. Otherwise it's a lot of work to recover from the deopt.
>
> It would really be good to only deoptimize threads that use 
> MemorySegment, although this would still cause slowdowns for running 
> searches, because most of the Lucene query execution code works 
> on MemorySegments or at least involves calls to them.
>
> Normal non-metadata index files are usually long-lived and they are 
> accessed by many threads, so basically this is all fine. It works 
> well unless you have some huge "Apache Solr" instances with many, many 
> indexes (this was the issue of David Smiley). Although closing of 
> index files is rare for any given index, if you have many of them it 
> adds up to quite a bit of load when the closes are distributed across 
> several threads (which is what Apache Solr is doing). One 
> improvement here would be to do things like reloading/refreshing 
> indexes in one single thread for a whole Solr instance. I think 
> Elasticsearch is doing this and therefore they don't see that issue. I 
> have to confirm this with them.
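>
> As an illustration of that idea (purely a sketch, not Solr code): all 
> reloads/closes could be funneled through one single-threaded executor 
> per instance, so only one thread ever triggers Arena#close():
>
> import java.lang.foreign.Arena;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Illustrative only: one single-threaded executor per instance performs
> // all index close work, so arena closes never happen concurrently from
> // many threads. (A real implementation would also shut the executor down.)
> final class IndexMaintenance {
>     private static final ExecutorService CLOSER =
>             Executors.newSingleThreadExecutor();
>
>     static void closeLater(Arena arena) {
>         CLOSER.submit(arena::close);
>     }
> }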
>
> Last question: Do you have an idea why putting a global lock around 
> Arena#close() helps for throughput?
>
> Uwe
>
> On 1-7-2024 23:29, Jorn Vernee wrote:
>>
>> Hello Uwe,
>>
>> I've read the various github threads, and I think I have a good idea 
>> of what's going on. Just to recap how shared arena closure works:
>>
>> - We set the arena's isAlive = false (more or less)
>> - We submit a handshake from the closing thread to all other threads
>> - During that handshake, we check whether each thread is accessing 
>> the arena we are trying to close.
>>   - Unfortunately, this requires deoptimizing the top-most frame of a 
>> thread, due to: JDK-8290892 [1]. But note that this is a 'one off' 
>> unpacking of the top-most frame, after which the method finishes 
>> running in the interpreter and then goes back to executing compiled 
>> code.
>> - If we find a thread accessing the same arena, we make it throw an 
>> exception, to bail out of the access.
>> - After the handshake finishes, we know that either: 1) no other 
>> thread was accessing the arena, and they will see the up-to-date 
>> isAlive = false because the handshake also works as a means of 
>> synchronization. Or 2) if they were accessing the arena, they will 
>> get an exception.
>> - After that it's safe to free the memory.
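>>
>> A minimal example of what that sequence looks like from the user's 
>> side (a sketch; the exact timing is of course not deterministic):
>>
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.lang.foreign.ValueLayout;
>>
>> // Sketch: a reader loops over a segment while the main thread closes the
>> // shared arena. The reader either observes the closed state up front or
>> // is kicked out of an in-flight access; both surface as an
>> // IllegalStateException, while the close itself succeeds.
>> public class SharedCloseDemo {
>>     public static void main(String[] args) throws Exception {
>>         Arena arena = Arena.ofShared();
>>         MemorySegment seg = arena.allocate(1024);
>>         Thread reader = Thread.ofPlatform().start(() -> {
>>             try {
>>                 while (true) {
>>                     seg.get(ValueLayout.JAVA_BYTE, 0L);
>>                 }
>>             } catch (IllegalStateException e) {
>>                 System.out.println("reader stopped: " + e.getMessage());
>>             }
>>         });
>>         Thread.sleep(100);
>>         arena.close();   // handshakes the reader, then frees the memory
>>         reader.join();
>>     }
>> }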
>>
>> So, if shared arenas are closed very frequently, as seems to be the 
>> case in the problematic scenarios, we're likely overwhelming all 
>> other threads with handshakes and deoptimizations.
>>
>> Addressing JDK-8290892 would mean that we only need to deoptimize 
>> threads that are actually accessing the arena that is being closed. 
>> Another possible interim improvement could be to only deoptimize 
>> threads that are actually in the middle of accessing a memory 
>> segment, not just all threads. That's a relatively low-hanging fruit 
>> I noticed recently, but haven't had time to look into yet. I've 
>> filed: https://bugs.openjdk.org/browse/JDK-8335480
>>
>> This would of course still leave the handshakes, which also require a 
>> bunch of VM-internal synchronization between threads. Shared arenas 
>> are really only meant for long-lived use. So, either way, you 
>> may want to consider ways of grouping resources into the same shared 
>> arena, to reduce the total number of closures needed, or giving users 
>> the option of indicating that they only need to access a resource 
>> from a single thread, and then switching to a confined arena behind 
>> the scenes (which is much cheaper to close, as you've found).
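>>
>> As a trivial sketch of that last option (a hypothetical helper, not an 
>> existing API): callers declare single-threaded use and get the much 
>> cheaper confined arena behind the scenes.
>>
>> import java.lang.foreign.Arena;
>>
>> // Hypothetical: pick the arena kind based on the caller's declared usage.
>> final class ArenaChoice {
>>     static Arena arenaFor(boolean singleThreadedAccess) {
>>         return singleThreadedAccess ? Arena.ofConfined() : Arena.ofShared();
>>     }
>> }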
>>
>> Jorn
>>
>> [1]: https://bugs.openjdk.org/browse/JDK-8290892
>>
>> On 1-7-2024 13:52, Uwe Schindler wrote:
>>>
>>> Hi Panama people, hello Maurizio,
>>>
>>> sending this again to the mailing list; so far we only had a private 
>>> discussion with Maurizio. Maybe someone else has an idea or might 
>>> figure out what the problem is. We are not yet ready to open an issue 
>>> against Java 19 through 22.
>>>
>>> There were several issues reported by users of Apache Lucene (a 
>>> wrongly-written benchmark and also Solr users) about bad performance 
>>> in highly concurrent environments. What was found out is 
>>> that when you have many threads closing shared arenas, under some 
>>> circumstances this causes all "reader threads" (those accessing a 
>>> MemorySegment, no matter which arena they use) to suddenly deoptimize. 
>>> This causes immense slowdowns during Lucene searches.
>>>
>>> Luckily one of our committers found a workaround, and we are 
>>> working on a benchmark showing the issue. But first let 
>>> me explain what happens:
>>>
>>>   * Lucene opens MemorySegments with a shared arena (one per file)
>>>     and accesses them from multiple threads. Basically, for each index
>>>     file we have a shared arena which is closed when the file is
>>>     closed. (A simplified sketch of this follows after the list.)
>>>   * There are many shared arenas (one per index file)!!!
>>>   * If you close a shared arena, normally you see no large delay on
>>>     the thread calling the close and also no real effect on any
>>>     thread that reads from other MemorySegments.
>>>   * But under certain circumstances ALL reading threads accessing
>>>     any MemorySegment slow down dramatically! So once you close one
>>>     of our arenas, all other MemorySegments using a different shared
>>>     arena suddenly get deoptimized (we have Hotspot logs showing
>>>     this). Of course, the MemorySegment belonging to the closed
>>>     arena is no longer used, and in reality it should be the only
>>>     affected one (throwing an IllegalStateException).
>>>   * The problem seems to occur mainly when multiple arenas are
>>>     closed in highly concurrent environments. This is why we did not
>>>     see the issue before.
>>>   * If we put a global lock around all calls to Arena#close(), the
>>>     issues seem to go away.
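>>>
>>> For reference, a simplified sketch of this per-file setup (not actual 
>>> Lucene code, just the shape of it):
>>>
>>> import java.io.IOException;
>>> import java.lang.foreign.Arena;
>>> import java.lang.foreign.MemorySegment;
>>> import java.nio.channels.FileChannel;
>>> import java.nio.file.Path;
>>> import java.nio.file.StandardOpenOption;
>>>
>>> // Each index file gets its own shared arena; closing the file closes the
>>> // arena, which is what triggers the handshake/deoptimization.
>>> final class MappedIndexFile implements AutoCloseable {
>>>     private final Arena arena = Arena.ofShared();
>>>     final MemorySegment segment;
>>>
>>>     MappedIndexFile(Path file) throws IOException {
>>>         try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
>>>             segment = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
>>>         }
>>>     }
>>>
>>>     @Override
>>>     public void close() {
>>>         arena.close();   // one close per file -> many handshakes overall
>>>     }
>>> }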
>>>
>>> I plan to write a benchmark showing this issue. Do you have an 
>>> idea what could go wrong? To me it looks like a race in the 
>>> thread-local handshakes which may cause some crazy Hotspot behaviour, 
>>> resulting in deoptimization of all threads concurrently accessing 
>>> MemorySegments once Arena#close() is called in highly concurrent 
>>> environments.
>>>
>>> This is the main issue where the observation is tracked: 
>>> https://github.com/apache/lucene/issues/13325
>>>
>>> These are the issues that were opened:
>>>
>>>   * https://github.com/dacapobench/dacapobench/issues/264 (the issue
>>>     in this benchmark was of course that they were opening/closing
>>>     too often, but it actually exposed the problem, so it was
>>>     very helpful). Funny detail: Alexey Shipilev opened the issue!
>>>   * This was the comment showing the issue at huge installations of
>>>     Apache Solr 9.7:
>>>     https://github.com/apache/lucene/pull/13146#pullrequestreview-2089347714
>>>     (David Smiley also talked to me at berlinbuzzwords). They had to
>>>     disable MemorySegment usage in Apache Solr/Lucene in their
>>>     environment. They have machines with thousands of indexes (and
>>>     therefore tens of thousands of arenas) open at the same time, and
>>>     the close rate is very, very high!
>>>
>>> Uwe
>>>
>>> -- 
>>> Uwe Schindler
>>> uschindler at apache.org  
>>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
>>> Bremen, Germany
>>> https://lucene.apache.org/
>>> https://solr.apache.org/
> -- 
> Uwe Schindler
> uschindler at apache.org  
> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
> Bremen, Germany
> https://lucene.apache.org/
> https://solr.apache.org/