[External] : Re: Strange problem with deoptimization on highly concurrent thread-local handshake reported to Apache Lucene
Jorn Vernee
jorn.vernee at oracle.com
Tue Jul 2 12:19:12 UTC 2024
> I'd really like to help in solving them, maybe by benchmarking changes.
If you could come up with a benchmark (preferably one that doesn't
require any dependencies), that would be very useful.
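Something along these lines might be a starting point: plain Java, no
dependencies, reader threads hammering a long-lived segment while one
thread opens and closes unrelated shared arenas at a high rate. All
class names and numbers below are placeholders I made up, just a
sketch:

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.util.concurrent.atomic.LongAdder;

    public class SharedArenaCloseBench {
        static final LongAdder READS = new LongAdder();
        static volatile boolean done = false;
        static volatile long sink; // keeps the read loop from being eliminated

        public static void main(String[] args) throws Exception {
            Arena readerArena = Arena.ofShared();   // long-lived, never closed here
            MemorySegment segment = readerArena.allocate(1024);

            int readers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
            for (int i = 0; i < readers; i++) {
                Thread.ofPlatform().start(() -> {
                    long sum = 0;
                    while (!done) {
                        for (long off = 0; off < 1024; off += 8) {
                            sum += segment.get(ValueLayout.JAVA_LONG, off);
                        }
                        READS.increment();
                    }
                    sink = sum;
                });
            }

            // Closer thread (here: main): open/close short-lived shared arenas,
            // mimicking Lucene closing one shared arena per index file.
            long end = System.nanoTime() + 10_000_000_000L; // run for ~10 seconds
            while (System.nanoTime() < end) {
                try (Arena shortLived = Arena.ofShared()) {
                    shortLived.allocate(128);
                } // close() here triggers a handshake with all threads
            }
            done = true;
            System.out.println("reader iterations: " + READS.sum());
        }
    }

Comparing the reader-iteration count with and without the short-lived
arena churn (or with a global close lock) should show the effect.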
> Is it correct that once the handshake is done, it enables the
optimized top frame again, so the deoptimization does not actually
force a reanalysis of the code (so it's revertible)?
Yes, the deoptimization that happens doesn't throw away the compiled code.
> Last question: Do you have an idea why putting a global lock around
Arena#close() helps for throughput?
I think this may simply be because it slows down the rate of
handshakes/deoptimizations, allowing other threads to get more work
done in the meantime. But it also seems like something worth
investigating.
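For reference, a minimal sketch of that workaround (the wrapper and
lock names are placeholders, not Lucene's actual code):

    import java.lang.foreign.Arena;

    final class SerializedClose {
        // Hypothetical global lock: all closes funnel through it, which
        // caps the rate of close-triggered handshakes across the process.
        private static final Object ARENA_CLOSE_LOCK = new Object();

        static void close(Arena arena) {
            synchronized (ARENA_CLOSE_LOCK) {
                arena.close(); // still handshakes, but never concurrently
            }
        }
    }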
Jorn
On 2-7-2024 11:48, Uwe Schindler wrote:
>
> Hi Jorn,
>
> many thanks for the analysis! This really looks like the cause of some
> of the issues, and it's great that you have ideas for improvement. I'd
> really like to help in solving them, maybe by benchmarking changes.
>
> I am aware that the current code in Lucene is sub-optimal, but as this
> is mainly about the transition to a new major version of Lucene, we
> may rethink our I/O layer a bit.
>
> Actually, your suggestions already give us options to implement:
>
> * Some files are read using some IOContext instance named READ_ONCE
>   (it's basically an enum). These are mostly metadata index files,
>   which are loaded and decoded to heap. They are normally closed soon
>   after. We have two options to handle those: (a) use conventional I/O
>   for those (but this does not allow us to use madvise - and fadvise
>   is not available for FileChannel:
>   https://bugs.openjdk.org/browse/JDK-8329256) or (b) use a confined
>   arena for Lucene's READ_ONCE IOContext. This would dramatically
>   reduce the number of short-lived arenas.
> * Grouping several files into the same Arena is also a good idea, but
>   very hard to implement. An idea that came to my mind last night
>   would be to keep all files with the same segment number in the same
>   Arena. This could be implemented by refcounting (keep a refcount per
>   filename prefix): whenever a file is opened, increment the counter,
>   and release the arena when it goes back to 0. This would be
>   backwards compatible and would leave the grouping of Arenas to the
>   I/O layer of Lucene, which unfortunately needs to have knowledge
>   about the file naming policy. I think we can live with that! (A
>   rough sketch of this idea follows below.)
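>
> Here is a rough sketch of that refcounting idea (class and method
> names are placeholders I made up, not actual Lucene code):
>
>     import java.lang.foreign.Arena;
>     import java.util.concurrent.ConcurrentHashMap;
>     import java.util.concurrent.atomic.AtomicInteger;
>
>     final class ArenaGroups {
>         // One shared Arena per segment-name prefix, plus a refcount
>         // of the files currently open against it.
>         private record Group(Arena arena, AtomicInteger refCount) {}
>
>         private final ConcurrentHashMap<String, Group> groups =
>             new ConcurrentHashMap<>();
>
>         Arena acquire(String segmentPrefix) {
>             return groups.compute(segmentPrefix, (k, v) -> {
>                 if (v == null) {
>                     v = new Group(Arena.ofShared(), new AtomicInteger());
>                 }
>                 v.refCount().incrementAndGet(); // one more open file
>                 return v;
>             }).arena();
>         }
>
>         void release(String segmentPrefix) {
>             groups.computeIfPresent(segmentPrefix, (k, v) -> {
>                 if (v.refCount().decrementAndGet() == 0) {
>                     v.arena().close(); // last file of this segment closed
>                     return null;       // drop the mapping
>                 }
>                 return v;
>             });
>         }
>     }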
>
> So I will work on this!
>
> The small fixes you could add would be great, as they should hopefully
> increase the throughput. I was not aware of how the deoptimization
> works, thanks for the explanation. But still, I see a problem when the
> top frame already has a large amount of inlined code. Is it correct
> that once the handshake is done, it enables the optimized top frame
> again, so the deoptimization does not actually force a reanalysis of
> the code (so it's revertible)? If that's the case it would be fine.
> Otherwise it's a lot of work to recover from the deopt.
>
> It would really be good to only deoptimize threads that use
> MemorySegment, although this would still cause slowdowns for running
> searches, because most of the Lucene query execution code works on
> MemorySegments or at least involves calls to them.
>
> Normal non-metadata index files are long-lived and are accessed by
> many threads, so basically this is all fine. This works well unless
> you have some huge "Apache Solr" instances with many, many indexes
> (this was the issue of David Smiley). Although closing index files is
> seldom for any given index, if you have many of them it adds up to
> quite some load, especially if the closes are distributed over
> several threads (which is what Apache Solr is doing). One improvement
> here would be to do things like reloading/refreshing indexes in one
> single thread for a whole Solr instance. I think Elasticsearch is
> doing this and therefore they don't see that issue. I have to confirm
> this with them.
>
> Last question: Do you have an idea why putting a global lock around
> Arena#close() helps for throughput?
>
> Uwe
>
> On 01.07.2024 at 23:29, Jorn Vernee wrote:
>>
>> Hello Uwe,
>>
>> I've read the various github threads, and I think I have a good idea
>> of what's going on. Just to recap how shared arena closure works:
>>
>> - We set the arena's isAlive = false (more or less)
>> - We submit a handshake from the closing thread to all other threads
>> - During that handshake, we check whether each thread is accessing
>> the arena we are trying to close.
>> - Unfortunately, this requires deoptimizing the top-most frame of a
>> thread, due to JDK-8290892 [1]. But note that this is a 'one off'
>> unpacking of the top-most frame, after which the method finishes
>> running in the interpreter and then goes back to executing compiled
>> code.
>> - If we find a thread accessing the same arena, we make it throw an
>> exception, to bail out of the access.
>> - After the handshake finishes, we know that either: 1) no other
>> thread was accessing the arena, and they will see the up-to-date
>> isAlive = false because the handshake also works as a means of
>> synchronization. Or 2) if they were accessing the arena, they will
>> get an exception.
>> - After that it's safe to free the memory.
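>>
>> To make that contract concrete, here is a tiny demo (made-up class
>> name): a racing reader either completes its read or gets an
>> IllegalStateException, but never observes freed memory:
>>
>>     import java.lang.foreign.Arena;
>>     import java.lang.foreign.MemorySegment;
>>     import java.lang.foreign.ValueLayout;
>>
>>     public class SharedCloseDemo {
>>         public static void main(String[] args) throws Exception {
>>             Arena arena = Arena.ofShared();
>>             MemorySegment seg = arena.allocate(8);
>>
>>             Thread reader = Thread.ofPlatform().start(() -> {
>>                 try {
>>                     while (true) {
>>                         seg.get(ValueLayout.JAVA_LONG, 0); // racing access
>>                     }
>>                 } catch (IllegalStateException e) {
>>                     // injected during the close handshake
>>                     System.out.println("bailed out: " + e.getMessage());
>>                 }
>>             });
>>
>>             Thread.sleep(100);
>>             arena.close(); // handshake, then the memory is freed
>>             reader.join();
>>         }
>>     }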
>>
>> So, if shared arenas are closed very frequently, as seems to be the
>> case in the problematic scenarios, we're likely overwhelming all
>> other threads with handshakes and deoptimizations.
>>
>> Addressing JDK-8290892 would mean that we only need to deoptimize
>> threads that are actually accessing the arena that is being closed.
>> Another possible interim improvement could be to only deoptimize
>> threads that are actually in the middle of accessing a memory
>> segment, not just all threads. That's a relatively low-hanging fruit
>> I noticed recently, but haven't had time to look into yet. I've
>> filed: https://bugs.openjdk.org/browse/JDK-8335480
>>
>> This would of course still leave the handshakes, which also require a
>> bunch of VM-internal synchronization between threads. Shared arenas
>> are really only meant for long-lived resources. So, either way, you
>> may want to consider ways of grouping resources into the same shared
>> arena, to reduce the total number of closures needed, or giving users
>> the option of indicating that they only need to access a resource
>> from a single thread, and then switching to a confined arena behind
>> the scenes (which is much cheaper to close, as you've found).
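>>
>> A minimal sketch of that last option (hypothetical factory method,
>> assuming the caller can hint single-threaded use):
>>
>>     import java.lang.foreign.Arena;
>>
>>     final class Arenas {
>>         // Confined arenas are much cheaper to close: no handshake
>>         // with other threads is needed.
>>         static Arena arenaFor(boolean singleThreaded) {
>>             return singleThreaded ? Arena.ofConfined() : Arena.ofShared();
>>         }
>>     }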
>>
>> Jorn
>>
>> [1]: https://bugs.openjdk.org/browse/JDK-8290892
>>
>> On 1-7-2024 13:52, Uwe Schindler wrote:
>>>
>>> Hi Panama people, hello Maurizio,
>>>
>>> sending this again to the mailing list; so far we only had a private
>>> discussion with Maurizio. Maybe somebody else has an idea or might
>>> figure out what the problem is. We are not yet ready to open an
>>> issue against Java 19 through 22.
>>>
>>> Several issues were reported by users of Apache Lucene (a
>>> wrongly-written benchmark, and also Solr users) about bad
>>> performance in highly concurrent environments. What was found out is
>>> that when many threads close shared arenas, under some circumstances
>>> all "reader threads" (those accessing a MemorySegment, no matter
>>> which arena they use) suddenly deoptimize. This causes immense
>>> slowdowns during Lucene searches.
>>>
>>> Luckily one of our committers found a workaround, and we are working
>>> on a benchmark showing the issue. But first let me explain what
>>> happens:
>>>
>>> * Lucene opens MemorySegments with shared Arenas (one per file) and
>>>   accesses them from multiple threads. Basically, for each index
>>>   file we have a shared arena, which is closed when the file is
>>>   closed.
>>> * There are many shared arenas (one per index file)!!!
>>> * If you close a shared arena, you normally see no large delay on
>>>   the thread calling close() and also no real effect on any thread
>>>   that reads from other MemorySegments.
>>> * But under certain circumstances ALL reading threads accessing any
>>>   MemorySegment slow down dramatically! Once you close one of our
>>>   arenas, all other MemorySegments using a different shared arena
>>>   suddenly get deoptimized (we have HotSpot logs showing this). Of
>>>   course, the MemorySegment belonging to the closed arena is no
>>>   longer used, and in reality it should be the only one affected
>>>   (throwing an IllegalStateException).
>>> * The problem seems to occur mainly when multiple arenas are closed
>>>   in highly concurrent environments. This is why we did not see the
>>>   issue before.
>>> * If we put a global lock around all calls to Arena#close(), the
>>>   issues seem to go away.
>>>
>>> I plan to write a benchmark showing this issue. Do you have an idea
>>> of what could go wrong? To me it looks like a race in the
>>> thread-local handshakes which may cause some crazy HotSpot
>>> behaviour, deoptimizing all threads concurrently accessing
>>> MemorySegments once Arena#close() is called in highly concurrent
>>> environments.
>>>
>>> This is the main issue where the observation is tracked:
>>> https://github.com/apache/lucene/issues/13325
>>>
>>> These are issues opened:
>>>
>>> * https://github.com/dacapobench/dacapobench/issues/264 (the issue
>>>   with this benchmark was of course that it was opening/closing
>>>   arenas too often, but it exposed the problem nicely, so it was
>>>   very helpful). Funny detail: Alexey Shipilev opened the issue!
>>> * This was the comment showing the issue at huge installations of
>>>   Apache Solr 9.7:
>>>   https://github.com/apache/lucene/pull/13146#pullrequestreview-2089347714
>>>   (David Smiley also talked to me at Berlin Buzzwords). They had to
>>>   disable MemorySegment usage in Apache Solr/Lucene in their
>>>   environment. They have machines with thousands of indexes (and
>>>   therefore tens of thousands of arenas) open at the same time, and
>>>   the close rate is very, very high!
>>>
>>> Uwe
>>>
>>> --
>>> Uwe Schindler
>>> uschindler at apache.org
>>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
>>> Bremen, Germany
>>> https://lucene.apache.org/
>>> https://solr.apache.org/
> --
> Uwe Schindler
> uschindler at apache.org
> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
> Bremen, Germany
> https://lucene.apache.org/
> https://solr.apache.org/