Strange problem with deoptimization on highly concurrent thread-local handshake reported to Apache Lucene
Uwe Schindler
uschindler at apache.org
Tue Jul 2 09:48:05 UTC 2024
Hi Jorn,
many thanks for the analysis! This really looks like the cause of some of
the issues, and it's great that you have ideas for improvement. I'd really
like to help in solving them, maybe by benchmarking changes.
I am aware that the current code in Lucene is sub-optimal, but as we are
mainly in transition to a new major version of Lucene, we may rethink
our I/O layer a bit.
Actually, your suggestions already give us options to implement:
* Some files are read using an IOContext instance named READ_ONCE
  (it's basically an enum). These are mostly metadata index files which
  are loaded and decoded onto the heap and are normally closed soon
  afterwards. We have two options to handle those: (a) use conventional
  I/O for them (but this does not allow us to use madvise - and fadvise
  is not available for FileChannel:
  https://bugs.openjdk.org/browse/JDK-8329256), or (b) use a confined
  arena for Lucene's READ_ONCE IOContext (see the first sketch after
  this list). This would dramatically reduce the number of short-lived
  arenas.
* Grouping several files into the same Arena is also a good idea, but
  very hard to implement. An idea that came to my mind last night would
  be to keep all files with the same segment number in the same Arena.
  This could be implemented by refcounting (keep a refcount per file
  name prefix): whenever a file is opened, increment the counter, and
  release the arena when the counter goes back to 0 (see the second
  sketch after this list). This would be backwards compatible and would
  leave the grouping of Arenas to the I/O layer of Lucene, which
  unfortunately needs to have knowledge about the file naming policy. I
  think we can live with that! So I will work on this!
The small fixes you could add would be great, as this should hopefully
increase the throughput. I was not aware of how the deoptimization works,
thanks for the explanation. But still, I see a problem when the top
frame already has a large amount of inlined code. Is it correct that
once the handshake is done, the optimized top frame is enabled again, so
the deoptimization does not actually require a reanalysis of the code
(i.e., it is revertible)? If that's the case it would be fine. Otherwise
it's a lot of work to recover from the deopt.
It would really be good to only deoptimize threads that use
MemorySegment, although this would still cause slowdowns for running
searches, because most of the Lucene query execution code works on
MemorySegments, or at least involves calls to them.
Normal (non-metadata) index files are usually long-lived and they are
accessed by many threads, so basically this is all fine. It works well
unless you have some huge "Apache Solr" instances with many, many
indexes (this was the issue of David Smiley). Although closing index
files is seldom if you look at a single index, with many of them it
adds up to some load, especially if the closes are distributed over
several threads (this is what Apache Solr is doing). One improvement
here would be to do things like reloading/refreshing indexes in one
single thread for a whole Solr instance (see the sketch below). I think
Elasticsearch is doing this and therefore they don't see the issue; I
have to confirm this with them.
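As an illustration of that last point, a minimal sketch (the class name and
the wiring into Solr are hypothetical): funnel all reopen/close work through
one single-threaded executor per instance, so that Arena#close() is never
called from many threads at once.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch only: a single refresher thread per Solr/Lucene instance.
    final class IndexRefreshService implements AutoCloseable {
        private final ExecutorService refresher =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "index-refresher"));

        /** All reopen/close work (and thus all Arena#close() calls) runs here, serially. */
        Future<?> submit(Runnable refreshOrCloseTask) {
            return refresher.submit(refreshOrCloseTask);
        }

        @Override
        public void close() {
            refresher.shutdown();
        }
    }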
Last question: do you have an idea why putting a global lock around
Arena#close() helps throughput?
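For reference, this is roughly what that workaround looks like on our side
(a minimal sketch; the lock object and helper class are ours, not a JDK or
Lucene API):

    import java.lang.foreign.Arena;

    // Sketch only: serialize all shared-arena closes behind one global lock,
    // so at most one close handshake is in flight at any time.
    final class ArenaCloser {
        private static final Object CLOSE_LOCK = new Object();

        static void close(Arena arena) {
            synchronized (CLOSE_LOCK) {
                arena.close();
            }
        }
    }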
Uwe
On 01.07.2024 at 23:29, Jorn Vernee wrote:
>
> Hello Uwe,
>
> I've read the various github threads, and I think I have a good idea
> of what's going on. Just to recap how shared arena closure works:
>
> - We set the arena's isAlive = false (more or less)
> - We submit a handshake from the closing thread to all other threads
> - During that handshake, we check whether each thread is accessing the
> arena we are trying to close.
> - Unfortunately, this requires deoptimizing the top-most frame of a
> thread, due to: JDK-8290892 [1]. But note that this is a 'one off'
> unpacking of the top-most frame. After which the method finishes
> running in the interpreter, and then goes back to executing compiled code.
> - If we find a thread accessing the same arena, we make it throw an
> exception, to bail out of the access.
> - After the handshake finishes, we know that either: 1) no other
> thread was accessing the arena, and they will see the up-to-date
> isAlive = false because the handshake also works as a means of
> synchronization. Or 2) if they were accessing the arena, they will get
> an exception.
> - After that it's safe to free the memory.
>
> So, if shared arenas are closed very frequently, as seems to be the
> case in the problematic scenarios, we're likely overwhelming all other
> threads with handshakes and deoptimizations.
>
> Addressing JDK-8290892 would mean that we only need to deoptimize
> threads that are actually accessing the arena that is being closed.
> Another possible interim improvement could be to only deoptimize
> threads that are actually in the middle of accessing a memory segment,
> not just all threads. That's a relatively low-hanging fruit I noticed
> recently, but haven't had time to look into yet. I've filed:
> https://bugs.openjdk.org/browse/JDK-8335480
>
> This would of course still leave the handshakes, which also require a
> bunch of VM-internal synchronization between threads. Shared arenas
> are really only meant for long-lived lifetimes. So, either way, you
> may want to consider ways of grouping resources into the same shared
> arena, to reduce the total number of closures needed, or giving users
> the option of indicating that they only need to access a resource from
> a single thread, and then switching to a confined arena behind the
> scenes (which is much cheaper to close, as you've found).
>
> Jorn
>
> [1]: https://bugs.openjdk.org/browse/JDK-8290892
>
> On 1-7-2024 13:52, Uwe Schindler wrote:
>>
>> Hi Panama people, hello Maurizio,
>>
>> sending this again to the mailing list; so far we had just a private
>> discussion with Maurizio. Maybe anyone else has an idea or might
>> figure out what the problem is. We are not yet ready to open an issue
>> against Java 19 through 22.
>>
>> There were several issues reported by users of Apache Lucene (a
>> wrongly-written benchmark and also Solr users) about bad performance
>> in highly concurrent environments. What was actually found out is
>> that when you have many threads closing shared arenas, under some
>> circumstances all "reader threads" (those accessing a MemorySegment,
>> no matter which arena they use) suddenly deoptimize. This causes
>> immense slowdowns during Lucene searches.
>>
>> Luckily one of our committers found a workaround, and we are
>> investigating how to write a benchmark showing the issue. But first
>> let me explain what happens:
>>
>> * Lucene opens MemorySegments with a shared Arena (one per file)
>>   and accesses them from multiple threads. Basically, for each index
>>   file we have a shared arena which is closed when the file is closed.
>> * There are many shared arenas (one per index file)!!!
>> * If you close a shared arena, normally you see no large delay on
>>   the thread calling close() and also no real effect on any thread
>>   that reads from other MemorySegments.
>> * But under certain circumstances ALL reading threads accessing any
>>   MemorySegment slow down dramatically! So once you close one of
>>   our arenas, all other MemorySegments using a different shared
>>   arena suddenly get deoptimized (we have Hotspot logs showing
>>   this). Of course, the MemorySegment belonging to the closed arena
>>   is no longer used, and in reality it should be the only affected
>>   one (throwing an IllegalStateException).
>> * The problem seems to occur mainly when multiple arenas are closed
>>   in highly concurrent environments. This is why we did not see the
>>   issue before.
>> * If we put a global lock around all calls to Arena#close(), the
>>   issues seem to go away.
>>
>> I plan to write a benchmark showing this issue. Do you have an
>> idea what could go wrong? To me it looks like a race in the
>> thread-local handshakes which may cause some crazy Hotspot behaviour,
>> causing deoptimization of all threads concurrently accessing
>> MemorySegments once Arena#close() is called in highly concurrent
>> environments.
>>
>> This is the main issue where the observation is tracked:
>> https://github.com/apache/lucene/issues/13325
>>
>> These are the issues that were opened:
>>
>> * https://github.com/dacapobench/dacapobench/issues/264 (the issue
>>   with this benchmark was of course that they were opening/closing
>>   arenas too often, but it actually just exposed the problem, so it
>>   was very helpful). Funny detail: Alexey Shipilev opened the issue!
>> * This was the comment showing the issue at huge installations of
>>   Apache Solr 9.7:
>>   https://github.com/apache/lucene/pull/13146#pullrequestreview-2089347714
>>   (David Smiley also talked to me at Berlin Buzzwords). They had to
>>   disable MemorySegment usage in Apache Solr/Lucene in their
>>   environment. They have machines with thousands of indexes (and
>>   therefore tens of thousands of arenas) open at the same time, and
>>   the close rate is very, very high!
>>
>> Uwe
>>
>> --
>> Uwe Schindler
>> uschindler at apache.org
>> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
>> Bremen, Germany
>> https://lucene.apache.org/
>> https://solr.apache.org/
--
Uwe Schindler
uschindler at apache.org
ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
Bremen, Germany
https://lucene.apache.org/
https://solr.apache.org/