Unsafe vs MemorySegments / Bounds checking...

Tue Oct 29 18:35:42 UTC 2024

On 2024-10-29 10:44 AM, Maurizio Cimadamore wrote:
> Unfortunately we have never been able to come up with a reproducer for 
> the slow down you are experiencing.
> 
> If/when you have a standalone benchmark which shows the issue we will 
> obviously take a look at it.
> 

If you recall, I did send you a reproducer, and you did verify the 
regression. This is what led you to come up with a strategy to define a 
derived VarHandle. This helped somewhat, but you observed the inliner 
giving up when the code was embedded in a very large method. JDK 23 
appears to have introduced a regression that has made this worse.

Why is the inliner giving up? Is it because the VarHandle transform
introduced too many layers of indirection? What if the equivalent
VarHandle could be produced (or be built in) which skipped a few steps?
Would this allow the inlining to proceed as expected? Is there a
better strategy? As it stands, there's no official path forward for
replacing the Unsafe API in a way that retains its efficiency.
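
For readers following along, here is a minimal sketch of what a "derived" VarHandle for raw memory access can look like with the final FFM API (JDK 22 or later). It is an assumption that this resembles the derived handle discussed earlier; the class and method names (RawAddressAccess, roundTrip) are illustrative only. A segment covering the whole address space is bound into a layout VarHandle, so the only remaining coordinate is a raw address, roughly the shape of Unsafe.getLong(long)/putLong(long, long).

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class RawAddressAccess {
    // A segment spanning the whole address space. reinterpret() is a restricted
    // method; run with --enable-native-access=ALL-UNNAMED to avoid the warning.
    static final MemorySegment EVERYTHING =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    // JAVA_LONG.varHandle() has coordinates (MemorySegment, long offset).
    // Binding the segment at position 0 leaves a handle whose only coordinate
    // is the offset, which for a NULL-based segment is the raw address.
    static final VarHandle RAW_LONG = MethodHandles.insertCoordinates(
            ValueLayout.JAVA_LONG.varHandle(), 0, EVERYTHING);

    // Writes a long at (address + 8) of a fresh native buffer and reads it back.
    static long roundTrip(long value) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buf = arena.allocate(16, 8); // 16 bytes, 8-byte aligned
            long address = buf.address();
            RAW_LONG.set(address + 8, value);          // like Unsafe.putLong(addr, v)
            return (long) RAW_LONG.get(address + 8);   // like Unsafe.getLong(addr)
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L)); // prints 42
    }
}
```

Each access still goes through the VarHandle machinery (the inserted coordinate plus the segment's bounds and alignment checks), which is presumably the chain the inliner has to collapse before this matches plain Unsafe code.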

> On 29/10/2024 15:39, Brian S O'Neill wrote:
>>
>> It looks like (again) the HotSpot inliner isn't doing enough to 
>> transform the code into the plain internal unsafe code. At the very 
>> least, I think there should be a convenience API which doesn't require
>> me to apply a special transform step. The implementation could then
>> employ the magic "force inline" annotations to ensure that there are
>> no performance regressions.
> 
> Well, if it was that easy :-)
> 
> 99.99% of the work associated with such issues is to understand _where_ 
> adding ForceInline might be beneficial. Just blanket-adding that 
> everywhere will likely make your code slower, not faster.

In this particular case, I'd like the magic transformed VarHandle to
somehow be collapsed into its simpler form such that HotSpot doesn't
see it as being too big. I don't care whether the implementation relies
on (ab)using the inliner or not. I think there needs to be a much simpler
and more direct way to "just access memory".

> 
> 
> In general 99% of the cost associated with bounds checks can be disabled
> by using segments whose length is Long.MAX_VALUE. But, as we have
> learned when looking at some of your examples, this doesn't fully
> eliminate all the costs because there's still a sign check involved:
> e.g. the offset into the memory segment must be >= 0 - and there's not
> much C2 can do to eliminate that at the moment.
> 
> Of course, when a segment is accessed in a loop, none of this matters - 
> checks will be hoisted out of the loops, and any added cost will be 
> amortized.
> 
> But if code accesses a memory segment (or a byte buffer, or...) in a 
> "random" fashion, then some of these additional costs might show up (and 
> some are, I think, unavoidable).

I don't think the bounds check is causing much of an issue. When I add 
explicit bounds checks of my own I don't see any issue either. I think 
it really is just a case of a long chain of transforms not being
collapsed down into a much simpler operation.
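
To make the two access patterns above concrete, here is a small sketch (JDK 22+; class and method names are illustrative). The loop is the sequential case where, per the discussion, C2 can hoist the checks; the single-offset read is the "random" case. Even against a segment of length Long.MAX_VALUE, a negative offset is still rejected, which is the sign check that remains on every such access.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentAccessPatterns {
    // Sequential access: per-iteration bounds checks in a loop like this are
    // the kind C2 can typically hoist, so their cost is amortized.
    static long sumInts(MemorySegment seg) {
        long sum = 0;
        long count = seg.byteSize() / ValueLayout.JAVA_INT.byteSize();
        for (long i = 0; i < count; i++) {
            sum += seg.getAtIndex(ValueLayout.JAVA_INT, i);
        }
        return sum;
    }

    // "Random" access at an arbitrary byte offset; each call is checked
    // against the segment bounds, including offset >= 0.
    static int readAt(MemorySegment seg, long offset) {
        return seg.get(ValueLayout.JAVA_INT, offset);
    }

    // Fills a 4-int segment with 1, 2, 3, 4; returns {sum, int at byte offset 8}.
    static long[] demo() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16, 4);
            for (long i = 0; i < 4; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, (int) (i + 1));
            }
            return new long[] { sumInts(seg), readAt(seg, 8) };
        }
    }

    // With a Long.MAX_VALUE-length segment the upper bound is practically never
    // hit, but a negative offset still fails: the sign check remains.
    // reinterpret() is restricted (see --enable-native-access).
    static boolean negativeOffsetRejected() {
        MemorySegment everything = MemorySegment.NULL.reinterpret(Long.MAX_VALUE);
        try {
            readAt(everything, -4);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println(r[0] + " " + r[1]);          // prints "10 3"
        System.out.println(negativeOffsetRejected());   // prints "true"
    }
}
```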

> 
> Popping back up 100 levels: your message seems to imply that a 
> requirement for deprecating unsafe is that we should have a replacement 
> API which offers 100% of Unsafe performance _and_ it is safe. I think 
> this angle is rather unworkable. That said, I don't want to fully close
> the door to investigating whether there's a better "escape hatch" we can
> express within FFM (e.g. using restricted methods) to support corner
> cases where existing optimizations might not work too well. But to do
> that, we need some benchmark to look at (preferably one that doesn't
> pull in an entire project).

No, I don't want a replacement for Unsafe which is also safe. I want 
something which offers the same performance, with the same potential 
risks, and therefore should remain restricted. The simplest approach 
would be to keep the Unsafe API, or something like it. I understand why 
this is undesirable, and I'm not advocating for it.

I feel perfectly comfortable using a VarHandle instead, and I do like 
the fact that it offers more features than Unsafe with respect to 
concurrency and byte ordering. The alternative would be to bloat the 
Unsafe class with a gazillion permutations, which is of course a mess.
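
As a rough illustration of those extra features, here is a sketch (JDK 22+; names are illustrative) showing that a single layout-derived VarHandle covers explicit byte ordering and the concurrency access modes (release store, compare-and-set, atomic add, volatile read), where Unsafe needs a separate method for each combination.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleModes {
    // Big-endian int view, regardless of the platform's native byte order.
    static final VarHandle BE_INT =
            ValueLayout.JAVA_INT.withOrder(ByteOrder.BIG_ENDIAN).varHandle();

    // Aligned long view; coordinates are (MemorySegment, long offset).
    static final VarHandle LONG = ValueLayout.JAVA_LONG.varHandle();

    // Writes value big-endian and returns the byte stored at the lowest address.
    static byte firstByteBigEndian(int value) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(4, 4);
            BE_INT.set(seg, 0L, value);
            return seg.get(ValueLayout.JAVA_BYTE, 0L);
        }
    }

    // Release store, CAS, atomic add, then a volatile read, all on one handle.
    static long atomicOps() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(8, 8);              // 8-byte aligned
            LONG.setRelease(seg, 0L, 1L);
            boolean swapped = LONG.compareAndSet(seg, 0L, 1L, 5L); // 1 -> 5
            long previous = (long) LONG.getAndAdd(seg, 0L, 10L);   // 5 -> 15
            long current = (long) LONG.getVolatile(seg, 0L);
            return (swapped && previous == 5L) ? current : -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstByteBigEndian(0x01020304)); // prints 1
        System.out.println(atomicOps());                    // prints 15
    }
}
```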

I don't know what the "correct" FFM API should look like, but if it 
allows me to obtain a restricted VarHandle which "just accesses memory", 
then this has two benefits: 1) It makes it easier for applications to 
migrate off the Unsafe API, and 2) It makes it easier to optimize 
because the implementation can be more direct.

> 
> Maurizio
> 