Unsafe vs MemorySegments / Bounds checking...
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Oct 31 10:00:14 UTC 2024
On 31/10/2024 09:45, Mike Hearn wrote:
>
> Hence my suggestion to go back a little, and see what we can do to
> speed up access for a segment created with:
>
> MemorySegment.NULL.reinterpret(Long.MAX_VALUE)
>
> (which, as Ron correctly points out, might not mean /exactly as
> fast as Unsafe/)
>
> If a sign check is genuinely causing a meaningful slowdown, you could
> potentially re-spec such a NULL->MAX memory segment to not do it. In
> that case a negative number would be treated as unsigned.
> Alternatively, the sign bit could be masked out of the address which
> should be faster than a compare and branch. Given that such a memory
> segment is already requesting that safety checks be disabled, maybe
> the check for negative addresses isn't that important as there are
> already so many ways to segfault the VM with such a segment.
>
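Spelled out, the suggested pattern looks something like the sketch below
(just an illustration: the raw address is a placeholder for something
returned by malloc/mmap, and reinterpret is a restricted method, so
native access must be enabled, e.g. --enable-native-access=ALL-UNNAMED):

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// A minimal sketch of the "everything" segment: reinterpret NULL so it
// spans all addressable memory, then use it for raw access.
public class EverythingSegment {
    static final MemorySegment EVERYTHING =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    static int readIntAt(long rawAddress) {
        // Any offset in [0, Long.MAX_VALUE) passes the bounds check, so
        // this can read from any native address, much like
        // Unsafe.getInt(address) -- though, as discussed below, the
        // checks themselves are still executed.
        return EVERYTHING.get(ValueLayout.JAVA_INT, rawAddress);
    }
}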
That is a tempting path, and one we have considered in the past. The
drawback is that you would have obtained a new segment which doesn't
behave like other segments. For example, all the memory access
operations, bulk copy operations, and even slicing would need to specify
that some of the checks apply to all segments _but_ this weird one.
Heck, even the size of this segment would be negative, since a full
64-bit span doesn't fit in a signed long...
To be precise: the sign check itself is not causing the slowdown;
rather, the presence of a sign check is the reason bounds checks _in
loops_ cannot be completely eliminated, as C2 has to be mindful of
overflows. But if memory access does not follow a pattern (which seems
to be the case here), then bounds check elimination wouldn't kick in
anyway.
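To illustrate the two shapes (a sketch; the sizes are made up, the
access loops are the point): in the first loop below the offset is a
linear function of the induction variable, so C2 can in principle hoist
the check out of the loop -- provided it can rule out overflow of the
offset computation, which is where the sign check gets in the way. In
the second loop the offsets come from data, so there is no pattern for
bounds check elimination to exploit:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Random;

public class AccessPatterns {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(1024 * 4);

            // Linear access: offsets are i * 4, a simple function of
            // the induction variable; the bounds check is a candidate
            // for elimination (modulo overflow concerns).
            long sum = 0;
            for (int i = 0; i < 1024; i++) {
                sum += seg.get(ValueLayout.JAVA_INT, (long) i * 4);
            }

            // Random access: offsets come from data, not from the
            // induction variable; every access pays the full check.
            int[] indices = new Random(42).ints(1024, 0, 1024).toArray();
            long sum2 = 0;
            for (int idx : indices) {
                sum2 += seg.get(ValueLayout.JAVA_INT, (long) idx * 4);
            }
            System.out.println(sum + " " + sum2);
        }
    }
}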
I've re-read some exchanges I had with Roland on this last year. The
reason random access is slower has to do, fundamentally, with the fact
that FFM has to do more work than Unsafe -- there's no way around that.
It used to be the case that C2 would sometimes speculatively remove
bounds checks, causing regressions when doing so (because the loop
didn't run for long enough to amortize the cost). But this has since
been fixed:
https://bugs.openjdk.org/browse/JDK-8311932
The workaround I came up with in the past:
https://mail.openjdk.org/pipermail/panama-dev/2023-July/019478.html
worked because it effectively changed the shape of the code, causing C2
to back off and not introduce an optimization whose cost outweighed its
benefit. That should no longer be a problem today, as C2 should now
only optimize loops whose trip count exceeds a certain threshold.
Stepping back... there are two ways to approach this problem. One is to
add more heroics so that C2 can do more of what it's already doing.
That's what we tried in the past; it works, but only up to a point. A
more robust solution, IMHO, would be to find ways to reduce the
complexity of the implementation when accessing a segment whose span is
0..MAX_VALUE. Maybe we can't eliminate _all_ checks (e.g. alignment and
offset sign), but it seems to me that we can eliminate most of them.
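Schematically, the logical steps on each access look something like the
sketch below -- just an illustration, not the actual implementation.
For a global segment spanning 0..MAX_VALUE, the liveness and upper
bound steps become trivial, leaving only the offset sign and alignment
checks:

import java.lang.foreign.MemorySegment;

// Illustrative sketch of the logical checks on each segment access
// (not the real implementation). For a segment spanning
// 0..Long.MAX_VALUE: liveness is trivially true (the global scope
// never closes) and the upper bound check is nearly vacuous; what
// remains is the offset sign check and the alignment check.
final class AccessChecks {
    static void checkAccess(MemorySegment segment, long offset,
                            long accessSize, long alignmentMask) {
        // liveness: fails if the backing arena has been closed
        if (!segment.scope().isAlive()) {
            throw new IllegalStateException("already closed");
        }
        // offset sign + upper bound
        if (offset < 0 || offset > segment.byteSize() - accessSize) {
            throw new IndexOutOfBoundsException("out of bounds");
        }
        // alignment: the absolute address must be suitably aligned
        // for the access size
        if (((segment.address() + offset) & alignmentMask) != 0) {
            throw new IllegalArgumentException("misaligned access");
        }
    }
}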
Maurizio