Array addition and array sum Panama benchmarks

Thu Mar 21 10:41:52 UTC 2024

Thanks for the explanation.

Thinking a bit more, as Roland pointed out, I believe there are two 
issues here:

1. disjointness analysis doesn't work for off-heap memory, which is a known issue
2. even in cases where disjointness analysis works, we can't 
autovectorize because we read "longs" which are then turned into "double".

These seem to be two orthogonal issues. While I think it would be 
worthwhile to fix (1), as I have seen other cases where suboptimal code 
was generated because of it, I don't think fixing (1) alone gets us 
fully out of the woods.
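
To make (1) concrete, here is a minimal sketch of the kind of runtime disjointness check discussed further down the thread. It is not what C2 does today. The class and method names are hypothetical, a heap ByteBuffer stands in for off-heap memory, and the "vectorized" path is emulated by hand (load a chunk, then store a chunk), which is only equivalent to the scalar loop when the two ranges do not overlap.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: a ByteBuffer models raw memory, and
// byte offsets ia/oa model the two base addresses from the benchmark.
final class DisjointnessSketch {
    static final int LANES = 4; // illustrative "vector width", in doubles

    // Loop-invariant predicate: true if [ia, ia+bytes) and [oa, oa+bytes) do not overlap.
    static boolean disjoint(long ia, long oa, long bytes) {
        return ia + bytes <= oa || oa + bytes <= ia;
    }

    // Reference semantics: one element at a time, in order.
    static void scalarAdd(ByteBuffer mem, int ia, int oa, int n) {
        for (int i = 0; i < n; i++) {
            mem.putDouble(oa + 8 * i, mem.getDouble(ia + 8 * i) + mem.getDouble(oa + 8 * i));
        }
    }

    // Emulates a vectorized loop: all loads of a chunk happen before its stores.
    static void chunkedAdd(ByteBuffer mem, int ia, int oa, int n) {
        double[] tmp = new double[LANES];
        int i = 0;
        for (; i + LANES <= n; i += LANES) {
            for (int k = 0; k < LANES; k++) {
                int off = 8 * (i + k);
                tmp[k] = mem.getDouble(ia + off) + mem.getDouble(oa + off);
            }
            for (int k = 0; k < LANES; k++) {
                mem.putDouble(oa + 8 * (i + k), tmp[k]);
            }
        }
        scalarAdd(mem, ia + 8 * i, oa + 8 * i, n - i); // remaining tail elements
    }

    // Check once outside the loop, then pick the loop. Returns true if the fast path ran.
    static boolean add(ByteBuffer mem, int ia, int oa, int n) {
        if (disjoint(ia, oa, 8L * n)) {
            chunkedAdd(mem, ia, oa, n);
            return true;
        }
        scalarAdd(mem, ia, oa, n); // fallback keeps the original semantics
        return false;
    }
}
```

The predicate is evaluated once per loop entry, so its cost does not grow with the trip count.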

The fix I came up with yesterday seems a reasonable stop-gap solution 
for (2): if the memory var handle is fully aligned, and its endianness 
matches the platform endianness, then don't bother with the long -> double 
trip and just use Unsafe::getDouble. That said, this fix will only work 
under these conditions (aligned _plain_ access with right endianness). 
Anything else will fall back to the old pattern. This tweak shouldn't 
cost anything, as these conditions are invariants for a given var handle 
instance (whose final fields are trusted, as defined in 
"java.lang.invoke"), which is typically held in a static final field, so 
everything should be known to the JIT. If we want to address that at the 
vectorizer level, it will probably require deeper changes which treat 
the Unsafe.getLong + Long.longBitsToDouble as a single operation.
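
As a rough sketch of the shape of that stop-gap (not the actual JDK code): the names below are hypothetical, a ByteBuffer in native byte order stands in for Unsafe, and the `aligned` flag is assumed to be a per-handle invariant rather than something computed per access.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only: models the two access patterns with ByteBuffer.
// The buffer passed in is assumed to be in native byte order.
final class DoubleReadSketch {
    // Old pattern: read 64 raw bits, fix up endianness, then reinterpret as double.
    static double readViaLong(ByteBuffer nativeBuf, int offset, ByteOrder order) {
        long bits = nativeBuf.getLong(offset);
        if (order != ByteOrder.nativeOrder()) {
            bits = Long.reverseBytes(bits);
        }
        return Double.longBitsToDouble(bits);
    }

    // Stop-gap: for aligned access in native order, read the double directly;
    // anything else falls back to the old pattern.
    static double read(ByteBuffer nativeBuf, int offset, ByteOrder order, boolean aligned) {
        if (aligned && order == ByteOrder.nativeOrder()) {
            return nativeBuf.getDouble(offset);
        }
        return readViaLong(nativeBuf, offset, order);
    }
}
```

Both paths return the same value for ordinary doubles. The difference is only in the shape of the code the JIT sees in the loop body.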

Thoughts?

Maurizio

On 20/03/2024 19:59, John Rose wrote:
> This fits into a loop optimization technique that Roland worked on, 
> called loop predication. You speculate that the loop invariant inputs 
> are somehow favorable and test. If all is well you run the loop 
> transformed to exploit the speculated favorable condition. In this 
> case it is disjointness of reads and writes. If the speculation fails 
> you might recompile, or use a defined fallback loop.
>
> If the loop is worth vectorizing the cost of checking disjointness is 
> comparatively small, commensurate with other predicates we now use.
>
> Certainly a disjointness test is cheaper than the very subtle range 
> analysis we do routinely for range check elimination, of any loop 
> containing array accesses linear in the loop trip count.
>
> So, it’s merely one of those “small matters of programming”.
>
>> On Mar 20, 2024, at 11:51 AM, Maurizio Cimadamore 
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> 
>>
>> Thanks for the analysis Roland. If I understand correctly, even if 
>> this analysis is not supported today, in principle it could be done, 
>> right?
>>
>> After all, we know ia/oa (e.g. their addresses) and the ranges we're 
>> going to access these at (otherwise we could not hoist bound checks 
>> outside the loop).
>>
>> So, we might be able, in principle, to check that - perhaps the issue 
>> is that (as in this case) all the addresses are completely dynamic, 
>> so you really need a disjointness runtime check (outside the loop), 
>> which might be more expensive than the incremental benefit added by 
>> vectorization?
>>
>> Maurizio
>>
>>
>> On 20/03/2024 16:26, Roland Westrelin wrote:
>>> is that the compiler can't prove it's legal to vectorize. Doubles are
>>> read from ia and oa and then added and written back to oa. There's no
>>> way for the compiler to tell that the off heap areas pointed to by ia
>>> and oa don't overlap. So possibly, the value written to:
>>>
>>> oa + 8*i
>>>
>>> is going to be read back at the next iteration with:
>>>
>>> ia + 8*i
>>>
>>> (ia could be oa+8)
>>>
>>> The autovectorizer would need to insert a runtime check that the 2 areas
>>> don't overlap but there's no support for that at this point. I suppose
>>> the same issue exists with the MemorySegment API when memory is off
>>> heap.