RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2]

Xiaohong Gong xgong at openjdk.org
Tue Jul 1 07:10:41 UTC 2025


On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Ping again! Thanks in advance!
>
>> @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look.
>> 
>> I quickly looked through the new benchmark results you published after integration of #25539. There still seem to be a few cases where `Gain < 1`, especially:
>> 
>> ```
>> GatherOperationsBenchmark.microShortGather512_MASK         256 thrpt  30  ops/ms 11587.465  10674.598  0.92
>> GatherOperationsBenchmark.microShortGather512_MASK        1024 thrpt  30  ops/ms  2902.731   2629.739  0.90
>> GatherOperationsBenchmark.microShortGather512_MASK        4096 thrpt  30  ops/ms   741.546    671.124  0.90
>> ```
>> 
>> and
>> 
>> ```
>> GatherOperationsBenchmark.microShortGather256_MASK         256 thrpt  30  ops/ms 11339.217  10951.141  0.96
>> GatherOperationsBenchmark.microShortGather256_MASK        1024 thrpt  30  ops/ms  2840.081   2718.823  0.95
>> GatherOperationsBenchmark.microShortGather256_MASK        4096 thrpt  30  ops/ms   725.334    696.343  0.96
>> ```
>> 
>> and
>> 
>> ```
>> GatherOperationsBenchmark.microByteGather512_MASK           64 thrpt  30  ops/ms 50588.210  48220.741  0.95
>> ```
>> 
>> Do you know what happens in those cases?
> 
> Thanks for your input! Yes, I spent some time analyzing these small regressions. They seem to come from hardware effects such as cache misses or code alignment. When I tried a larger loop alignment such as 32, the performance improved and the regressions disappeared. Since I'm not very familiar with x86 architectures, I'm not sure about the exact cause. Any suggestions on that?

> @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you with the x86 specific issues. You could always use hardware counters to measure cache misses. Also if the vectors are not cache-line aligned, there may be split loads or stores. Also that can be measured with hardware counters. Maybe the benchmark needs to be improved somehow, to account for issues with alignment.

I also measured cache misses with perf on my x86 machine, and they did increase. The generated code layout of the test/benchmark changes with my Java-side changes, so I suspect the loop alignment differs from before. To verify this, I ran with the VM option `-XX:OptoLoopAlignment=32`, and performance improved considerably compared with the baseline without my change. So I think the patch itself should be acceptable even though we see these minor regressions.
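
For reference, below is a minimal, illustrative sketch of the kind of masked short gather kernel these benchmark cases exercise, built on `jdk.incubator.vector`. The class name, species choice, and loop shape are my own assumptions for illustration and not the actual `GatherOperationsBenchmark` source; the point is only to show the code shape whose alignment sensitivity is being discussed.

```java
// Illustrative sketch only (assumed shape, not the real GatherOperationsBenchmark):
// a masked short gather kernel using the Vector API.
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedShortGatherSketch {
    static final VectorSpecies<Short> SPECIES = ShortVector.SPECIES_256;

    // Gathers src[index[i + lane]] into dst[i + lane] for the lanes enabled in maskBits.
    // index.length is assumed to be a multiple of the species length.
    static void gatherMasked(short[] src, int[] index, boolean[] maskBits, short[] dst) {
        VectorMask<Short> mask = VectorMask.fromArray(SPECIES, maskBits, 0);
        for (int i = 0; i < index.length; i += SPECIES.length()) {
            ShortVector v = ShortVector.fromArray(SPECIES, src, 0, index, i, mask);
            v.intoArray(dst, i);
        }
    }
}
```

Running such a kernel with and without `-XX:OptoLoopAlignment=32` under `perf stat -e cache-misses` is one way to confirm whether the remaining gap is an alignment artifact rather than the gather implementation itself.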

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022195040

