RFR: 8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures [v2]
Christian Hagedorn
chagedorn at openjdk.org
Tue Nov 19 14:55:04 UTC 2024
On Mon, 18 Nov 2024 08:04:35 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> **History**
>> This issue became apparent with https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155):
>> On machines that do not support SHA intrinsics, we execute the SHA code in Java. This Java code has a loop that previously did not vectorize, but now does since https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). It turns out that this kind of loop is actually slower when vectorized - this led to a regression, reported originally as:
>> `8334431: Regression 18-20% on Mac x64 on Crypto.signverify`
>>
>> I then investigated the issue thoroughly, and discovered that it was even an issue before https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I wrote a [blog-post ](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html) about the issue.
>>
>> **Summary of Problem**
>>
>> As described in the [blog-post ](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html), vectorization can introduce store-to-load-forwarding failures that were not present in the scalar loop code. Where in scalar code the loads and stores were all either exactly overlapping or non-overlapping, in vectorized code they can now be partially overlapping. When a store and a later load partially overlap, the store value cannot be forwarded directly from the store-buffer to the load (which would be fast); instead, the store must first drain to the L1 cache before the load can read it. This incurs a higher latency on the dependency edge from the store to the load.
>>
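>> For illustration, the problematic pattern looks roughly like this (a simplified sketch in the spirit of the benchmark; `data` and `SIZE` are placeholders):
>>
>>     for (int i = 3; i < SIZE; i++) {
>>         data[i] = data[i - 3] + 1;
>>     }
>>
>> In the scalar loop, the 4-byte store to `data[i]` and the 4-byte load of `data[i - 3]` either do not overlap at all or (3 iterations later) overlap exactly. Once the loop is vectorized with 2-element (8-byte) vectors, a store covers `data[i..i+1]` while a later load covers e.g. `data[i+1..i+2]` - a partial overlap, so on many CPUs the load cannot be served from the store-buffer.
>>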
>> **Benchmark**
>>
>> I introduced a new micro-benchmark in https://github.com/openjdk/jdk/pull/19880, and now further expanded it in this PR. You can see the extensive results in [this comment below](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698).
>>
>> The benchmarks look different on different machines, but they all have a pattern similar to this:
>>
>> [benchmark plots not included in the mail archive]
>>
>> We see that the `scalar` loop is faster for low `offset`, and the `vectorized` loop is faster for high offsets (and power-of-2 offse...
>
> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 25 commits:
>
> - manual merge
> - Merge branch 'master' into JDK-8334431-V-store-to-load-forwarding
> - Merge branch 'master' into JDK-8334431-V-store-to-load-forwarding
> - fix whitespace
> - fix tests and build
> - fix store-to-load forward IR rules
> - updates before the weekend ... who knows if they are any good
> - refactor to iteration threshold
> - use jvmArgs again, and apply same fix as 8343345
> - revert to jvmArgsPrepend
> - ... and 15 more: https://git.openjdk.org/jdk/compare/543e355b...000f9f13
src/hotspot/share/opto/c2_globals.hpp line 361:
> 359: "if >0, auto-vectorization detects possible store-to-load" \
> 360: "forwarding failures. The number specifies over how many" \
> 361: "loop iterations this detection spans.") \
You should add a whitespace at the end of each line to avoid connected words (e.g. "store-to-loadforwarding" and "manyloop"):
Suggestion:
"if >0, auto-vectorization detects possible store-to-load " \
"forwarding failures. The number specifies over how many " \
"loop iterations this detection spans.") \
src/hotspot/share/opto/vtransform.cpp line 162:
> 160: uint _memory_size;
> 161: bool _is_load; // load or store?
> 162: uint _order; // order in schedule
You could rename this to `_schedule_order`. Then you can remove the comment.
src/hotspot/share/opto/vtransform.cpp line 188:
> 186: int r1_inva_idx = r1->invar() == nullptr ? 0 : r1->invar()->_idx;
> 187: int r2_inva_idx = r2->invar() == nullptr ? 0 : r2->invar()->_idx;
> 188: RETURN_CMP_VALUE_IF_NOT_EQUAL(r1_inva_idx, r2_inva_idx);
Suggestion:
int r1_invar_idx = r1->invar() == nullptr ? 0 : r1->invar()->_idx;
int r2_invar_idx = r2->invar() == nullptr ? 0 : r2->invar()->_idx;
RETURN_CMP_VALUE_IF_NOT_EQUAL(r1_invar_idx, r2_invar_idx);
src/hotspot/share/opto/vtransform.cpp line 234:
> 232: // its value from the store-buffer, rather than from the L1 cache. This is many CPU cycles
> 233: // faster. However, this optimization comes with some restrictions, depending on the CPU.
> 234: // Generally, Store-to-load forwarding works if the load and store memory regions match
For consistency:
Suggestion:
// Generally, store-to-load-forwarding works if the load and store memory regions match
src/hotspot/share/opto/vtransform.cpp line 235:
> 233: // faster. However, this optimization comes with some restrictions, depending on the CPU.
> 234: // Generally, Store-to-load forwarding works if the load and store memory regions match
> 235: // exactly (same start and width). Generally problematic are partial overlaps - though
Should we also mention here that it also works when the loaded data is fully contained in the stored data (taken from your blog post)? Maybe you could also add some examples from your blog post, which helped me to understand this optimization better when I read about it for the first time.
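For example, something along these lines (just a sketch, the byte ranges are made up and the exact wording is of course up to you):

// Example: an 8-byte store to bytes [16, 24):
//   8-byte load from [16, 24): exact match      -> forwarded
//   4-byte load from [20, 24): fully contained  -> forwarded on many CPUs
//   8-byte load from [20, 28): partial overlap  -> forwarding fails, load waits for the L1 cache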
src/hotspot/share/opto/vtransform.cpp line 237:
> 235: // exactly (same start and width). Generally problematic are partial overlaps - though
> 236: // some CPU's can handle even some subsets of these cases. We conservatively assume that
> 237: // all such partial overlaps lead to a store-to-load-forwarding failure, which means the
Suggestion:
// all such partial overlaps lead to store-to-load-forwarding failures, which means the
src/hotspot/share/opto/vtransform.cpp line 238:
> 236: // some CPU's can handle even some subsets of these cases. We conservatively assume that
> 237: // all such partial overlaps lead to a store-to-load-forwarding failure, which means the
> 238: // load has to stall until the store goes from the store-buffer into the L1 cache, incuring
Suggestion:
// load has to stall until the store goes from the store-buffer into the L1 cache, incurring
src/hotspot/share/opto/vtransform.cpp line 247:
> 245: // }
> 246: //
> 247: // Assume we have a 2-element vectors (2*4 = 8 bytes). This gives us this machine code:
Suggestion:
// Assume we have 2-element vectors (2*4 = 8 bytes). This gives us this machine code:
src/hotspot/share/opto/vtransform.cpp line 259:
> 257: // be forwarded because there is some partial overlap.
> 258: //
> 259: // Preferrably, we would have some latency-based cost-model that accounts for such forwarding
Suggestion:
// Preferably, we would have some latency-based cost-model that accounts for such forwarding
src/hotspot/share/opto/vtransform.cpp line 260:
> 258: //
> 259: // Preferrably, we would have some latency-based cost-model that accounts for such forwarding
> 260: // failures, and decides if vectorization with forwarding failures is still profitable. For
Suggestion:
// failures, and decide if vectorization with forwarding failures is still profitable. For
src/hotspot/share/opto/vtransform.cpp line 261:
> 259: // Preferrably, we would have some latency-based cost-model that accounts for such forwarding
> 260: // failures, and decides if vectorization with forwarding failures is still profitable. For
> 261: // now we go with a simpler huristic: we simply forbid vectorization if we can PROVE that
Suggestion:
// now we go with a simpler heuristic: we simply forbid vectorization if we can PROVE that
src/hotspot/share/opto/vtransform.cpp line 263:
> 261: // now we go with a simpler huristic: we simply forbid vectorization if we can PROVE that
> 262: // there will be a forwarding failure. This approach has at least 2 possible weaknesses:
> 263: // (1) There may be forwarding failures in cases where we cannot prove it.
Maybe add an empty line here
Suggestion:
//
// (1) There may be forwarding failures in cases where we cannot prove it.
src/hotspot/share/opto/vtransform.cpp line 271:
> 269: // We do not know if aI and bI refer to the same array or not. However, it is reasonable
> 270: // to assume that if we have two different array references, that they most likely refer
> 271: // to different arrays, where we would have no forwarding failures.
For completeness:
Suggestion:
// to different arrays (i.e. no aliasing), where we would have no forwarding failures.
src/hotspot/share/opto/vtransform.cpp line 278:
> 276: // Performance measurements with the JMH benchmark StoreToLoadForwarding.java have indicated
> 277: // that there is some iteration threshold: if the failure happens between a store and load that
> 278: // have an iteration distance below this threshold, the latency is the limiting factor, and we
It's probably clear what you mean by "iteration distance", but just to be sure, you could note in your example above that the "iteration distance" there is 3.
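Just to spell out what I mean by that (a made-up loop, your example above presumably already has this shape):

    for (int i = 3; i < SIZE; i++) {
        data[i] = data[i - 3] + 1; // the value stored to data[i] is loaded again
                                   // 3 iterations later, so the iteration distance
                                   // between that store and the load is 3
    }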
src/hotspot/share/opto/vtransform.cpp line 292:
> 290: // To detect store-to-load-forwarding failures at the iteration threshold or below, we
> 291: // simulate a super-unrolling to reach SuperWordStoreToLoadForwardingFailureDetection
> 292: // iterations at least.
The comment could be a bit misleading when `simulated_unrolling_count` < `unrolled_count`. Looks like you then just do the analysis with the current `unrolled_count`. Can you extend this comment to mention that?
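For example (if I read the code correctly): with, say, `simulated_unrolling_count = 8` but `unrolled_count = 16`, we get `MAX2(1, 8 / 16) = MAX2(1, 0) = 1`, i.e. no additional simulated unrolling, and the detection then only spans the 16 already-unrolled iterations.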
src/hotspot/share/opto/vtransform.cpp line 297:
> 295: uint simulated_super_unrolling_count = MAX2(1, simulated_unrolling_count / unrolled_count);
> 296: int iv_stride = vloop_analyzer.vloop().iv_stride();
> 297: int order = 0;
To make it clearer what this order is:
Suggestion:
int schedule_order = 0;
src/hotspot/share/opto/vtransform.cpp line 307:
> 305: VTransformVectorNode* vector = vtn->isa_Vector();
> 306: uint vector_length = vector != nullptr ? vector->nodes().length() : 1;
> 307: records.push(VPointerRecord(p, iv_offset, vector_length, order++));
Suggestion:
records.push(VPointerRecord(p, iv_offset, vector_length, schedule_order++));
src/hotspot/share/opto/vtransform.cpp line 313:
> 311: }
> 312:
> 313: // Sort the pointers by group (same base, invar and stride), and by offset.
Suggestion:
// Sort the pointers by group (same base, invar and stride), and then by offset.
src/hotspot/share/opto/vtransform.cpp line 328:
> 326: #endif
> 327:
> 328: // For all pairs of pointers in the same group, check if they have partial overlap.
Suggestion:
// For all pairs of pointers in the same group, check if they have a partial overlap.
src/hotspot/share/opto/vtransform.cpp line 332:
> 330: VPointerRecord& record1 = records.at(i);
> 331:
> 332: for(int j = i + 1; j < records.length(); j++) {
Suggestion:
for (int j = i + 1; j < records.length(); j++) {
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848281955
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848409352
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848416569
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848289128
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848295567
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848296424
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848296769
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848297537
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848307762
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848308419
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848308782
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848309441
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848311419
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848397758
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848422595
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848408057
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848408767
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848436839
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848437407
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848464296