RFR: 8358521: Optimize vector operations by reassociating broadcasted inputs [v4]

Emanuel Peter epeter at openjdk.org
Tue Feb 24 10:34:56 UTC 2026


On Tue, 24 Feb 2026 06:17:30 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi all,
>> 
>> This patch optimizes SIMD kernels making heavy use of broadcasted inputs through following reassociating ideal transformations.
>> 
>> 
>>  VectorOperation (VectorBroadcast INP1,  VectorBroadcast INP2) => 
>>                             VectorBroadcast (ScalarOpration INP1, INP2)
>> 
>>  VectorOperation (VectorBroadcast INP1) (VectorOperation (VectorBroadcast INP2) INP3) => 
>>                              VectorOperation INP3 (VectorOperation (VectorBroadcast INP1) (VectorBroadcast INP2))
>> 
>> 
>> The idea is to push broadcasts across the vector operation and replace the vector with an equivalent, cheaper scalar variant.  Currently, patch handles most common vector operations.
>> 
>> Following are the performance number of benchmark included with this patch on latest generation x86 targets:- 
>> 
>> **AMD Turin (2.1GHz)**
>> <img width="1122" height="355" alt="image" src="https://github.com/user-attachments/assets/3f5087bf-0e14-4c56-b0c2-3d23253bad54" />
>> 
>> **Intel Granite Rapids (2.1GHz)**
>> <img width="1105" height="325" alt="image" src="https://github.com/user-attachments/assets/c8481f86-4db2-4c4e-bd65-51542c59fe63" />
>> 
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions

Interesting work, @jatin-bhateja !

Looks like a reasonable optimization.

We need to be very careful with subword types: the vector operations have truncating behavior, and the scalar operations do not. We need to test the optimization well for that. You are currently missing tests for `byte` etc, you need to add those :)

src/hotspot/share/opto/subnode.hpp line 524:

> 522:     if (c != nullptr) {
> 523:       C->add_expensive_node(this);
> 524:     }

Can you explain a bit more about this change? What is the flag used for, and why did you need to change it?

src/hotspot/share/opto/subnode.hpp line 524:

> 522:     if (c != nullptr) {
> 523:       C->add_expensive_node(this);
> 524:     }

Can you please explain why you made this change?

src/hotspot/share/opto/vectornode.cpp line 301:

> 299: // Return the scalar opcode for the specified vector opcode
> 300: // and basic type.
> 301: int VectorNode::scalar_opcode(int vopc, BasicType bt, bool enable_assertions) {

Do you really need the `enable_assertions` argument here? You always set it to `false` when calling it.
And the user could still assert if they get a `0` return.

src/hotspot/share/opto/vectornode.cpp line 301:

> 299: // Return the scalar opcode for the specified vector opcode
> 300: // and basic type.
> 301: int VectorNode::scalar_opcode(int vopc, BasicType bt, bool enable_assertions) {

`enable_assertions` is never used with `true`. You should just remove it.

src/hotspot/share/opto/vectornode.cpp line 442:

> 440:              "Vector node %s is not handled in VectorNode::scalar_opcode",
> 441:              NodeClassNames[vopc]);
> 442:       return 0; // Unimplemented

Maybe "not handled" is better here. That is also what your assert says. Because it may be that both the vector and scalar ops are implemented, we just don't handle it here.

src/hotspot/share/opto/vectornode.cpp line 442:

> 440:              "Vector node %s is not handled in VectorNode::scalar_opcode",
> 441:              NodeClassNames[vopc]);
> 442:       return 0; // Unimplemented

I would not say unimplemented, rather not handled. The scalar and vector operations may be implemented, just not handled here.

src/hotspot/share/opto/vectornode.cpp line 1194:

> 1192: 
> 1193: // Helper function to select a vector operation with all broadcasted inputs
> 1194: bool VectorNode::can_push_broadcasts_across_vector_operation(BasicType bt) {

The comment does not quite seem to match here. We don't do any "selecting" here, we just check and return a boolean, right?

src/hotspot/share/opto/vectornode.cpp line 1194:

> 1192: 
> 1193: // Helper function to select a vector operation with all broadcasted inputs
> 1194: bool VectorNode::can_push_broadcasts_across_vector_operation(BasicType bt) {

The comment does not match. The method does no selection, rather it just checks. Please fix :)

src/hotspot/share/opto/vectornode.cpp line 1213:

> 1211: }
> 1212: 
> 1213: Node* VectorNode::scalar_node_factory(Compile* c, int sopc, Node* control, Node* in1, Node* in2, Node* in3) {

I think generally, consistency with our naming scheme would be `make_scalar`, rather than "factory".

src/hotspot/share/opto/vectornode.cpp line 1213:

> 1211: }
> 1212: 
> 1213: Node* VectorNode::scalar_node_factory(Compile* c, int sopc, Node* control, Node* in1, Node* in2, Node* in3) {

I think we usually call factory methods like this: `make_scalar`.
Also: Compile is usually an upper case `C`.

src/hotspot/share/opto/vectornode.cpp line 1280:

> 1278:       return new FmaDNode(in1, in2, in3);
> 1279:     default:
> 1280:       return nullptr;

Is returning `nullptr` ok? Or should we have checked the conditions earlier?
Could we add an assert here?

src/hotspot/share/opto/vectornode.cpp line 1280:

> 1278:       return new FmaDNode(in1, in2, in3);
> 1279:     default:
> 1280:       return nullptr;

Could we add an assert?

src/hotspot/share/opto/vectornode.cpp line 1294:

> 1292:   cloned_parent->set_req(2, pinput2);
> 1293:   return cloned_parent;
> 1294: }

Can you add a comment that explains the code shape before and after?
Otherwise I'd have to reconstruct the semantics from the call site.

Also: the method is non-static, and `partent == this`. Is that on purpose?

src/hotspot/share/opto/vectornode.cpp line 1294:

> 1292:   cloned_parent->set_req(2, pinput2);
> 1293:   return cloned_parent;
> 1294: }

Multiple points:
- This method is non-static, and it seems all calls to it have `this == parent`.
- Can you please add a comment that shows the transformation using the input names?

src/hotspot/share/opto/vectornode.cpp line 1326:

> 1324:       return create_reassociated_node(this, in2, in1, in2_1, in2_2, phase);
> 1325:     } else if (in2_2->Opcode() == Op_Replicate) {
> 1326:       return create_reassociated_node(this, in2, in1, in2_2, in2_1, phase);

It looks like you made `create_reassociated_node` non-static, and still pass in `this` as the first argument. Is that on purpose?

src/hotspot/share/opto/vectornode.cpp line 1339:

> 1337: //   VectorBroadcast (ScalarOperation INP1, INP2)
> 1338: //
> 1339: Node* VectorNode::push_broadcast_across_vector_operation(PhaseGVN* phase) {

I would rename it to `push_through_broadcast`, to keep it similar to `split_thru_region/phi` and `push_thru_add` etc.

src/hotspot/share/opto/vectornode.cpp line 1339:

> 1337: //   VectorBroadcast (ScalarOperation INP1, INP2)
> 1338: //
> 1339: Node* VectorNode::push_broadcast_across_vector_operation(PhaseGVN* phase) {

I would call this method like `split_thru_phi` or `push_through_add`:
`push_through_broadcast`.

src/hotspot/share/opto/vectornode.cpp line 1361:

> 1359: 
> 1360:     Node* sop = scalar_node_factory(phase->C, scalar_opcode(Opcode(), bt, false), in(0), sinp1, sinp2, sinp3);
> 1361:     if (sop) {

Suggestion:

    if (sop != nullptr) {

Implicit null check not allowed (hotspot style guide)

src/hotspot/share/opto/vectornode.cpp line 1363:

> 1361:     if (sop) {
> 1362:       sop = phase->transform(sop);
> 1363:       return new ReplicateNode(sop, vect_type());

Suppose we have an `add` operation on a `byte` vector. The operations are truncated. But the `AddI` operation would overflow the `byte` range. Does the `Replicate` node ensure that truncation?

src/hotspot/share/opto/vectornode.cpp line 1363:

> 1361:     if (sop) {
> 1362:       sop = phase->transform(sop);
> 1363:       return new ReplicateNode(sop, vect_type());

How sure are you that the truncating vector operations are correctly optimized with the non-truncating scalar operations? Is it guaranteed that `Replicate` does the truncation?

src/hotspot/share/opto/vectornode.cpp line 2104:

> 2102:     return VectorNode::degenerate_vector_rotate(in(1), in(2), true, vlen, bt, phase);
> 2103:   }
> 2104:   return VectorNode::Ideal(phase, can_reshape);

Would it not make sense to do the `Ideal` before the possible degeneration? Just in case we can optimize away the rotation, by pushing it through the broadcast?

src/hotspot/share/opto/vectornode.cpp line 2104:

> 2102:     return VectorNode::degenerate_vector_rotate(in(1), in(2), true, vlen, bt, phase);
> 2103:   }
> 2104:   return VectorNode::Ideal(phase, can_reshape);

Similar for other cases below: why not do `Ideal` before degeneration?

src/hotspot/share/opto/vectornode.cpp line 2411:

> 2409:     }
> 2410:   }
> 2411:   return VectorNode::Ideal(phase, can_reshape);

Same question here: why not first try to do `Ideal` before resorting to degeneration?

test/hotspot/jtreg/compiler/vectorapi/TestVectorBroadcastReassociations.java line 42:

> 40: 
> 41:     public static void main(String[] args) {
> 42:         TestFramework.runWithFlags("--add-modules=jdk.incubator.vector");

This is really important: you need to test the subword types. We need to be sure that the truncated behavior of the vector operations are correctly optimized with the non-truncating scalar operations.

test/hotspot/jtreg/compiler/vectorapi/TestVectorBroadcastReassociations.java line 50:

> 48: 
> 49:     static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED;
> 50:     static int ia = 17, ib = 9;

We need to have random inputs. Because if something goes wrong, and for some reason the vector and scalar operations don't have the same semantics, we need to catch that.

Vector operations have truncation, and scalar operations do not. We need to make sure we test that behavior.

test/hotspot/jtreg/compiler/vectorapi/TestVectorBroadcastReassociations.java line 50:

> 48: 
> 49:     static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED;
> 50:     static int ia = 17, ib = 9;

You need random input values that span the whole range.

We really need to be careful to test that the vector and scalar operations do the same thing here. We've had issues with truncation behavior before. So you need to test overflows.

test/hotspot/jtreg/compiler/vectorapi/TestVectorBroadcastReassociations.java line 55:

> 53:     @IR(failOn = IRNode.ADD_VI,
> 54:         applyIfCPUFeatureOr = {"avx", "true", "asimd", "true"},
> 55:         counts = { IRNode.ADD_I, ">= 1", IRNode.REPLICATE_I, IRNode.VECTOR_SIZE_ANY, ">= 1" })

Why do you have the `VECTOR_SIZE_ANY`?

test/hotspot/jtreg/compiler/vectorapi/TestVectorBroadcastReassociations.java line 60:

> 58:                 .add(IntVector.broadcast(ISP, ib))
> 59:                 .reduceLanes(VectorOperators.ADD);
> 60:     }

Why do you have the `VECTOR_SIZE_ANY`? Do we not get a replicate with the maximal size?

I don't think it is wise to take an ADD reduction here, because that may also produce a ADD_I node. And then you're not sure if the IR rule finds that one or the one generated from the broadcast optimization.

test/hotspot/jtreg/compiler/vectorapi/TestVectorBroadcastReassociations.java line 136:

> 134:      * ======================= */
> 135: 
> 136:     static final VectorSpecies<Long> LSP = LongVector.SPECIES_PREFERRED;

Why don't you have the logical operators for long tested below?

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25617#pullrequestreview-3846137446
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845457463
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845895589
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845472407
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845898307
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845572109
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845902534
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845591368
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845909648
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845695003
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845914938
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845703712
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845916417
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845718222
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845923701
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845710822
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845772045
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845928736
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845799719
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845815077
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845936327
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845829459
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845938602
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845835099
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845891340
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845851868
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845875701
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845856355
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845869949
PR Review Comment: https://git.openjdk.org/jdk/pull/25617#discussion_r2845881322


More information about the core-libs-dev mailing list