RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store

Mon Apr 15 06:52:47 UTC 2024

On Thu, 21 Mar 2024 22:48:01 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>>>     * No RangeCheck smearing, or other CFG between the stores: `RC[0], store[0], RC[1], store[1], RC[2], store[2], RC[3], store[3]`. Not so simple. We can merge the 4 stores on the normal path, where all RC's pass. But we have to remove all old stores from that path. But the `RC[1], RC[2], RC[3]` false paths need some of those stores. So the only way I see is to duplicate all stores for the branches, so that we are sure that they sink out into the trap-paths.
>> 
>> I also think you need to duplicate stores. My opinion is that we want to stick with the simpler cases (your first and second bullets) unless it's obvious it doesn't cover all use cases. It's always possible to revisit the optimization down the road if it's observed that there are cases that are not covered.
>
>> > ```
>> > * No RangeCheck smearing, or other CFG between the stores: `RC[0], store[0], RC[1], store[1], RC[2], store[2], RC[3], store[3]`. Not so simple. We can merge the 4 stores on the normal path, where all RC's pass. But we have to remove all old stores from that path. But the `RC[1], RC[2], RC[3]` false paths need some of those stores. So the only way I see is to duplicate all stores for the branches, so that we are sure that they sink out into the trap-paths.
>> > ```
>> 
>> I also think you need to duplicate stores. My opinion is that we want to stick with the simpler cases (your first and second bullets) unless it's obvious it doesn't cover all use cases. It's always possible to revisit the optimization down the road if it's observed that there are cases that are not covered.
> 
> I completely agree with Roland.

@vnkozlov 
> Can we detect presence of RangeCheck which may cause us to move some stores on fail path and bailout the optimization. I don't think it is frequent case. I assume you will get RC on each store or not at all ("main" part of counted loop). Am I wrong here? I don't remember, does C2 optimize RangeCheck nodes in linear code (it does in loops)?

I know about 2 relevant optimizations that remove / move RangeChecks:
- RCE (RangeCheck Elimination from loops): hoist all RangeCheck before the loop. That way, there are no RangeChecks left in the loop, and there would be no RangeChecks between the stores we are merging.
- RangeCheck Smearing: this also applies in straight-line code, outside of loops. See `RangeCheckNode::Ideal`. Example:

RangeCheck[i+0]
Store[i+0]
RangeCheck[i+1]  <--- replaced with i+3 ("smearing" to cover all RC below)
Store[i+1]
RangeCheck[i+2]  <--- removed
Store[i+2]
RangeCheck[i+3]  <--- removed
Store[i+3]

becomes:

RangeCheck[i+0]
Store[i+0]
RangeCheck[i+3]  <--- the RangeCheck that remains between the first and the rest of the consecutive (and adjacent) stores.
Store[i+1]
Store[i+2]
Store[i+3]

I think the use-cases from @cl4es are often in straight-line code. Therefore we should cover the "smearing" case where exactly 1 RC remains in the sequence.

What you can also see in `RangeCheckNode::Ideal`: if we ever trap (or often enough, I don't remember) in one of the RangeChecks, then we disable `phase->C->allow_range_check_smearing()`. Then we don't do the smearing, and all the RC remain in the sequence. At that point, my optimization would fail since it sees more than 1 RC in the sequence.

Does that make sense? I should probably add this information in the comments, so  that it is clear why we worry about a single RC at all. People are probably going to wonder like you: "I assume you will get RC on each store or not at all".

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-2055667649