RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3]

Tue Sep 9 11:51:00 UTC 2025

On Fri, 5 Sep 2025 17:17:52 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> This patch optimizes PopCount value transforms using KnownBits information.
>> Following are the results of the micro-benchmark included with the patch
>> 
>> 
>> 
>> System: 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                      Mode  Cnt       Score   Error  Units
>> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  215460.670          ops/s
>> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  294014.826          ops/s
>> PopCountValueTransform.StockKernelInt         thrpt    2  409295.875          ops/s
>> PopCountValueTransform.StockKernelLong        thrpt    2  368025.608          ops/s
>> 
>> Withopt:
>> Benchmark                                      Mode  Cnt       Score   Error  Units
>> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  389978.082          ops/s
>> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  417261.583          ops/s
>> PopCountValueTransform.StockKernelInt         thrpt    2  418649.269          ops/s
>> PopCountValueTransform.StockKernelLong        thrpt    2  381330.221          ops/s
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update countbitsnode.cpp

Very nice improvement @jatin-bhateja , thanks for working on it :)

src/hotspot/share/opto/countbitsnode.cpp line 125:

> 123:         range is computed using the following formulas:-
> 124:         - _hi = ~ZEROS
> 125:         - _lo = ONES

Is there going to be some other Lemma here, that gives rise to the numbering? I'd just remove the numbering.

src/hotspot/share/opto/countbitsnode.cpp line 128:

> 126: Proof:-
> 127:   - KnownBits.ZEROS and KnownBits.ONES are inferred out of the common prefix of the value range
> 128:     delimiting bounds.

It could come from the range. But it could also come from individual bits being and-ed or or-ed to 1 or 0. I'll give an alternative suggestion below.

src/hotspot/share/opto/countbitsnode.cpp line 145:

> 143:     B) Now, transform the computed knownbits back to the value range.
> 144:       _new_lo = _known_bits.ones  = 0b11000100
> 145:       _new_hi = ~known_bits.zeros = 0b11000111

This kinda duplicates all the descriptions that we have in KnownBits. I would drop it. Or maybe just refer to something over there.

src/hotspot/share/opto/countbitsnode.cpp line 149:

> 147:   - We now know that ~KnownBits.ZEROS >= UB >= LB >= KnownBits.ONES
> 148:   - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower
> 149:     bounds of the result value range.

I don't quite see how that follows from the proof. And I'm also worried about the correctness.

You are using the signed `_lo` and `_hi`. But the zeros and ones are unsigned. So it is a bit unclear what your comparisons prove here - you should probably cast one to signed or the other to unsigned to make things explicit.

One crucial step here is also the linearity assumption of `popcount`. You'd need to show or at least assert that:

~KnownBits.ZEROS >= UB >= t >= LB >= KnownBits.ONES
implies
popcount(~KnownBits.ZEROS) >= popcount(UB) >= popcount(t) >= popcount(LB) >= popcount(KnownBits.ONES)

It all sounds a bit complicated, and I think I would prefer something along the lines of what @SirYwell suggested.

src/hotspot/share/opto/countbitsnode.cpp line 150:

> 148:   - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower
> 149:     bounds of the result value range.
> 150: */

Suggestion:

// We use the KnownBits information from the integer types to derive how many one bits
// we have at least and at most.
// From the definition of KnownBits, we know:
//   zeros: Indicates which bits must be 0: ones[i] =1 -> t[i]=0
//   ones:  Indicates which bits must be 1: zeros[i]=1 -> t[i]=1
//
// From this, we derive:
//   numer_of_zeros_in_t >= pop_count(zeros)
//   -> number_of_ones_in_t <= bits_per_type - pop_count(zeros) = pop_count(~zeros)
//   number_of_ones_in_t >= pop_count(ones)
//
// By definition:
//   pop_count(t) = number_of_ones_in_t
//
// It follows:
//   pop_count(ones) <= pop_count(t) <= pop_count(~zeros)
//
// Note: signed _lo and _hi, as well as unsigned _ulo and _uhi bounds of the integer types
//       are already reflected in the KnownBits information, see TypeInt / TypeLong definitions.

Feel free to adjust the formulation :)

test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 74:

> 72:         }
> 73:         return 1;
> 74:     }

Can we not assert that there is exactly one popcount? The two should fold to one, no?

test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 114:

> 112:         }
> 113:         return 1;
> 114:     }

Thanks for the tests!

I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks.

For example:
`num = (num | ONES) & ZEROS;`

And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`?
Then round it off with some random range comparisons at the end:
`        if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {`

test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 148:

> 146: 
> 147:     public static void main(String[] args) {
> 148:         TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.2");

Can you explain the need for these flags?
The TestFramework eventually enqueues for compilation anyway. Or is there something about profiling?

test/micro/org/openjdk/bench/java/lang/PopCountValueTransform.java line 79:

> 77:         }
> 78:         return res;
> 79:     }

I assume the `stock` kernels are there to show performance if there is no op, the `folding` kernels you hope have the same performance. It would be nice to have one where the `bitCount` does not fold away, just to keep that comparison :)

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/27075#pullrequestreview-3200918107
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333099461
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333106910
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333117632
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333147248
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333189002
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333222066
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333199688
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333087776
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333210925