RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3]
Emanuel Peter
epeter at openjdk.org
Tue Sep 9 11:51:00 UTC 2025
On Fri, 5 Sep 2025 17:17:52 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> This patch optimizes PopCount value transforms using KnownBits information.
>> Following are the results of the micro-benchmark included with the patch
>>
>>
>>
>> System: 13th Gen Intel(R) Core(TM) i3-1315U
>>
>> Baseline:
>> Benchmark Mode Cnt Score Error Units
>> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s
>> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s
>> PopCountValueTransform.StockKernelInt thrpt 2 409295.875 ops/s
>> PopCountValueTransform.StockKernelLong thrpt 2 368025.608 ops/s
>>
>> Withopt:
>> Benchmark Mode Cnt Score Error Units
>> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s
>> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s
>> PopCountValueTransform.StockKernelInt thrpt 2 418649.269 ops/s
>> PopCountValueTransform.StockKernelLong thrpt 2 381330.221 ops/s
>>
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Update countbitsnode.cpp
Very nice improvement @jatin-bhateja , thanks for working on it :)
src/hotspot/share/opto/countbitsnode.cpp line 125:
> 123: range is computed using the following formulas:-
> 124: - _hi = ~ZEROS
> 125: - _lo = ONES
Is there going to be some other Lemma here, that gives rise to the numbering? I'd just remove the numbering.
src/hotspot/share/opto/countbitsnode.cpp line 128:
> 126: Proof:-
> 127: - KnownBits.ZEROS and KnownBits.ONES are inferred out of the common prefix of the value range
> 128: delimiting bounds.
It could come from the range. But it could also come from individual bits being and-ed or or-ed to 1 or 0. I'll give an alternative suggestion below.
src/hotspot/share/opto/countbitsnode.cpp line 145:
> 143: B) Now, transform the computed knownbits back to the value range.
> 144: _new_lo = _known_bits.ones = 0b11000100
> 145: _new_hi = ~known_bits.zeros = 0b11000111
This kinda duplicates all the descriptions that we have in KnownBits. I would drop it. Or maybe just refer to something over there.
src/hotspot/share/opto/countbitsnode.cpp line 149:
> 147: - We now know that ~KnownBits.ZEROS >= UB >= LB >= KnownBits.ONES
> 148: - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower
> 149: bounds of the result value range.
I don't quite see how that follows from the proof. And I'm also worried about the correctness.
You are using the signed `_lo` and `_hi`. But the zeros and ones are unsigned. So it is a bit unclear what your comparisons prove here - you should probably cast one to signed or the other to unsigned to make things explicit.
One crucial step here is also the linearity assumption of `popcount`. You'd need to show or at least assert that:
~KnownBits.ZEROS >= UB >= t >= LB >= KnownBits.ONES
implies
popcount(~KnownBits.ZEROS) >= popcount(UB) >= popcount(t) >= popcount(LB) >= popcount(KnownBits.ONES)
It all sounds a bit complicated, and I think I would prefer something along the lines of what @SirYwell suggested.
src/hotspot/share/opto/countbitsnode.cpp line 150:
> 148: - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower
> 149: bounds of the result value range.
> 150: */
Suggestion:
// We use the KnownBits information from the integer types to derive how many one bits
// we have at least and at most.
// From the definition of KnownBits, we know:
// zeros: Indicates which bits must be 0: ones[i] =1 -> t[i]=0
// ones: Indicates which bits must be 1: zeros[i]=1 -> t[i]=1
//
// From this, we derive:
// numer_of_zeros_in_t >= pop_count(zeros)
// -> number_of_ones_in_t <= bits_per_type - pop_count(zeros) = pop_count(~zeros)
// number_of_ones_in_t >= pop_count(ones)
//
// By definition:
// pop_count(t) = number_of_ones_in_t
//
// It follows:
// pop_count(ones) <= pop_count(t) <= pop_count(~zeros)
//
// Note: signed _lo and _hi, as well as unsigned _ulo and _uhi bounds of the integer types
// are already reflected in the KnownBits information, see TypeInt / TypeLong definitions.
Feel free to adjust the formulation :)
test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 74:
> 72: }
> 73: return 1;
> 74: }
Can we not assert that there is exactly one popcount? The two should fold to one, no?
test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 114:
> 112: }
> 113: return 1;
> 114: }
Thanks for the tests!
I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks.
For example:
`num = (num | ONES) & ZEROS;`
And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`?
Then round it off with some random range comparisons at the end:
` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {`
test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 148:
> 146:
> 147: public static void main(String[] args) {
> 148: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.2");
Can you explain the need for these flags?
The TestFramework eventually enqueues for compilation anyway. Or is there something about profiling?
test/micro/org/openjdk/bench/java/lang/PopCountValueTransform.java line 79:
> 77: }
> 78: return res;
> 79: }
I assume the `stock` kernels are there to show performance if there is no op, the `folding` kernels you hope have the same performance. It would be nice to have one where the `bitCount` does not fold away, just to keep that comparison :)
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27075#pullrequestreview-3200918107
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333099461
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333106910
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333117632
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333147248
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333189002
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333222066
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333199688
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333087776
PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333210925
More information about the hotspot-compiler-dev
mailing list