RFR: 8331341: secondary_super_cache does not scale well: C1 and interpreter [v10]

Mon Jul 29 13:10:36 UTC 2024

On Mon, 29 Jul 2024 10:35:07 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> This patch expands the use of a hash table for secondary superclasses
>> to the interpreter, C1, and runtime. It also adds a C2 implementation
>> of hashed lookup in cases where the superclass isn't known at compile
>> time.
>> 
>> HotSpot shared runtime
>> ----------------------
>> 
>> Building hashed secondary tables is now unconditional. It takes very
>> little time, and now that the shared runtime always has the tables, it
>> might as well take advantage of them. The shared code is easier to
>> follow now, I think.
>> 
>> There might be a performance issue with x86-64 in that we build
>> HotSpot for a default x86-64 target that does not support popcount.
>> This means that HotSpot C++ runtime on x86 always uses a software
>> emulation for popcount, even though the vast majority of machines made
>> for the past 20 years can do popcount in a single instruction. It
>> wouldn't be terribly hard to do something about that.
>> 
>> Having said that, the software popcount is really not bad.
>> 
>> x86
>> ---
>> 
>> x86 is rather tricky, because we still support
>> `-XX:-UseSecondarySupersTable` and `-XX:+UseSecondarySupersCache`, as
>> well as 32- and 64-bit ports. There's some further complication in
>> that only `RCX` can be used as a shift count, so there's some register
>> shuffling to do. All of this makes the logic in macroAssembler_x86.cpp
>> rather gnarly, with multiple levels of conditionals at compile time
>> and runtime.
>> 
>> AArch64
>> -------
>> 
>> AArch64 is considerably more straightforward. We always have a
>> popcount instruction and (thankfully) no 32-bit code to worry about.
>> 
>> Generally
>> ---------
>> 
>> I would dearly love simply to rip out the "old" secondary supers cache
>> support, but I've left it in just in case someone has a performance
>> regression.
>> 
>> The versions of `MacroAssembler::lookup_secondary_supers_table` that
>> work with variable superclasses don't take a fixed set of temp
>> registers, and neither do they call out to a slow path subroutine.
>> Instead, the slow path is expanded inline.
>> 
>> I don't think this is necessarily bad. Apart from the very rare cases
>> where C2 can't determine the superclass to search for at compile time,
>> this code is only used for generating stubs, and it seemed to me
>> ridiculous to have stubs calling other stubs.
>> 
>> I've followed the guidance from @iwanowww not to obsess too much about
>> the performance of C1-compiled secondary supers lookups, and to prefer
>> simplicity over absolute performance. Nonetheless, this i...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Minor

I promise that if you say you really want this change, I will do it, but there is a cost I want to make clear.

Adding the full-bitmap test at the start of the fast-path code increases the execution time in the case of `SecondarySupersLookup.testPositive03` from 5 cycles/op to 5.5 cycles/op on average. It also adds at least 5 bytes (8 bytes for AArch64) to the inline code size, depending on how you do it.

In contrast, my proposed fix makes the invariant `secondary_supers.length >= popcount(bitmap)` truly invariant, and changes the full-bitmap test at the start of the slow path as follows, to avoid a performance regression with a nearly full bitmap:

--- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
+++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
@@ -5212,8 +5212,8 @@ void MacroAssembler::lookup_secondary_supers_table_slow_path(Register r_super_kl
   // The bitmap is full to bursting.
   // Implicit invariant: BITMAP_FULL implies (length > 0)
   assert(Klass::SECONDARY_SUPERS_BITMAP_FULL == ~uintx(0), "");
-  cmpq(r_bitmap, (int32_t)-1); // sign-extends immediate to 64-bit value
-  jcc(Assembler::equal, L_huge);
+  cmpq(r_array_length, (int32_t)SECONDARY_SUPERS_TABLE_SIZE - 2);
+  jcc(Assembler::greater, L_huge);
 
--- a/src/hotspot/share/oops/klass.cpp
+++ b/src/hotspot/share/oops/klass.cpp
@@ -344,11 +370,12 @@ uintx Klass::hash_secondary_supers(Array<Klass*>* secondaries, bool rewrite) {
     return uintx(1) << hash_slot;
   }
 
-  // For performance reasons we don't use a hashed table unless there
-  // are at least two empty slots in it. If there were only one empty
-  // slot it'd take a long time to create the table and the resulting
-  // search would be no faster than linear probing.
-  if (length > SECONDARY_SUPERS_TABLE_SIZE - 2) {
+  // Invariant: _secondary_supers.length >= population_count(_secondary_supers_bitmap)
+
+  // Don't attempt to hash a table that's completely full, because in
+  // the case of an absent interface linear probing would not
+  // terminate.
+  if (length >= SECONDARY_SUPERS_TABLE_SIZE) {
     return SECONDARY_SUPERS_BITMAP_FULL;
   }
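
The termination concern in the new klass.cpp comment can be illustrated with a toy model. This is only a sketch, not HotSpot code: `ToyTable`, `toy_hash`, `build_toy_table`, and `toy_contains` are hypothetical names, the table size of 64 assumes a 64-bit `uintx`, and the hash is a stand-in. It does not model the slow-path shortcut for nearly full tables. The point is that a probe for an absent key stops only when it reaches a clear bit, so a hashed table must keep at least one slot empty.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model only: hypothetical names, not HotSpot's data structures.
constexpr int kTableSize = 64;                  // stands in for SECONDARY_SUPERS_TABLE_SIZE (assumed 64)
constexpr uint64_t kBitmapFull = ~uint64_t(0);  // stands in for SECONDARY_SUPERS_BITMAP_FULL

struct ToyTable {
  uint64_t bitmap = 0;      // bit s set <=> slot s is occupied
  std::vector<int> packed;  // occupied slots in slot order, or the raw list if not hashed
  bool hashed = false;
};

inline int toy_hash(int key) { return key & (kTableSize - 1); }  // stand-in hash

inline int popcount64(uint64_t x) {
  return static_cast<int>(std::bitset<64>(x).count());
}

// Builds a table from distinct keys. A completely full table is never hashed,
// mirroring the `length >= SECONDARY_SUPERS_TABLE_SIZE` test in the diff.
ToyTable build_toy_table(const std::vector<int>& keys) {
  ToyTable t;
  if (static_cast<int>(keys.size()) >= kTableSize) {
    t.bitmap = kBitmapFull;
    t.packed = keys;  // searched linearly
    return t;
  }
  int slots[kTableSize];
  for (const int key : keys) {
    int s = toy_hash(key);
    while ((t.bitmap >> s) & 1) s = (s + 1) & (kTableSize - 1);  // linear probing
    slots[s] = key;
    t.bitmap |= uint64_t(1) << s;
  }
  for (int s = 0; s < kTableSize; s++) {
    if ((t.bitmap >> s) & 1) t.packed.push_back(slots[s]);
  }
  t.hashed = true;
  return t;
}

bool toy_contains(const ToyTable& t, int key) {
  if (!t.hashed) {  // full bitmap: fall back to a linear scan
    for (const int k : t.packed) if (k == key) return true;
    return false;
  }
  // Terminates only because a hashed table has at least one clear bit.
  for (int s = toy_hash(key); (t.bitmap >> s) & 1; s = (s + 1) & (kTableSize - 1)) {
    // Index into the packed array = number of occupied slots below s.
    const int index = popcount64(t.bitmap & ((uint64_t(1) << s) - 1));
    if (t.packed[index] == key) return true;
  }
  return false;
}
```

With 63 distinct keys the toy table is still hashed and a miss ends at the single empty slot; with 64 keys the bitmap is full and the lookup falls back to a linear scan. In both cases `packed.size() >= popcount64(bitmap)` holds.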
 


So, what I'm suggesting is a bit smaller, a bit faster, and less work for me. On the other hand, you say:

> It doesn't look right when the code treats secondary_supers as a table irrespective of whether it was hashed or not. IMO it unnecessarily complicates things and may continue to be a source of bugs.

I agree about the "It doesn't look right" part, but I'm not sure I agree about the cause of the bug. IMO, that was the failure to make `secondary_supers.length >= popcount(bitmap)` truly invariant.
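
As a worked check of the threshold arithmetic in the klass.cpp hunk, here is a small sketch. The helper names `bitmap_popcount` and `invariant_holds` are hypothetical. It assumes a 64-slot table (64-bit `uintx`), distinct entries that each set exactly one bit when hashed, and a full bitmap with all 64 bits set. Under the old `length > SECONDARY_SUPERS_TABLE_SIZE - 2` test, a class with 63 secondary supers gets the full bitmap, so the population count (64) exceeds the length (63); under the new `length >= SECONDARY_SUPERS_TABLE_SIZE` test it does not.

```cpp
#include <cassert>

constexpr int kSlots = 64;  // stands in for SECONDARY_SUPERS_TABLE_SIZE (assumed 64)

// Population count of the bitmap that a class with `length` distinct
// secondary supers would end up with under the old or the new threshold.
int bitmap_popcount(int length, bool old_threshold) {
  const bool full = old_threshold ? (length > kSlots - 2)  // removed test
                                  : (length >= kSlots);    // added test
  return full ? kSlots : length;  // full bitmap: all bits set; hashed: one bit per entry
}

// The invariant from the new comment: length >= population_count(bitmap).
bool invariant_holds(int length, bool old_threshold) {
  return length >= bitmap_popcount(length, old_threshold);
}
```

In this model the only length that breaks the invariant is 63 under the old threshold; every length satisfies it under the new one.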

Your call.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19989#issuecomment-2255892483