RFR: 8357258: x86: Improve receiver type profiling reliability [v3]

Sat Nov 22 21:45:51 UTC 2025

On Wed, 24 Sep 2025 13:08:14 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> See the bug for discussion what issues current machinery has. 
>> 
>> This PR executes the plan outlined in the bug:
>>  1. Common the receiver type profiling code in interpreter and C1
>>  2. Rewrite receiver type profiling code to only do atomic receiver slot installations
>>  3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed 
>> 
>> This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral.
>> 
>> Additional testing:
>>   - [x] Linux x86_64 server fastdebug, `compiler/`
>>   - [x] Linux x86_64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
> 
>  - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls
>  - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls
>  - Drop atomic counters
>  - Initial version

Code is good.  Consider changing a name and adding documentation.

src/hotspot/cpu/x86/interp_masm_x86.cpp line 524:

> 522:   LP64_ONLY(assert(Rsub_klass != r13, "r13 holds bcp");)
> 523:   assert(Rsub_klass != rcx, "rcx holds 2ndary super array length");
> 524:   assert(Rsub_klass != rdi, "rdi holds 2ndary super array scan ptr");

I think you can kill this assert as well; rdi is no longer relevant to this function.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4760:

> 4758: }
> 4759: 
> 4760: void MacroAssembler::type_profile(Register recv, Register mdp, int mdp_offset) {

The name chosen is subtly misleading.  We have value (argument/parameter/return) profiling as well as receiver profiling.  Since this particular macro-instruction is closely coupled to `ReceiverTypeData`, I suggest calling it `profile_receiver_type`, and documenting, up top, that it is precisely for collecting data into that structure.

The name being replaced (`record_klass_in_profile_helper`) has the same problem.  This is a historical artifact; the name was chosen before other sorts of type profiles were introduced.

(And `profile_receiver_type` is surely better than `receiver_type_profile`, which is not a verb phrase.)

Eventually we may wish to improve the other kinds of profiling, which have their own structures and representations.  I thought for a while about what that might look like, and particularly if it factored into a different set of macro-instructions.  Could we factor this proposed macro into a "find entry" part and an "increment counter" part?  But no, it doesn't seem to pay off.  There's benefit to preserving the jewel-like conciseness of the code pattern here.  So I guess future work on other type profiles is mostly independent.

But we do need a more specific name, that makes very clear the coupling to `ReceiverTypeData`.  Even if the old code had that problem also.  Putting it way out here in the macro-assembler makes such a problem worse, since the interpreter "knows about" MDOs, but the macro-assembler doesn't.

I don't object to moving this down to the macro-assembler.  It is no longer coupled to the interpreter, after the JIT learned the same trick.  I think we should prepare ourselves, mentally, for similar moves with the other type profile mechanisms.

I think the definition of `class ReceiverTypeData` should mention this macro.  Otherwise we won't know where to look for updates (since it's no longer bundled with the interpreter).  This macro is, in effect, a member of that class.   (That's true of other MDO structures:  Random assembly code is part of their APIs.  The C++ code is very vague about how and where this happens.  That's a problem for another time, I guess.)

Another point.  I would like to see pseudo-code that sketches what this complicated macro emits.  (I was the author of the other pseudo-code deleted by this patch; I like that sort of thing.)  I suggest:

// Traverse and update a ReceiverTypeData record in a method-data object.
// This operation can be performed either by the interpreter or by JIT code.
// The receiver klass has already been loaded into recv.
// The base address of the MDO is mdp, and the byte offset mdp_offset is also applied.
// The emitted code traverses the array of entries and picks
// one where the expected receiver matches, or allocates a free one if necessary.
// If a matching entry exists (perhaps upon creation), a receiver count is incremented.
// If no matching entry exists, the shared (CounterData) count is incremented.
// For safety, receiver cells are claimed with a CAS.  For speed, counter updates are not.
// Duplicate receiver allocation is possible due to races, but this is unlikely.
// Occasional races on counters may introduce inconsequential noise.
//
// Here is pseudocode for the emitted assembly:
//
//   int i, *cntp;
//   for (i = 0; i < receiver_count(); i++) {  // optimistic loop
//     if (receiver(i) == recv)   goto found_recv;
//   }
//   for (i = 0; i < receiver_count(); i++) {  // allocating loop
//     if (receiver(i) == null) { <attempt CAS of recv from null into receiver(i)> }
//     if (receiver(i) == recv)  goto found_recv;
//   }
//   cntp = &count();  // shared poly count, used if no match for recv
//   goto count_update;
//   found_recv:
//   cntp = &receiver_count(i);
//   count_update:
//   ++*cntp;
//
void MacroAssembler::receiver_type_profile…

I did wonder if the `mdp`/`mdp_offset` pair would be better expressed by an `Address`.
Note that `Address::plus_disp` would let you move the cursor around.  But then you wouldn't
have such fine control over the x86 address scaling feature, which apparently contributes
to the compactness of the code.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4812:

> 4810: 
> 4811:   // Optimistic: search for already set up receiver.
> 4812:   movptr(offset, base_receiver_offset);

I wondered about using REP-CMPSQ to search the receiver array.  It would require reformatting the MDO to make the receiver klasses contiguous.  The x86 manual ORM (August 2023) cheers me down:

> Using a REP prefix with string move instructions can provide high performance in the situations described above. However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, SMPSD, SMPSQ) is not recommended for high performance. Consider using SIMD instructions instead.

I still wonder if, at some point, it will be profitable to make the receivers contiguous so we can use SIMD instructions to search them.  Probably not any time soon.

-------------

Changes requested by jrose (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25305#pullrequestreview-3489875340
PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2547626482
PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2553359799
PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2553367490