RFR: 8357258: x86: Improve receiver type profiling reliability [v8]

Thu Dec 18 15:27:01 UTC 2025

On Wed, 10 Dec 2025 08:28:47 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> Overall, looks good to me. Nice work, Aleksey!
>> 
>> I'm curious how performance-sensitive that part of code is. Does it make sense to try to further optimize it?
>> 
>> For example:
>>   - 2 slots is the most common case; any benefits from optimizing specifically for it (e.g., unroll the loops)?
>>   -  fast path can be further optimized for no nulls case by offloading more work on found_null slow path [1]
>> 
>> [1]
>> 
>>     // Fastest: receiver is already installed
>>     int i = 0;
>>     for (; i < receiver_count(); i++) {
>>       if (receiver(i) == recv) goto found_recv(i);
>>       if (receiver(i) == null) goto found_null(i);
>>     }
>>   
>>     goto polymorphic
>>   
>>     // Slow: try to install receiver
>>   found_null(i):
>>     // Finish the search
>>     for (int j = i ; j < receiver_count(); j++) {
>>       if (receiver(j) == recv) goto found_recv(j);
>>     }
>>     CAS(&receiver(i), null, recv);
>>     goto restart
>> ...
>
>> I'm curious how performance-sensitive that part of code is. Does it make sense to try to further optimize it?
> 
> This is roughly the fifth version of this code, so I don't think there is more juice to squeeze out of it. The current version is more or less optimal. The stratification into three cases looks like the best performing one overall.
> 
>> fast path can be further optimized for no nulls case by offloading more work on found_null slow path [1]
> 
> Yeah, but putting the checks for both an installed receiver and a nullptr slot into the same loop turns out to hurt performance; this is bad even without the extra control flow. Two separate loops are more efficient, even for a small number of iterations. It also helpfully optimizes for the best case, when the profile is smaller than `TypeProfileWidth`, which is what we want.
> 
>>  2 slots is the most common case; any benefits from optimizing specifically for it (e.g., unroll the loops)?
> 
> I don't think it is worth the extra complexity, honestly. The loop-y code in the current version is still a significant code density win over the decision-tree (effectively unrolled) approach we are doing currently. Keeping this thing simple means more reliability and less testing surface, plus much less headache when porting to other architectures.
> 
> Note that the goal for this work is to _improve profiling reliability_, hopefully without ceding too much ground in code density and performance. When I started out, it was not clear whether this was doable, given the need for atomics; but it does look doable. So I think we should call this thing done and move on to solving the actual performance problem in this code: the contention on counter updates.

> Hi @shipilev , are you aware of anyone working on or planning to implement the same for AArch64 by any chance?

I'll task one of our folks with it after the New Year break.

Speaking of which, I will integrate this one after the New Year break as well, to avoid dealing with any possible fallout during the holidays.
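
For readers following the thread, the "two separate loops" shape discussed above can be sketched in plain C++ roughly as follows. This is only an illustration under assumptions, not the code in the PR (which targets x86 code generation): `ReceiverProfile`, `kTypeProfileWidth`, and `profile_receiver` are hypothetical names, the width is fixed at 2 for the example, and relaxed atomics are an assumed choice.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for TypeProfileWidth; fixed at 2 for this sketch.
constexpr std::size_t kTypeProfileWidth = 2;

// Hypothetical profile row: receiver slots, per-slot counters, and a
// counter for calls that did not fit into any slot.
struct ReceiverProfile {
  std::array<std::atomic<const void*>, kTypeProfileWidth> receivers{};
  std::array<std::atomic<std::uint64_t>, kTypeProfileWidth> counts{};
  std::atomic<std::uint64_t> polymorphic{0};
};

// Records one observation of receiver type `recv` (must be non-null,
// because nullptr marks an empty slot).
void profile_receiver(ReceiverProfile& p, const void* recv) {
  assert(recv != nullptr);
  for (;;) {
    // Loop 1 (fastest): the receiver is already installed.
    for (std::size_t i = 0; i < kTypeProfileWidth; i++) {
      if (p.receivers[i].load(std::memory_order_relaxed) == recv) {
        p.counts[i].fetch_add(1, std::memory_order_relaxed);
        return;
      }
    }

    // Loop 2: receiver not installed; look for an empty slot.
    std::size_t i = 0;
    for (; i < kTypeProfileWidth; i++) {
      if (p.receivers[i].load(std::memory_order_relaxed) == nullptr) break;
    }

    // No empty slot: the profile is full, count the call as polymorphic.
    if (i == kTypeProfileWidth) {
      p.polymorphic.fetch_add(1, std::memory_order_relaxed);
      return;
    }

    // Slow: try to claim the empty slot, then restart. If another thread
    // won the race, the restart either finds our receiver or moves on.
    const void* expected = nullptr;
    p.receivers[i].compare_exchange_strong(expected, recv,
                                           std::memory_order_relaxed);
  }
}
```

Each loop carries a single comparison, and the install attempt happens at most once per slot, so the common "already installed" case never touches the null check or the CAS.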

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25305#issuecomment-3670834495