RFR: 8357258: x86: Improve receiver type profiling reliability [v8]

Wed Dec 10 08:33:32 UTC 2025

On Thu, 4 Dec 2025 19:14:43 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

> I'm curious how performance-sensitive that part of code is. Does it make sense to try to further optimize it?

This is roughly the fifth version of this code, so I don't think there is more juice to squeeze out of it. The current version is more or less optimal. The stratification into three cases looks like the best-performing option overall.

> fast path can be further optimized for no nulls case by offloading more work on found_null slow path [1]

Yeah, but putting the checks for both the installed receiver and the nullptr slot together turns out to hurt performance; this is bad even without extra control flow. Two separate loops are more efficient, even for a small number of iterations. It also helpfully optimizes for the best case, when the profile is smaller than `TypeProfileWidth`, which is what we want.
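
To make the two-loop shape concrete, here is a minimal C++ sketch. It is not the HotSpot macro-assembler code from the PR: `Row`, `Profile`, `kWidth`, `polymorphic`, and `profile_receiver` are hypothetical names, `kWidth` merely stands in for `TypeProfileWidth`, and the counters use atomic adds only to keep the sketch well-defined in C++.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for TypeProfileWidth; not the real flag.
constexpr std::size_t kWidth = 2;

// Hypothetical profile row: a receiver slot plus its hit counter.
struct Row {
    std::atomic<const void*> receiver{nullptr};
    std::atomic<std::uint64_t> count{0};
};

struct Profile {
    Row rows[kWidth];
    std::atomic<std::uint64_t> polymorphic{0};  // bumped when no row matches or is free
};

void profile_receiver(Profile& p, const void* klass) {
    // Loop 1 (fast path): look only for an already-installed receiver.
    for (Row& r : p.rows) {
        if (r.receiver.load(std::memory_order_relaxed) == klass) {
            r.count.fetch_add(1, std::memory_order_relaxed);
            return;
        }
    }
    // Loop 2 (slow path): look for an empty slot and try to claim it.
    for (Row& r : p.rows) {
        const void* seen = r.receiver.load(std::memory_order_relaxed);
        if (seen == nullptr &&
            r.receiver.compare_exchange_strong(seen, klass,
                                               std::memory_order_relaxed)) {
            r.count.fetch_add(1, std::memory_order_relaxed);
            return;  // we installed the receiver ourselves
        }
        // Either the slot was taken already or the CAS lost a race;
        // in both cases `seen` now holds the slot's current receiver.
        if (seen == klass) {
            r.count.fetch_add(1, std::memory_order_relaxed);
            return;  // another thread installed the same receiver
        }
    }
    // Profile is full and the receiver is not in it.
    p.polymorphic.fetch_add(1, std::memory_order_relaxed);
}
```

In this sketch the first loop carries no null checks at all, so a call site whose receiver is already installed never touches the install logic.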

>  2 slots is the most common case; any benefits from optimizing specifically for it (e.g., unroll the loops)?

I don't think it is worth the extra complexity, honestly. The loop-y code in the current version is still a significant code density win over the decision-tree (effectively unrolled) approach we use today. Keeping this thing simple means more reliability and less testing surface, plus much less headache when porting to other architectures.

Note that the goal of this work is to _improve profiling reliability_, hopefully without ceding too much ground in code density and performance. When I started out, it was not clear whether that was doable, given the need for atomics, but it does look doable. So I think we should call this thing done and move on to solving the actual performance problem in this code: the contention on counter updates.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25305#issuecomment-3635936502