RFR: Store cpu features in AOTCodeCache header
Ashutosh Mehra
asmehra at openjdk.org
Mon Jul 14 18:50:59 UTC 2025
On Mon, 14 Jul 2025 08:35:42 GMT, Radim Vansa <rvansa at openjdk.org> wrote:
>> @jankratochvil I'm not disagreeing - merely wanting to see where CRaC goes wrong and why (noting that it is a slightly different circumstance to the AOT Cache case).
>>
>> My preferred position would be that there /ought not/ to be any issues reusing code generated during the AOT Cache assembly phase in the absence of some given feature in combination with code generated in a production run where that feature is present, i.e. that any incompatibility would constitute an error in the code generation scheme. I'd be very interested in any evidence you can provide to show that 1) such a code generation error exists or, worse, 2) my preference is unrealizable.
>
> I can't find the logs from when I was investigating the issue, but AFAIR https://github.com/openjdk/crac/pull/103 was motivated by a bug that happened in a compiler thread; it was going through some code that calculated the buffer size for output code based on the availability of CPU features, and then it went on to actually write the instructions. When the checkpoint happened in the middle of this and the CPU got changed (we got a 'better' CPU), the decision in this codepath changed, which resulted in a buffer overrun.
> So it was rather a synchronization problem: some code was written assuming that the CPU features are runtime-constant, but these are not. There is certainly space for a better solution, but we would have to track through some complex code and make sure that it works on a 'snapshot' of features.
@rvansa
> it was going through some code that calculated buffer size for output code based on the availability of CPU features, and then it went to actually write down the instructions. When the checkpoint happened in the middle of this and the CPU got changed (we got a 'better' CPU) the decision in this codepath was changed, and resulted in a buffer overrun.
I can imagine this happening in the context of checkpoint-snapshot, but I don't think AOTCodeCache can hit this buffer overrun issue. Code generation is not suspended and resumed in the Leyden workflow: once code generation starts, it always completes before the JVM exits the assembly phase.
> So it was rather a synchronization problem: some code was written assuming that the CPU features are runtime-constant, but these are not. There is certainly space for a better solution, but we would have to track through some complex code and make sure that it works on a 'snapshot' of features.
Other than this buffer overrun problem, have you come across any other code that relies on the assumption that CPU features are runtime-constant?
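
To make the failure mode under discussion concrete, here is a minimal sketch of the pattern as I understand it from the description above. It is not HotSpot or CRaC code: `CpuFeatures`, `g_features`, `stub_size`, `emit_stub`, the `between` hook and the byte counts are all hypothetical names and numbers chosen for illustration. The racy variant reads the global feature set once to size the buffer and again to emit; the snapshot variant copies the features once and uses that copy for both steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the VM's global CPU feature flags (not HotSpot code).
struct CpuFeatures {
    bool has_avx512 = false;
};

// Global feature set that, in a checkpoint/restore scenario, may change between two reads.
static CpuFeatures g_features;

// Bytes needed for a made-up stub; the sizes 4 and 8 are invented for illustration.
std::size_t stub_size(const CpuFeatures& f) { return f.has_avx512 ? 8 : 4; }

// Emits the stub into buf. Returns the number of bytes written,
// or 0 if the stub would not fit (where unchecked code would overrun).
std::size_t emit_stub(const CpuFeatures& f, std::vector<std::uint8_t>& buf) {
    const std::size_t n = stub_size(f);
    if (n > buf.size()) return 0;
    for (std::size_t i = 0; i < n; ++i) buf[i] = 0x90;  // placeholder bytes
    return n;
}

// Racy pattern: sizing and emission each read the global separately.
// 'between' simulates a restore on a different CPU happening in the middle.
bool generate_racy(void (*between)()) {
    std::vector<std::uint8_t> buf(stub_size(g_features));
    if (between) between();
    return emit_stub(g_features, buf) != 0;
}

// Snapshot pattern: one copy of the features drives both sizing and emission.
bool generate_with_snapshot(void (*between)()) {
    const CpuFeatures snapshot = g_features;
    std::vector<std::uint8_t> buf(stub_size(snapshot));
    if (between) between();
    const std::size_t written = emit_stub(snapshot, buf);
    assert(written == buf.size());  // sizing and emission agree by construction
    return written != 0;
}

// Helpers that simulate the CPU changing, and resetting it for the next run.
void upgrade_cpu() { g_features.has_avx512 = true; }
void reset_cpu() { g_features.has_avx512 = false; }
```

With `upgrade_cpu` passed as the hook, `generate_racy` sizes the buffer for the old feature set and then fails to fit the larger stub, while `generate_with_snapshot` is unaffected. If nothing changes the features between the two steps (a null hook here), the racy variant succeeds as well; that case corresponds to the point above that code generation in the assembly phase always runs to completion.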
-------------
PR Review Comment: https://git.openjdk.org/leyden/pull/84#discussion_r2205588977