[crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v2]

Ashutosh Mehra duke at openjdk.org
Thu Feb 9 16:41:34 UTC 2023


On Thu, 2 Feb 2023 06:06:28 GMT, Jan Kratochvil <duke at openjdk.org> wrote:

>> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash.
>> 
>> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored.
>> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC
>> 
>> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option:
>> 
>>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine
>> 
>> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run:
>> 
>>> Error occurred during initialization of VM
>>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi
>> 
>> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable.
>> 
>> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621
>> 
>> That IMO does not preclude trying the same for this case.
>> 
>> - Debian 11 x86_64: It does not work, glibc is too different and inlined there.
>> - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default.
>> - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded.
>
> Jan Kratochvil has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - +comment; found by Dan Heidinga.
>  - Rename signalled->retry; found by Dan Heidinga.

IIUC `CPUFeatures=0` disables all the CPU features, which was never the intention behind the PortableCode option. With `PortableCode` we wanted to enable a baseline set of features that would cater to the most commonly used microarchitectures.

Selection of the default CPU features is a trade-off between portability and performance. Currently the CPU features are selected based on the native architecture, which is one extreme, as it makes the generated code less portable (unless you run on a sufficiently old arch). CPUFeatures=0 is the other extreme: it would make the code portable but is highly likely to degrade performance. PortableCode lies somewhere in between.

In this context gcc provides a good analogy. It does not target the i386 instruction set by default just to make the code more portable. The default target arch used by gcc is x86-64, which corresponds to "A generic CPU with 64-bit extensions". And if you look at the features that are enabled by default, you will find the sse and sse2 extended instruction sets in there [0].

Now, I also understand that CPUFeatures=<n> can be used to achieve the same effect that the PortableCode option would. However, with PortableCode the default set of features would be selected automatically by the JVM; the user doesn't need to figure out the value to pass to CPUFeatures to make the code portable.
And as Dan mentioned earlier, even with PortableCode, if the user wants to enable a particular extended instruction set, existing HotSpot options can be used.
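
To illustrate what "figuring out the value" involves, here is a minimal sketch. It assumes the feature masks are plain 64-bit bitmasks, as they appear in the error message quoted above, and that the two masks from that message stand in for a two-machine farm. The class and method names are hypothetical and are not part of the patch.

```java
import java.util.List;

// Hypothetical helper, not part of OpenJDK or the CRaC patch.
public class CpuFeatureMask {
    // Lowest common denominator: bits set on every host in the farm.
    static long commonFeatures(List<Long> hostMasks) {
        if (hostMasks.isEmpty()) {
            throw new IllegalArgumentException("need at least one host mask");
        }
        long common = -1L; // all bits set; each host can only clear bits
        for (long mask : hostMasks) {
            common &= mask;
        }
        return common;
    }

    // Bits the checkpoint image expects but the restoring host lacks.
    static long missingFeatures(long imageMask, long hostMask) {
        return imageMask & ~hostMask;
    }

    public static void main(String[] args) {
        // Masks taken from the error message quoted above.
        long checkpointHost = 0x7fff9dfcfbf7L;
        long restoreHost = 0x421801fcfbd7L;

        long common = commonFeatures(List.of(checkpointHost, restoreHost));
        // prints -XX:CPUFeatures=0x421801fcfbd7
        System.out.println("-XX:CPUFeatures=0x" + Long.toHexString(common));
        // prints missing: 0x3de79c000020
        System.out.println("missing: 0x"
                + Long.toHexString(missingFeatures(checkpointHost, restoreHost)));
    }
}
```

With CPUFeatures the user has to collect one such mask per machine and combine them by hand; with PortableCode the JVM would pick the baseline itself.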

[0] Use `gcc -Q --help=target` to see the list of options that are enabled or disabled.

-------------

PR: https://git.openjdk.org/crac/pull/41

