[crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v30]

Anton Kozlov akozlov at openjdk.org
Thu Jun 8 13:33:23 UTC 2023

On Wed, 7 Jun 2023 12:57:59 GMT, Jan Kratochvil <jkratochvil at openjdk.org> wrote:

>> Currently, if you checkpoint with `-XX:CRaCCheckpointTo` on a better CPU and restore with `-XX:CRaCRestoreFrom` on a worse CPU, the restored OpenJDK will crash.
>> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored.
>> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC
>> (1) could be solved automatically by deoptimizing and re-JITing all the JIT-compiled code, but that would defeat the performance goal of restoring a ready image in the first place. Therefore a new OpenJDK option had to be implemented:
>>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine
>> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run:
>>> Error occurred during initialization of VM
>>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi
>> (2) has been implemented according to Anton Kozlov's idea that glibc can reset its IFUNC PLT entries at any later time (after restore), not only during its first initialization. The current problem is that this has turned out to be very invasive into private glibc structures. It could work with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed, but that has been considered an unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept, and I will propose such a feature for glibc upstream, where it should be easy to implement.
>> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo...
> Jan Kratochvil has updated the pull request incrementally with two additional commits since the last revision:
>  - Fix hotspot 'ht' vs. glibc 'htt'.
>  - CPUFeatures refactorization.
>    Start CPU Features checking without libc.

I hit the same error that GHA reports.


Replacing vm_exit_during_initialization() with fatal() at least provides the hs_err and the stack trace.

Stack: [0x00007f80d6900000,0x00007f80d6a00000],  sp=0x00007f80d69fd760,  free space=1013k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x191bad4]  VM_Version::glibc_not_using(unsigned long, unsigned long)+0x414
V  [libjvm.so+0x191d3a2]  VM_Version::CPUFeatures_init()+0x1d2
V  [libjvm.so+0x191d67f]  VM_Version::get_processor_features()+0xef
V  [libjvm.so+0x1921f16]  VM_Version::initialize_features()+0x546
V  [libjvm.so+0x1922359]  VM_Version::initialize()+0x9
V  [libjvm.so+0x1916756]  VM_Version_init()+0x26
V  [libjvm.so+0xd30b4c]  init_globals()+0x2c
V  [libjvm.so+0x18110e6]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x326
V  [libjvm.so+0xeb86d9]  JNI_CreateJavaVM+0x69
C  [libjli.so+0x3da4]  JavaMain+0x94
C  [libjli.so+0x7719]  ThreadJavaMain+0x9

anton@mercury:~/proj/crac$ git diff
diff --git a/src/hotspot/cpu/x86/vm_version_x86.cpp b/src/hotspot/cpu/x86/vm_version_x86.cpp
index a6a04458319..58249612750 100644
--- a/src/hotspot/cpu/x86/vm_version_x86.cpp
+++ b/src/hotspot/cpu/x86/vm_version_x86.cpp
@@ -1225,10 +1225,9 @@ void VM_Version::glibc_not_using(uint64_t excessive_CPU, uint64_t excessive_GLIB
 #ifdef ASSERT
 #define CHECK_KIND(kind) do {                                                                                   \
     if (PASTE_TOKENS(disable_handled_, kind) != PASTE_TOKENS(excessive_handled_, kind)) {                       \
-      jio_snprintf(errbuf, sizeof(errbuf),                                                                      \
+      fatal( \
                    "internal error: Unsupported disabling of " STR(kind) "_* 0x%" PRIx64 " != used 0x%" PRIx64, \
                    PASTE_TOKENS(disable_handled_, kind), PASTE_TOKENS(excessive_handled_, kind));               \
-      vm_exit_during_initialization(errbuf);                                                                    \
     }                                                                                                           \
   } while (0)

I'm not 100% sure that calling fatal() is correct this early in VM initialization, so I propose a vararg macro or function that expands to fatal(...) for now and can easily be replaced with something different later.
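
A minimal sketch of such a wrapper, written so that it builds outside HotSpot. All names here (fatal_standin, cpu_features_reporter, cpu_features_error, CPU_FEATURES_ERROR) are hypothetical and not part of the patch; fatal_standin only stands in for the real fatal(), and the buffer size is an arbitrary assumption.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for HotSpot's fatal(), so the sketch builds outside the JVM.
[[noreturn]] static void fatal_standin(const char* msg) {
  std::fprintf(stderr, "fatal: %s\n", msg);
  std::abort();
}

// The reporting backend lives in one place; pointing it at something else
// (for example a vm_exit_during_initialization()-style handler) changes
// every call site at once.
typedef void (*cpu_features_reporter_t)(const char* msg);
static cpu_features_reporter_t cpu_features_reporter = fatal_standin;

// Formats the message printf-style and hands it to the current backend.
static void cpu_features_error(const char* fmt, ...) {
  assert(fmt != nullptr);
  char buf[512];  // assumed large enough for these messages; longer text is truncated
  va_list ap;
  va_start(ap, fmt);
  std::vsnprintf(buf, sizeof(buf), fmt, ap);
  va_end(ap);
  cpu_features_reporter(buf);
}

// Vararg macro used at the call sites instead of calling fatal() directly.
#define CPU_FEATURES_ERROR(...) cpu_features_error(__VA_ARGS__)
```

With something like this, CHECK_KIND would call CPU_FEATURES_ERROR("internal error: ... 0x%" PRIx64 " != used 0x%" PRIx64, a, b) and would not need to know whether the result is an hs_err with a stack trace or a plain initialization error.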


PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1469711629