[crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration

Dan Heidinga heidinga at openjdk.org
Wed Feb 1 14:34:31 UTC 2023


On Mon, 30 Jan 2023 02:44:23 GMT, Jan Kratochvil <duke at openjdk.org> wrote:

> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash.
> 
> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored.
> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC
> 
> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option:
> 
>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine
> 
> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run:
> 
>> Error occurred during initialization of VM
>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi
> 
> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable.
> 
> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621
> 
> That IMO does not preclude trying the same for this case.
> 
> - Debian 11 x86_64: It does not work, glibc is too different and inlined there.
> - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default.
> - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded.

This is a nice piece of engineering to make it work.  I'm concerned about the maintainability and correctness of the approach though, when running on different systems / glibcs and over time as glibc is recompiled.

Ashutosh suggested an alternative approach back in Feb 2022 [0] [1] of providing a single "-XX:[+-]PortableCode" with a common enough baseline set of features and the use of GLIBC_TUNABLES to control the glibc bindings.  

This is very similar to the approach the OpenJ9 team has taken in their CRIU efforts and have found that it works reasonable well for the use cases they have targeted (mostly restore inside a container).

[0] http://cr.openjdk.java.net/~heidinga/crac/Portability_of_checkpoints.pdf
[1] https://mail.openjdk.org/pipermail/crac-dev/2022-February/000114.html

src/hotspot/os/linux/os_linux.cpp line 6050:

> 6048:         unsigned long l = strtoul(name, &endptr, 10);
> 6049:         pid_t ent_tid = l;
> 6050:         if (l >= LONG_MAX || ent_tid != (long) l) {

`l` is an `unsigned long` and I think `LONG_MAX` is a signed value as there's also a `ULONG_MAX`.  Are we using this to validate that `l` is a positive value?  Otherwise, I'm not sure I understand this condition.

src/hotspot/os/linux/os_linux.cpp line 6074:

> 6072:   static pthread_mutex_t signalled_mutex;
> 6073:   static pthread_cond_t signalled_cond;
> 6074:   static size_t signalled;

The above `template<class Callback> bool all_threads` also uses a "signalled" local variable.  It might be clearer to rename one of these so future readers don't have to think about scoping and shadowing rules

src/hotspot/os/linux/os_linux.cpp line 6086:

> 6084:     err = pthread_cond_signal(&signalled_cond);
> 6085:     assert(!err, "pthread error");
> 6086:     pthread_mutex_t unused_mutex = PTHREAD_MUTEX_INITIALIZER;

Waiting on a local mutex deserves a comment to explain the approach.  I think we're waiting until the thaw operation signals the resume condition but better to be explicit in the code on the intended interactions between the waiters

src/hotspot/os/linux/os_linux.cpp line 6177:

> 6175:   assert(sig == RESTORE_SIGNAL, "got what requested");
> 6176: 
> 6177:   linux_ifunc_reset();

Assuming we're freezing all the threads to allow resetting the linux_ifuncs, it would be good to comment explicitly about why that's the right approach and/or necessary.

Also, what if we stopped a thread that was about to be call one of these ifuncs - that had already loaded the old address into a register - before we did the reset?  Wouldn't it still have the wrong value and with the right (wrong?) thread scheduling, still abort after the restore?

Basically, I think there is probably a race here still.

src/hotspot/os/linux/os_linux_ifunc.cpp line 43:

> 41: // These offsets are for Debian 12 x86_64:
> 42: // gdb -batch /lib64/ld-linux-x86-64.so.2 -ex
> 43: static unsigned   l_scope_offset = 0x3b0;

Committing this into the repo seems very suspect unless we have some guarantee that these offsets are stable.  Is there such a guarantee?

src/hotspot/os/linux/os_linux_ifunc.cpp line 145:

> 143:   unsigned sht;
> 144: };
> 145: static int symtab_lookup_iterate_phdr(struct dl_phdr_info *info, size_t size, void *data_voidp) {

This seems like some really cool "I made it work"-style hacking and something that would be utterly unsupportable long term.

I'd be pretty concerned if we merged this approach into the repo in this form.  If glibc provided an API to reset the ifunc bindings, then I'd be way more comfortable with the approach.

-------------

PR: https://git.openjdk.org/crac/pull/41


More information about the crac-dev mailing list