[crac] RFR: Move CPUFeatures verification to the parent process of JVM

Timofei Pushkin tpushkin at openjdk.org
Tue Apr 29 09:50:14 UTC 2025


On Sun, 27 Apr 2025 15:18:17 GMT, Jan Kratochvil <jkratochvil at openjdk.org> wrote:

> There was originally a mistake:
> - restoring JVM did restore the image
> - the restored JVM started checking whether CPU Features of the new host >= CPU Features of the checkpoint host
> 
> That is difficult as glibc is already configured (IFUNC) in the image for the checkpoint host and calling any such glibc functions in the restored image will crash (as the advanced instructions from misconfigured IFUNC are not available). Some glibc functions had to be reimplemented in a dummy way inside JVM due to this misdesign.
> 
> This patch changes it to:
> - restoring JVM checks `cpufeatures` user data in the image against current CPU Features
> - the restored JVM is started only if the CPU Features are satisfied, restored JVM no longer has to verify anything
> 
> The patch is a bit of a kitchen sink, there are various improvements of the CPU Features code.

I have not finished the review, mainly just looked at the `runtime` changes. Will continue later, or maybe someone else will.

src/hotspot/share/runtime/crac.hpp line 80:

> 78: 
> 79:   static bool cpufeatures_store();
> 80:   static bool cpufeatures_restore();

These can be removed since they are not implemented

src/hotspot/share/runtime/crac_engine.cpp line 414:

> 412: }
> 413: 
> 414: const char CracEngine::userdata_name[] = "cpufeatures";

Let's name this `cpufeatures_userdata_name` since it only applies to CPU features.

I would also make in a .cpp-level static since it is only used in this .cpp file and mark it `constexpr` (though in practice it probably makes no difference here), but up to you.

src/hotspot/share/runtime/crac_engine.cpp line 421:

> 419:     log_error(crac)("Installed CRaC engine does not support user_data");
> 420:   }
> 421:   return api;

Please do checks and error messages like in other API-loading functions (e.g. `prepare_description_api`) for consistency

src/hotspot/share/runtime/crac_engine.cpp line 439:

> 437: 
> 438: // Return success.
> 439: bool CracEngine::cpufeatures_restore() {

To me "restore" means that the function restores some state of the restored VM but here we just want to check if the features are compatible. So I think a better name would be `cpufeatures_check` or `cpufeatures_load_and_check`.

I would also place a verb first (`store_cpufeatures`, `check_cpufeatures`) but this is obviously a matter of preference.

src/hotspot/share/runtime/crac_engine.cpp line 444:

> 442:     // s3->set_image_bitmask did handle it already, load_user_data() is too expensive for S3.
> 443:     return true;
> 444:   }

There is no special handling for `s3://` paths in this project

src/hotspot/share/runtime/crac_engine.cpp line 452:

> 450:   if (!(user_data = user_data_api->load_user_data(_conf))) {
> 451:     return false;
> 452:   }

Just for consistency with the rest of the code:
Suggestion:

  if (!(user_data = user_data_api->load_user_data(_conf))) {
    log_error(crac)("CRaC engine failed to load user data");
    return false;
  }

src/hotspot/share/runtime/crac_engine.cpp line 461:

> 459:       return false;
> 460:     }
> 461:     assert(datap, "lookup_user_data should return non-null data pointer");

CRaC engine can be a user-provided program so maybe it is better to do a real check and log error if it fails

src/hotspot/share/runtime/crac_engine.hpp line 70:

> 68: 
> 69:   bool cpufeatures_store();
> 70:   bool cpufeatures_restore();

These can be marked `const`

src/hotspot/share/runtime/crac_engine.hpp line 82:

> 80:   static const char userdata_name[];
> 81: 
> 82:   crlib_user_data_t *user_data_api_get();

Could you please rewrite this part to work the same way as the other extensions: public `ApiStatus prepare_user_data_api()` loads and caches the API, then the dependent methods use the cached one?
- The API can be loaded multiple times when repeated checkpoint (checkpoint -> restore -> checkpoint again) is performed
- We may later add more methods that use user data API
- It makes `CracEngine` more consistent

src/hotspot/share/runtime/threads.cpp line 481:

> 479:   // Output stream module should be already initialized for error reporting during restore.
> 480:   // JDK version should also be intialized for arguments parsing.
> 481:   if (check_for_restore(args) != JNI_OK) return JNI_ERR;

Now we do almost the same thing twice on restore: here  we parse the args and check whether they are restore-settable, and then in `Arguments::parse` below we parse the args again. I'm not sure if it breaks anything (`Arguments::parse` may do the parsing a bit differently and overwrite some values with different results). If not, it's worth to at least add a TODO here to clean this up in the future.

src/hotspot/share/runtime/threads.cpp line 606:

> 604: 
> 605:   // Output stream module should be already initialized for error reporting during restore.
> 606:   // JDK version should also be intialized.

This comment should probably be updated saying that a lot of things (probably no point in listing them all) should be initialized to be able to check CPU features

src/java.base/share/man/java.md line 1838:

> 1836:     `-XX:CRaCCheckpointTo` when you get an error during `-XX:CRaCRestoreFrom`
> 1837:     on a different machine. `-XX:CPUFeatures=native` is the default.
> 1838:     `-XX:CPUFeatures=generic` is compatible with any CPU.

There is also `CPUFeatures=ignore`, what is the difference between it and `IgnoreCPUFeatures`? The fact that `IgnoreCPUFeatures` is to be set on restore?

-------------

PR Comment: https://git.openjdk.org/crac/pull/227#issuecomment-2838145657
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065737258
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065789850
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065781726
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065818710
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065797780
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065801325
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065807680
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065794912
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065763641
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065926059
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065929561
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2065946444


More information about the crac-dev mailing list