[crac] RFR: 8355974: [CRaC] Move CPUFeatures verification to the parent process of JVM [v11]

Timofei Pushkin tpushkin at openjdk.org
Tue May 13 19:18:09 UTC 2025


On Tue, 13 May 2025 14:46:30 GMT, Jan Kratochvil <jkratochvil at openjdk.org> wrote:

>> There was originally a mistake:
>> - restoring JVM did restore the image
>> - the restored JVM started checking whether CPU Features of the new host >= CPU Features of the checkpoint host
>> 
>> That is difficult as glibc is already configured (IFUNC) in the image for the checkpoint host and calling any such glibc functions in the restored image will crash (as the advanced instructions from misconfigured IFUNC are not available). Some glibc functions had to be reimplemented in a dummy way inside JVM due to this misdesign.
>> 
>> This patch changes it to:
>> - restoring JVM checks `cpufeatures` user data in the image against current CPU Features
>> - the restored JVM is started only if the CPU Features are satisfied, restored JVM no longer has to verify anything
>> 
>> The patch is a bit of a kitchen sink, there are various improvements of the CPU Features code.
>
> Jan Kratochvil has updated the pull request incrementally with six additional commits since the last revision:
> 
>  - Fix LoggingVMlogOpenTestNegative.java failing in Github CI
>  - Remove S3 support
>  - Revert "Reimplement S3"
>    
>    This reverts commit 719e7568262f87d7367c6656827ece796e35cadb.
>  - Revert "More S3 fixes but it is untested and for 99.9% not yet working"
>    
>    This reverts commit 5ce33434f66913f83b53941b4f9aff8d012f6372.
>  - More S3 fixes but it is untested and for 99.9% not yet working
>  - Reimplement S3

src/hotspot/cpu/x86/vm_version_x86.cpp line 2618:

> 2616:     warning("-XX:CRaCRestoreFrom is mutually exclusive with -XX:CPUFeatures; specify -XX:CPUFeatures during -XX:CRaCCheckpointTo");
> 2617:     return false;
> 2618:   }

This should be guaranteed by the arguments parsing since `CPUFeatures` is not restore-settable
Suggestion:

  assert(CPUFeatures == nullptr, "This should only be called on restore and CPUFeatures is not restore-settable");

src/hotspot/cpu/x86/vm_version_x86.cpp line 2645:

> 2643:     missing_features(features_missing, glibc_features_missing);
> 2644:     if (!IgnoreCPUFeatures) {
> 2645:       vm_exit_during_initialization();

We can just return false here now

src/hotspot/share/runtime/crac.cpp line 133:

> 131:         "with the selected CRaC engine");
> 132:       break;
> 133:   }

A UX improvement: if the user asks to ignore CPU features or the platform doesn't support them the user should not see warnings/errors about the engine not being able to store them.

Suggestion:

  if (!VM_Version::ignore_cpu_features()) {
    VM_Version::CPUFeaturesBinary data;
    if (VM_Version::cpu_features_binary(&data)) {
      switch (_engine->prepare_user_data_api()) {
        case CracEngine::ApiStatus::OK:
          if (!_engine->cpufeatures_store(&data)) {
            return JVM_CHECKPOINT_ERROR;
          }
          break;
        case CracEngine::ApiStatus::ERR:
          return JVM_CHECKPOINT_ERROR;
        case CracEngine::ApiStatus::UNSUPPORTED:
          log_warning(crac)("Cannot store CPUFeatures for checkpoint "
            "with the selected CRaC engine");
          break;
      }
    }
  }

src/hotspot/share/runtime/crac.cpp line 523:

> 521:   }
> 522: 
> 523:   if (!VM_Version::ignore_cpu_features()) {

`CPUFeatures` is not restore-settable and `IgnoreCPUFeatures`, rather unexpectedly, does not influence VM_Version::ignore_cpu_features(), so the user will see the warning/errors below even if they knowingly don't want to use CPU features (e.g. their engine doesn't support this).

But AFAIR we decided to postpone the `CPUFeatures=ignore`/`IgnoreCPUFeatures` cleanup and this can be considered a part of that, so I'm ok with leaving this as is for now.

src/hotspot/share/runtime/crac_engine.cpp line 439:

> 437: bool CracEngine::cpufeatures_store(const VM_Version::CPUFeaturesBinary *datap) const {
> 438:   log_debug(crac)("cpufeatures_store user data %s to %s...", cpufeatures_userdata_name, CRaCRestoreFrom);
> 439:   return _user_data_api->set_user_data(_conf, cpufeatures_userdata_name, datap, sizeof(*datap));

There's currently no error logging (neither here nor in the caller)
Suggestion:

  const bool ok = _user_data_api->set_user_data(_conf, cpufeatures_userdata_name, datap, sizeof(*datap));
  if (!ok) {
    log_error(crac)("CRaC engine failed to store user data %s", cpufeatures_userdata_name);
  }
  return ok;

src/java.base/share/native/libcrexec/crexec.cpp line 124:

> 122:   &restore_data_extension.header,
> 123:   &description_extension.header,
> 124:   &user_data_extension.header,

It is better to sort extensions by importance (more important first) for better performance since they are searched linearly by comparing names. Description should be the last one because it is currently only used for printing help and exiting so the speed doesn't matter.

Shouldn't matter much with such small amount of extensions but still.

src/java.base/share/native/libcrexec/crexec.cpp line 548:

> 546: };
> 547: 
> 548: static bool set_user_data(crlib_conf_t *conf, const char *name, const void *data, size_t size) {

A nitpick: you placed the user-data-related code between `Environment` class and `checkpoint`/`restore` functions which use the class — it makes the file a bit harder to read. A better place would be `get_restore_data` and `get_extension` functions (the extensions; functions are all placed above `get_extension`, grouped by extension).

src/java.base/share/native/libcrexec/crexec.cpp line 559:

> 557:   }
> 558:   FILE *f = fopen(fname, "w");
> 559:   if (f == NULL) {

A nitpick, would also be great to replace in the other places in the file
Suggestion:

  if (f == nullptr) {

src/java.base/share/native/libcrexec/crexec.cpp line 578:

> 576:   }
> 577:   if (fclose(f)) {
> 578:     fprintf(stderr, CREXEC "cannot write to %s: %s\n", fname, strerror(errno));

Suggestion:

    fprintf(stderr, CREXEC "cannot close %s: %s\n", fname, strerror(errno));

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087363676
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087359222
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087372826
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087410841
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087298409
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087439546
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087451640
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087467814
PR Review Comment: https://git.openjdk.org/crac/pull/227#discussion_r2087468198


More information about the crac-dev mailing list