[crac] RFR: PID adjustment on checkpoint [v3]

Radim Vansa rvansa at openjdk.org
Thu Jun 22 13:53:41 UTC 2023

On Thu, 22 Jun 2023 13:28:28 GMT, Roman Marchenko <rmarchenko at openjdk.org> wrote:

>> If the value was explicitly set, I think it would be better to fail. When it's trying to get to PID 128 by default, I think it is sufficient to warn the user **and** tell them that they can switch off the warning by setting `-XX:CRMinPid=1`.
> I did some experiments with PID spinning and a desired PID value that exceeds `pid_max`. Spinning until the PID counter overflows takes too long. If a user sets a wrong value, it may look as if Java hangs, and the user is unlikely to wait long enough to see the error message. The same is true for a valid desired PID value that is large, e.g. 2_000_000.
> We could remove the `waitpid()` call to speed up PID spinning, but without it we can easily reach the container's resource limits (I tested this), so we cannot remove `waitpid()` easily.
> To avoid reading from `kernel/pid_max` and to avoid hanging on PID spinning, we could introduce a maximum number of spin tries, say 10_000. If we reach this limit while spinning PIDs, we'd stop spinning and continue running Java with the PID reached so far. It seems doubtful that users want to move the PID to 2M starting from PID=1 or 8 in a container. If users have some processes running in their container, on checkpoint they'd adjust the desired PID value according to the state of the container, limited by the max try count we are introducing. This solution seems portable across POSIX-like platforms.
> Or, since things are becoming so complicated, it'd be easier to read `pid_max`, though only on Linux.
> Please note that I'm talking about PID spinning, i.e. the case when writing to `ns_last_pid` hasn't worked for some reason.
> Are there any additional pros/cons?

I agree that setting min pid to 2M is unlikely to be practical. Spinning a fixed number of times is simpler, but if anything like this happens it would make sense to print a warning message (e.g. We are cycling PIDs due to -XX:CRMinPid=200000) after a certain elapsed time (3 seconds?) rather than after just a number of iterations. Then the user can decide to cancel. Please make sure the stream gets flushed (in case this is written to a file or so...).
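
To make the suggestion concrete, here is a rough sketch of what I have in mind, not the actual patch. The function name `spin_pids` and its parameters are made up for illustration; it only advances the PID counter by forking short-lived children, stops after `max_tries` forks, and prints one flushed warning once `warn_after_sec` seconds have elapsed. How the JVM then continues under the reached PID is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not CRaC code): fork and reap children until one of
 * them gets a PID >= min_pid or max_tries forks have been made.
 * Returns the last PID observed, or -1 if fork() fails. */
static pid_t spin_pids(pid_t min_pid, long max_tries, double warn_after_sec)
{
    assert(min_pid > 0 && max_tries >= 0);

    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    int warned = 0;
    pid_t last = getpid();

    for (long i = 0; i < max_tries && last < min_pid; i++) {
        pid_t child = fork();
        if (child < 0)
            return -1;
        if (child == 0)
            _exit(0);               /* child exits immediately */
        waitpid(child, NULL, 0);    /* reap, so zombies do not hit container limits */
        last = child;

        if (!warned) {
            clock_gettime(CLOCK_MONOTONIC, &now);
            double elapsed = (double)(now.tv_sec - start.tv_sec)
                           + (double)(now.tv_nsec - start.tv_nsec) / 1e9;
            if (elapsed >= warn_after_sec) {
                fprintf(stderr,
                        "warning: still cycling PIDs to reach %ld (currently at %ld)\n",
                        (long)min_pid, (long)last);
                fflush(stderr);     /* the stream may be redirected to a file */
                warned = 1;
            }
        }
    }
    return last;
}
```

A call like `spin_pids(128, 10000, 3.0)` would give up after 10_000 forks and warn at most once, after roughly 3 seconds. The warning is time-based on purpose: how long an iteration takes varies a lot between environments, so elapsed time is a better proxy for "the user thinks Java hangs" than an iteration count.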


PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1238559758
