[crac] RFR: PID adjustment on checkpoint [v3]

Roman Marchenko rmarchenko at openjdk.org
Fri Jun 23 13:03:51 UTC 2023


On Thu, 22 Jun 2023 13:48:33 GMT, Radim Vansa <rvansa at openjdk.org> wrote:

>> I did some experiments with PID spinning and a desired PID value that exceeds `pid_max`. Spinning until the PID counter overflows takes too long. If a user sets a wrong value, it looks as if java hangs, and the user is unlikely to wait long enough to see the error message. The same holds for a valid but large desired PID value, e.g. 2_000_000.
>> 
>> We could remove the `waitpid()` call to speed up PID spinning, but without it we can easily hit the container's resource limits (I tested this), so removing `waitpid()` is not straightforward.
>> 
>> To avoid reading `kernel/pid_max` and to avoid hanging while spinning PIDs, we could introduce a maximum number of spin tries, say 10_000. If we reach this limit while spinning, we'd stop and continue running Java with the PID reached so far. It seems doubtful that users would want to move the PID to 2M starting from PID=1 or 8 in a container. If users have some processes running in their container, on checkpoint they'd adjust the desired PID value according to the state of the container, limited by the max try count we are introducing. This solution seems portable across POSIX-like platforms.
>> 
>> Or, since things are becoming so complicated, it'd be easier to read `pid_max`, though only on Linux.
>> 
>> Please note that I'm talking about PID spinning, i.e. the case when writing to `ns_last_pid` hasn't worked for some reason.
>> 
>> Are there any additional pros/cons?
>
> I agree that setting min pid to 2M is unlikely to be practical. Spinning a fixed number of times is simpler, but if anything like this happens it would make sense to print a warning message (e.g. "We are cycling PIDs due to -XX:CRMinPID=200000") after a certain elapsed time (3 seconds?) rather than after just a number of iterations. Then the user can decide to cancel. Please make sure the stream gets flushed (in case this is written to a file or so...).

Implemented PID spinning up to 10000 times. The 10000 is hardcoded for now; we can introduce an option if needed in the future.
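
For readers following the thread, here is a rough sketch of what bounded PID spinning could look like. It is not the code from the PR: it is written in Python for brevity, and the names `spin_pid`, `MAX_SPIN_TRIES` and `warn_after` are made up for illustration. It forks a short-lived child, reaps it with `waitpid()` so that zombies do not pile up against container limits, and stops once a PID at or above the requested minimum was allocated or the try limit is hit. The optional warning after an elapsed time, with an explicit flush, follows the suggestion quoted above.

```python
import os
import sys
import time

# Hypothetical constant; the thread mentions a hardcoded limit of 10000 tries.
MAX_SPIN_TRIES = 10_000


def spin_pid(min_pid, max_tries=MAX_SPIN_TRIES, warn_after=3.0, out=sys.stderr):
    """Advance the PID counter by forking short-lived children.

    Returns (last_pid, tries): the last PID seen and the number of forks done.
    Stops early once a child PID >= min_pid has been allocated.
    """
    last_pid = os.getpid()
    tries = 0
    warned = False
    start = time.monotonic()

    while last_pid < min_pid and tries < max_tries:
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately without running any cleanup handlers.
            os._exit(0)
        # Parent: reap the child right away so no zombie processes accumulate.
        os.waitpid(pid, 0)
        last_pid = pid
        tries += 1

        # Warn once if spinning takes long, and flush so the message is
        # visible even when the stream is redirected to a file.
        if not warned and time.monotonic() - start > warn_after:
            print(f"Cycling PIDs to reach at least {min_pid}, please wait",
                  file=out, flush=True)
            warned = True

    return last_pid, tries
```

With a limit in place, a desired PID that can never be reached (for example one above `pid_max`) costs at most `max_tries` forks instead of looking like a hang; the caller simply continues with whatever PID was reached.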

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1239787503


More information about the crac-dev mailing list