Problems with /var/lib/sss/mc/passwd

Jack Koenig jack.koenig3 at gmail.com
Fri May 19 20:58:16 UTC 2023


Hello Radim,

Thank you for your response, and sorry for breaking the thread--I had
digests on and cannot figure out how to set "In-Reply-To" from Gmail.

`-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` sounds like
exactly what I need. Unfortunately, it doesn't seem to work in this
case: with it set I get the exact same error, and I have no idea why.
I have tried to reproduce this in both CentOS and Ubuntu Docker
containers but have been unsuccessful--the circumstances that lead to
this situation are beyond my Linux knowledge.

In any case, I was able to make forward progress by using gdb to force
close the file descriptor (lol). For anyone in the future who comes
across this thread, you can just determine the PID of the process you
wish to checkpoint, and determine the file descriptor number for
/var/lib/sss/mc/passwd (for me it was always 4 which is interesting),
then do the following:
$ gdb -p <pid>
(gdb) call (int)close(<fd>)
(gdb) quit

After force closing the file descriptor I was able to take a checkpoint.
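
For the "determine the file descriptor number" step above, one possible
way is to look through /proc/<pid>/fd, where each entry is a symlink
named after the descriptor number. This is only a minimal sketch: the
function name find_fd is made up for illustration, and it assumes you
have permission to inspect the target process.

```shell
# Print every descriptor number of process $1 whose target is the path $2.
# Reads /proc/<pid>/fd, so it needs permission to inspect that process.
find_fd() {
    pid=$1
    target=$2
    for link in /proc/"$pid"/fd/*; do
        # Each entry is a symlink named after the descriptor number.
        if [ "$(readlink "$link" 2>/dev/null)" = "$target" ]; then
            basename "$link"
        fi
    done
}

# Usage: find_fd <pid> /var/lib/sss/mc/passwd
```

The number it prints is what goes in place of <fd> in the gdb call.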

Now, with a successful checkpoint I then tried to restore from the
checkpoint and failed with:

Error (criu/cr-restore.c:1335): Failed to write 897973 to
/proc/sys/kernel/ns_last_pid: Operation not permitted
Error (criu/cr-restore.c:1506): Can't fork for 897974: Operation not permitted
Error (criu/cr-restore.c:2593): Restoring FAILED.
Error (criu/cr-restore.c:1823): Pid 915630 do not match expected 897974
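
As far as I understand it, CRIU writes to /proc/sys/kernel/ns_last_pid
so that the restored process gets its original PID. The kernel only
allows that write with CAP_SYS_ADMIN (or CAP_CHECKPOINT_RESTORE on
Linux 5.9 and later) in the user namespace that owns the PID namespace.
A rough sketch for checking what the current shell holds follows. The
helper name has_cap is made up, and the mask only describes
capabilities in the shell's own user namespace, so treat the answer as
a hint rather than a guarantee.

```shell
# Return success if bit number $2 is set in the hexadecimal capability mask $1.
has_cap() {
    mask=$1
    bit=$2
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

# Effective capability mask of the current shell, as printed by the kernel.
cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)
cap_eff=${cap_eff:-0}

# 21 is CAP_SYS_ADMIN; 40 is CAP_CHECKPOINT_RESTORE (Linux 5.9 and later).
if has_cap "$cap_eff" 21 || has_cap "$cap_eff" 40; then
    echo "this shell holds CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE"
else
    echo "this shell lacks CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE"
fi
```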

Since my goal is to create many processes from the same checkpoint,
needing the same PID is going to be problematic, so I've started
trying to see if I can use unshare to create a namespace.

When I create a new namespace with:
unshare -mrp --mount-proc --fork

And then run the process I wish to checkpoint, to my pleasant
surprise, /var/lib/sss/mc/passwd is not open, so this seems to
coincidentally solve that issue.

However, I am not able to create a checkpoint. When I run
`jcmd <pid> JDK.checkpoint` I get:

JVM: invalid info for restore provided: queued code -1
An exception during a checkpoint operation:
jdk.internal.crac.CheckpointException
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
        at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
        at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)

The error isn't super precise, but I suspect the issue is that jcmd
cannot find the process: if I run `jcmd -l`, nothing shows up. Note
that I am running this jcmd in the same namespace, but clearly I have
done something wrong.

If I try to create a checkpoint from outside the namespace using the
real PID, the process prints a stack trace and the checkpoint fails
with:

com.sun.tools.attach.AttachNotSupportedException: Unable to open
socket file: target process not responding or HotSpot VM not loaded
        at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
        at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
        at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
        at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
        at sun.tools.jcmd.JCmd.main(JCmd.java:131)
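
To compare the two views of the process mentioned above (the PID inside
the namespace and the "real" PID outside it), Linux 4.1 and later
expose both in the NSpid field of /proc/<pid>/status. The sketch below
uses made-up function names and only reports the mapping; it is not a
fix for the attach failure.

```shell
# PID of process $1 as seen from the namespace of the /proc being read.
outer_pid() {
    awk '/^NSpid:/ { print $2 }' "/proc/$1/status"
}

# PID of process $1 as seen from its own (innermost) PID namespace.
inner_pid() {
    awk '/^NSpid:/ { print $NF }' "/proc/$1/status"
}

# Usage, run from outside the namespace with the real PID:
#   outer_pid <real-pid>
#   inner_pid <real-pid>
```

If the two numbers differ, then the PID the JVM sees for itself is not
the one being passed to jcmd from outside, which may or may not be
related to the attach error.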

Does anyone have any experience here? Is this approach of using
unshare to create a new namespace going in the right direction?

Thank you!
Jack

On Thu, 18 May 2023 11:58:04 +0200 Radim Vansa <rvansa at azul.com> wrote:
>
> Hello Jack,
>
> the proper venue could be the Foojay.io forums [1] (yes, only recently
> created) or #crac channel on Foojay slack, but this list will do :)
>
> Can you try running the checkpoint with
> `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` ? This should
> bypass the checks, though problems may arise on restore if this file
> changes when the application is in checkpoint.
>
> Radim
>
> [1]
> https://forums.foojay.io/forums/forum/coordinated-restore-at-checkpoint-crac/
>
> On 18. 05. 23 3:37, Jack Koenig wrote:
> >
> > Hello everyone,
> >
> > This is more of a user question, so I apologize if this is the wrong
> > venue--please direct me to the right place as appropriate.
> >
> > I am attempting to checkpoint my application but I get an exception
> > saying that /var/lib/sss/mc/passwd is open:
> >
> > An exception during a checkpoint operation:
> > jdk.internal.crac.CheckpointException
> >         at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
> >         at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
> >         at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
> >         Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd
> >                 at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87)
> >                 at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
> >                 ... 2 more
> >
> > The only thing I've found mentioning a similar issue is this old
> > thread:
> > https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html
> >
> > The workaround posted there involves system-level configuration
> > changes, but I am an unprivileged user on a shared RHEL8 machine so
> > cannot apply such a workaround.
> >
> > Is there anything I can do to resolve or at least work around this issue?
> >
> > Cheers,
> > Jack


More information about the crac-dev mailing list