Problems with /var/lib/sss/mc/passwd

Radim Vansa rvansa at azul.com
Mon May 22 08:25:00 UTC 2023


Hi,

I've replied on the forums [1]; please continue the discussion there.

Cheers, Radim

[1] https://forums.foojay.io/forums/topic/problems-with-var-lib-sss-mc-passwd/#post-138

On 19. 05. 23 22:58, Jack Koenig wrote:
> Hello Radim,
>
> Thank you for your response, and sorry for breaking the thread--I had
> digests on and cannot figure out how to set "In-Reply-To" from Gmail.
>
> `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` sounds like
> exactly what I need. Unfortunately it doesn't seem to work in this
> case: with it set I get the exact same error, and I have no idea why.
> I have tried to reproduce this in both CentOS and Ubuntu Docker
> containers but have been unsuccessful--the circumstances that lead to
> this situation are beyond my Linux knowledge.
>
> In any case, I was able to make forward progress by using gdb to force
> close the file descriptor (lol). For anyone in the future who comes
> across this thread: determine the PID of the process you wish to
> checkpoint and the file descriptor number for
> /var/lib/sss/mc/passwd (for me it was always 4, which is interesting),
> then do the following:
> $ gdb -p <pid>
> (gdb) call (int)close(<fd>)
> (gdb) quit
>
> After force closing the file descriptor I was able to take a checkpoint.
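For the step of determining the file descriptor number, one option is to read the symlinks under /proc/<pid>/fd. The following is only a sketch, assuming a Linux /proc filesystem and permission to inspect the target process. The function name fds_for_path is made up for illustration and is not part of any tool mentioned in this thread.

```shell
# Hypothetical helper: print every descriptor number that process <pid>
# currently has open on exactly <path>. Relies on Linux /proc/<pid>/fd.
fds_for_path() {
    pid=$1
    path=$2
    for link in /proc/"$pid"/fd/*; do
        # Each entry is a symlink named after the fd, pointing at the open file.
        if [ "$(readlink "$link" 2>/dev/null)" = "$path" ]; then
            basename "$link"
        fi
    done
}
```

Calling `fds_for_path <pid> /var/lib/sss/mc/passwd` should print the number (or numbers) to pass to close() in the gdb session above. It prints nothing if the file is not open in that process.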
>
> Now, with a successful checkpoint I then tried to restore from the
> checkpoint and failed with:
>
> Error (criu/cr-restore.c:1335): Failed to write 897973 to /proc/sys/kernel/ns_last_pid: Operation not permitted
> Error (criu/cr-restore.c:1506): Can't fork for 897974: Operation not permitted
> Error (criu/cr-restore.c:2593): Restoring FAILED.
> Error (criu/cr-restore.c:1823): Pid 915630 do not match expected 897974
>
> Since my goal is to create many processes from the same checkpoint,
> needing the same PID is going to be problematic, so I've started
> trying to see if I can use unshare to create a namespace.
>
> When I create a new namespace with:
> unshare -mrp --mount-proc --fork
>
> And then run the process I wish to checkpoint, to my pleasant
> surprise, /var/lib/sss/mc/passwd is not open, so this seems to
> coincidentally solve that issue.
>
> However, I am not able to create a checkpoint. When I run
> `jcmd <pid> JDK.checkpoint`, I get:
>
> JVM: invalid info for restore provided: queued code -1
> An exception during a checkpoint operation:
> jdk.internal.crac.CheckpointException
>          at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
>          at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
>          at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
>
> The error isn't very precise, but I suspect the issue is that jcmd
> cannot find the process: if I run `jcmd -l`, nothing shows up. Note
> that I am running this jcmd in the same namespace, so clearly I have
> done something wrong.
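One thing that can be checked here is whether the target JVM has published a perf-data file at all. The sketch below assumes the usual HotSpot behaviour on Linux, where each JVM creates a file named after its PID in an `hsperfdata_<user>` directory under the temp directory, and `jcmd -l` lists JVMs from those files. That mechanism is an assumption and is not stated anywhere in this thread. The function name list_hsperf_pids is hypothetical.

```shell
# Hypothetical helper: print "<directory> <pid>" for every perf-data file
# found under <tmpdir>/hsperfdata_*/ (tmpdir defaults to /tmp).
list_hsperf_pids() {
    tmpdir=${1:-/tmp}
    for f in "$tmpdir"/hsperfdata_*/*; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] || continue
        echo "$(basename "$(dirname "$f")") $(basename "$f")"
    done
}
```

Running `list_hsperf_pids` both inside and outside the namespace shows which PIDs are published and under which user directory. Since `unshare -r` maps the current user to root inside the namespace, the file may appear under a different user name or PID than expected, which might be worth ruling out before looking elsewhere.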
>
> If I try to create a checkpoint from outside the namespace using the
> real PID, the process prints a stack trace and the checkpoint fails
> with:
>
> com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded
>          at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
>          at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
>          at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
>          at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
>          at sun.tools.jcmd.JCmd.main(JCmd.java:131)
>
> Does anyone have any experience here? Is this approach of using
> unshare to create a new namespace going in the right direction?
>
> Thank you!
> Jack
>
> On Thu, 18 May 2023 11:58:04 +0200 Radim Vansa <rvansa at azul.com> wrote:
>> Hello Jack,
>>
>> the proper venue would be the Foojay.io forums [1] (yes, only recently
>> created) or the #crac channel on the Foojay Slack, but this list will do :)
>>
>> Can you try running the checkpoint with
>> `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd`? This should
>> bypass the checks, though problems may arise on restore if this file
>> changes while the application is checkpointed.
>>
>> Radim
>>
>> [1]
>> https://forums.foojay.io/forums/forum/coordinated-restore-at-checkpoint-crac/
>>
>> On 18. 05. 23 3:37, Jack Koenig wrote:
>>> Hello everyone,
>>>
>>> This is more of a user question, so I apologize if this is the wrong
>>> venue--please direct me to the right place as appropriate.
>>>
>>> I am attempting to checkpoint my application but I get an exception
>>> saying that /var/lib/sss/mc/passwd is open:
>>>
>>> An exception during a checkpoint operation:
>>>
>>> jdk.internal.crac.CheckpointException
>>>         at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
>>>         at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
>>>         at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
>>>         Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd
>>>                 at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87)
>>>                 at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
>>>                 ... 2 more
>>>
>>> The only thing I've found mentioning a similar issue is this old
>>> thread:
>>> https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html
>>>
>>> The workaround posted there involves system-level configuration
>>> changes, but I am an unprivileged user on a shared RHEL8 machine, so
>>> I cannot apply such a workaround.
>>>
>>> Is there anything I can do to resolve or at least work around this issue?
>>>
>>> Cheers,
>>> Jack


More information about the crac-dev mailing list