8327114: Attach in Linux may have wrong behaviour when pid == ns_pid (Kubernetes debug container)

Sebastian Lövdahl sebastian.lovdahl at hibox.tv
Sun Apr 28 19:33:06 UTC 2024


Hi all,

It seems like my fix for https://bugs.openjdk.org/browse/JDK-8226919 
regressed one use-case for Kubernetes debug containers (and other 
technically similar approaches). Quoting @jdoylei from 
https://github.com/openjdk/jdk/pull/17628#issuecomment-1969769654:

"We're running jcmd (OpenJDK build 17.0.10+7-LTS) and the target JVM in 
two separate containers in a Kubernetes pod. The target JVM container is 
already running, and then we use kubectl debug --target=... to start a 
Kubernetes debug container with jcmd that targets the first container. 
Given the --target option, they share the same Linux process namespace 
(both think the target JVM is PID 1). But since they are separate 
containers, they see different root filesystems (jcmd container sees the 
target JVM tmpdir under /proc/1/root/tmp but has its own distinct /tmp 
directory)."
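
As a rough illustration of the setup described in the quote (not the JDK's actual attach code), the sketch below lists the two places a sidecar might look for the target's attach socket: the target's own /tmp as seen through /proc/<pid>/root, and the sidecar's local /tmp. The class and method names are hypothetical; only the .java_pid<pid> file name and the /proc/<pid>/root/tmp location come from this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative only: names are hypothetical, not JDK internals.
class AttachSocketPaths {
    // Socket file name as it appears in the attach error message: .java_pid<pid>
    static String socketName(long pid) {
        return ".java_pid" + pid;
    }

    // Candidate locations: the target's root filesystem first,
    // then the caller's own /tmp.
    static List<Path> candidates(long pid) {
        return List.of(
            Path.of("/proc", Long.toString(pid), "root", "tmp", socketName(pid)),
            Path.of("/tmp", socketName(pid)));
    }

    public static void main(String[] args) {
        // Defaults to pid 1, the pid both containers see in the quoted setup.
        long pid = args.length > 0 ? Long.parseLong(args[0]) : 1;
        for (Path p : candidates(pid)) {
            System.out.println(p + " exists=" + Files.exists(p));
        }
    }
}
```

When the two containers have separate root filesystems, only the first candidate can refer to the target JVM's socket file; the second refers to the sidecar's own /tmp.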

I think I can confirm that there is a regression. I reproduced it with a 
locally built JDK from master as of 2024-04-28 
(16c7dcdb04a7c220684a20eb4a0da4505ae03813), using raw Docker containers 
instead of Kubernetes + kubectl debug:


slovdahl@ubuntu2204:~/reproducer$ cat Reproducer.java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class Reproducer {
   public static void main(String[] args) throws InterruptedException, IOException {
     System.out.println("Hello, World!");
     try (var server = new ServerSocket()) {
       server.bind(new InetSocketAddress("localhost", 81));
       System.out.println("Bound to port 81");
       while (true) {
         Thread.sleep(1_000L);
       }
     }
   }
}

slovdahl@ubuntu2204:~/reproducer$ docker run --interactive --tty --rm \
    --name app-container \
    --volume ~/jdk/build/linux-x86_64-server-release/images/jdk/:/jdk \
    --volume .:/app \
    --workdir /app ubuntu:22.04 /bin/bash
root@d1f87b8059ea:/app# /jdk/bin/java -version
openjdk version "23-internal" 2024-09-17
OpenJDK Runtime Environment (build 23-internal-adhoc.slovdahl.jdk)
OpenJDK 64-Bit Server VM (build 23-internal-adhoc.slovdahl.jdk, mixed mode, sharing)

root@d1f87b8059ea:/app# /jdk/bin/java Reproducer.java
Hello, World!
Bound to port 81


Locally built JDK and jcmd from the host (works):

slovdahl@ubuntu2204:~/reproducer$ sudo ~/jdk/build/linux-x86_64-server-release/images/jdk/bin/jcmd 942781 VM.version
942781:
OpenJDK 64-Bit Server VM version 23-internal-adhoc.slovdahl.jdk
JDK 23.0.0

Locally built jcmd from a sidecar Docker container that joins the app 
container's process namespace (does NOT work, regressed):

slovdahl@ubuntu2204:~/reproducer$ docker run --interactive --tty --rm \
    --pid=container:app-container \
    --volume ~/jdk/build/linux-x86_64-server-release/images/jdk/:/jdk \
    ubuntu:22.04 /bin/bash
root@27d8be9186b7:/# /jdk/bin/jcmd
26 jdk.compiler/com.sun.tools.javac.launcher.SourceLauncher Reproducer.java
59 jdk.jcmd/sun.tools.jcmd.JCmd
root@27d8be9186b7:/# /jdk/bin/jcmd 26 VM.version
26:
com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file /tmp/.java_pid26: target process 26 doesn't respond within 10500ms or HotSpot VM not loaded
     at jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:99)
     at jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
     at jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
     at jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:113)
     at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:97)
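
In the transcripts the host addresses the JVM as pid 942781, while both containers see pid 26 (presumably the same process viewed from two PID namespaces). On Linux, the NSpid line in /proc/<pid>/status lists a process's pid in each nested PID namespace, outermost first. The sketch below parses such a line. The class and method names are hypothetical, the sample status text is made up for illustration, and this is not a claim about how the attach code itself reads the value.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names are hypothetical, not JDK internals.
class NsPid {
    // Returns the pids from the "NSpid:" line of a /proc/<pid>/status text,
    // outermost namespace first. Returns an empty list if the line is absent
    // or carries no values.
    static List<Long> parse(String statusText) {
        for (String line : statusText.split("\n")) {
            if (line.startsWith("NSpid:")) {
                String rest = line.substring("NSpid:".length()).trim();
                if (rest.isEmpty()) {
                    return List.of();
                }
                List<Long> pids = new ArrayList<>();
                for (String token : rest.split("\\s+")) {
                    pids.add(Long.parseLong(token));
                }
                return pids;
            }
        }
        return List.of();
    }

    // The innermost (last) entry is the pid as the target itself sees it.
    static long innermost(List<Long> nsPids) {
        return nsPids.get(nsPids.size() - 1);
    }

    public static void main(String[] args) {
        // Made-up sample mirroring the pids seen in the transcripts.
        String sample = "Name:\tjava\nNSpid:\t942781\t26\n";
        List<Long> pids = parse(sample);
        System.out.println("pids=" + pids + " innermost=" + innermost(pids));
    }
}
```

From inside a container that shares the target's PID namespace, the line would carry a single value, which is the pid == ns_pid situation named in the subject.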


Using Temurin 17.0.11 from the host (works):

slovdahl@ubuntu2204:~/reproducer$ /usr/lib/jvm/temurin-17-jdk-amd64/bin/java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode, sharing)
slovdahl@ubuntu2204:~/reproducer$ sudo /usr/lib/jvm/temurin-17-jdk-amd64/bin/jcmd 942781 VM.version
942781:
OpenJDK 64-Bit Server VM version 23-internal-adhoc.slovdahl.jdk
JDK 23.0.0


Temurin 17.0.11 jcmd from a sidecar Docker container that joins the app 
container's process namespace (works):

slovdahl@ubuntu2204:~/reproducer$ docker run --interactive --tty --rm \
    --pid=container:app-container eclipse-temurin:17.0.11_9-jdk-jammy /bin/bash
root@fcbd6e4be9eb:/# java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode, sharing)
root@fcbd6e4be9eb:/# jcmd
138 jdk.jcmd/sun.tools.jcmd.JCmd
26 jdk.compiler/com.sun.tools.javac.launcher.SourceLauncher Reproducer.java
root@fcbd6e4be9eb:/# jcmd 26 VM.version
26:
OpenJDK 64-Bit Server VM version 23-internal-adhoc.slovdahl.jdk
JDK 23.0.0


Curiously enough, there is a test that on the surface seems to be 
written specifically for this case 
(test/hotspot/jtreg/containers/docker/TestJcmdWithSideCar.java). But the 
devil is in the details: in TestJcmdWithSideCar, /tmp in the main 
container is a volume that is also mounted into the sidecar container, 
so attaching from the sidecar works without going through 
/proc/<pid>/cwd, and hence the test passes both before and after my fix.

Knowing up front that /tmp needs to be a volume and that it needs to be 
mounted into the sidecar container feels like a hard ask to me, so I can 
definitely see why one would want to be able to attach to containers 
without having to do that. So I think it would make sense to get this 
regression fixed. Maybe also change the existing test to not mount /tmp 
between the containers? Or, as an alternative, have tests for both the 
"mount /tmp" approach and for not doing it.
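
To make the two test variants concrete, here is a minimal sketch that builds the sidecar's docker run argument list with and without a shared /tmp. It assumes the sharing is done with docker's --volumes-from flag, which only shares /tmp if the main container declares /tmp as a volume; how TestJcmdWithSideCar actually wires this up is not shown in this mail. The class, enum, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names are hypothetical, not taken from the jtreg test.
class SidecarCommand {
    enum TmpSharing { SHARED_TMP, SEPARATE_TMP }

    // Builds "docker run" arguments for a sidecar that joins the main
    // container's PID namespace and then runs the given command.
    static List<String> build(String mainContainer, String image,
                              TmpSharing mode, List<String> command) {
        List<String> args = new ArrayList<>(List.of(
            "docker", "run", "--rm", "--pid=container:" + mainContainer));
        if (mode == TmpSharing.SHARED_TMP) {
            // Assumption: the main container was started with /tmp as a volume.
            args.add("--volumes-from");
            args.add(mainContainer);
        }
        args.add(image);
        args.addAll(command);
        return args;
    }

    public static void main(String[] args) {
        List<String> jcmd = List.of("/jdk/bin/jcmd", "26", "VM.version");
        for (TmpSharing mode : TmpSharing.values()) {
            System.out.println(mode + ": " + String.join(" ",
                build("app-container", "ubuntu:22.04", mode, jcmd)));
        }
    }
}
```

A test parameterized over both modes would exercise the attach path with and without a /tmp that is visible from the sidecar.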

Thoughts about this? I could take a look at it if you think it makes 
sense.

Best regards,

-- 
Sebastian Lövdahl
Senior Software Engineer, Hibox Systems - https://www.hibox.tv
sebastian.lovdahl at hibox.tv


