ZGC: Failed to truncate backing file (Permission denied) at startup after 14.0.2 to 15.0.1 upgrade
David Tavoularis
David.Tavoularis at mycom-osi.com
Tue Nov 24 09:51:33 UTC 2020
Hi Per,
Thanks to your help, we identified that 15.0.1 (with ZGC) was starting
fine with recent container-selinux versions (2.33/2.36/2.42), but was
broken with older one (2.21) :
- Not working : container-selinux-2.21-2.gitba103ac.el7.noarch
- Working : container-selinux-2.33-1.git86f33cd.el7.noarch
- Working : container-selinux-2.36-1.gitff95335.el7.noarch
- Working : container-selinux-2.42-1.gitad8f0f7.el7.noarch
We will plan to upgrade the nodes running with old version.
Just for my information, is there a plan to support the old buggy
container-selinux in openjdk-15.0.2 by implementing a fallback to
fallocate when ftruncate returns EACCESS ?
Best Regards
--
David
On Mon, 23 Nov 2020 18:40:41 +0100, Per Liden <per.liden at oracle.com> wrote:
> Hi David,
>
> On 11/23/20 3:35 PM, David Tavoularis wrote:
>> Hi Per,
>> Thank you for your quick reply.
>>
>>> The fact that ftruncate returns EACCESS suggests that there's some
>>> kind of environment problem that is blocking the file from growing.
>>> In the production system, is this process running in some constrained
>>> environment/container/cgroup?
>> Yes, it is running inside a pod in an OpenShift environment.
>>
>>> To test this, run with JDK 14 and use "-Xms270g -Xmx270g"
>> I tried with "-Xms100g -Xmx100g", I hope that it is fine for this test.
>> $ /opt/3rd_party/jdk_installed/jdk-14.0.2/bin/java -Xms100g -Xmx100g
>> -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xlog:gc*=debug -version
>> [0.006s][debug][gc,heap] Minimum heap 107374182400 Initial heap
>> 107374182400 Maximum heap 107374182400
>> [0.006s][info ][gc,init] Initializing The Z Garbage Collector
>> [0.006s][info ][gc,init] Version: 14.0.2+12-46 (release)
>> [0.006s][info ][gc,init] NUMA Support: Disabled
>> [0.006s][info ][gc,init] CPUs: 70 total, 40 available
>> [0.006s][info ][gc,init] Memory: 419840M
>> [0.006s][info ][gc,init] Large Page Support: Disabled
>> [0.006s][info ][gc,init] Medium Page Size: 32M
>> [0.006s][info ][gc,init] Workers: 24 parallel, 5 concurrent
>> [0.009s][debug][gc,task] Executing Task: ZWorkersInitializeTask, Active
>> Workers: 24
>> [0.009s][info ][gc,init] Address Space Type:
>> Contiguous/Unrestricted/Complete
>> [0.009s][info ][gc,init] Address Space Size: 1638400M x 3 = 4915200M
>> [0.010s][info ][gc,init] Heap backed by file: /memfd:java_heap
>> [0.010s][info ][gc,init] Min Capacity: 102400M
>> [0.010s][info ][gc,init] Initial Capacity: 102400M
>> [0.010s][info ][gc,init] Max Capacity: 102400M
>> [0.010s][info ][gc,init] Max Reserve: 80M
>> [0.010s][info ][gc,init] Pre-touch: Disabled
>> [0.010s][info ][gc,init] Available space on backing filesystem: N/A
>> [25.372s][info ][gc,init] Uncommit: Disabled
>> [25.372s][debug][gc,marking] Expanding mark stack space: 0M->32M
>> [25.888s][info ][gc,init ] Runtime Workers: 24 parallel
>> [25.960s][info ][gc ] Using The Z Garbage Collector
>> [26.179s][debug][gc,nmethod] Rebuilding NMethod Table: 0->1024 entries,
>> 0(0%->0%) registered, 0(0%->0%) unregistered
>> openjdk version "14.0.2" 2020-07-14
>> OpenJDK Runtime Environment (build 14.0.2+12-46)
>> OpenJDK 64-Bit Server VM (build 14.0.2+12-46, mixed mode)
>> $ /opt/3rd_party/jdk_installed/jdk-15.0.1/bin/java -Xms100g -Xmx100g
>> -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xlog:gc*=debug -version
>> [0.006s][debug][gc,heap] Minimum heap 107374182400 Initial heap
>> 107374182400 Maximum heap 107374182400
>> [0.006s][info ][gc,init] Initializing The Z Garbage Collector
>> [0.006s][info ][gc,init] Version: 15.0.1+9-18 (release)
>> [0.006s][info ][gc,init] NUMA Support: Disabled
>> [0.006s][info ][gc,init] CPUs: 70 total, 40 available
>> [0.006s][info ][gc,init] Memory: 419840M
>> [0.006s][info ][gc,init] Large Page Support: Disabled
>> [0.007s][info ][gc,init] Workers: 24 parallel, 5 concurrent
>> [0.010s][debug][gc,task] Executing Task: ZWorkersInitializeTask, Active
>> Workers: 24
>> [0.011s][info ][gc,init] Address Space Type:
>> Contiguous/Unrestricted/Complete
>> [0.011s][info ][gc,init] Address Space Size: 1638400M x 3 = 4915200M
>> [0.012s][info ][gc,init] Heap Backing File: /memfd:java_heap
>> [0.012s][error][gc ] Failed to truncate backing file (Permission
>> denied)
>> [0.012s][debug][gc,marking] Expanding mark stack space: 0M->32M
>> [0.081s][info ][gc,init ] Runtime Workers: 24 parallel
>> [0.084s][info ][gc ] Using The Z Garbage Collector
>> Error: Could not create the Java Virtual Machine.
>> Error: A fatal exception has occurred. Program will exit.
>>
>>> The fact that ftruncate returns EACCESS suggests that there's some
>>> kind of environment problem
>> Is there a workaround or a way to fix the environment problem ?
>
> I looked around for Docker + ftruncate issues, and found a number of
> bugs related to this, for example:
>
> https://github.com/openshift/origin/issues/15723
> https://access.redhat.com/errata/RHBA-2018:0195
>
> These bugs are fairly old, but they seem to describe the exact same
> problem you're seeing. In short, it's an docker+selinux policy bug
> affecting memfd_create+ftruncate. Maybe you're production system haven't
> been patched with this bugfix?
>
> This would also explain why JDK 14 works, since we used fallocate
> (instead of ftruncate) to grow the file there, and therefore happened to
> dodge this issue.
>
> Please have a look at those links, and see if you might have run into
> the same issue.
>
> cheers,
> Per
>
>> Best Regards
>> David
>> On Mon, 23 Nov 2020 15:11:30 +0100, Per Liden <per.liden at oracle.com>
>> wrote:
>>
>>> Hi David,
>>>
>>> On 11/20/20 4:21 PM, David Tavoularis wrote:
>>>> Hi,
>>>> I found a possible regression in customer production environment
>>>> linked to JDK-8245203 "ZGC: Don't track size in
>>>> ZPhysicalMemoryBacking" or JDK15 "Fixed support for transparent huge
>>>> pages".
>>>> After upgrade from jdk-14.0.2 to jdk-15.0.1, the JVM (using ZGC) is
>>>> crashing at startup with the error message "Failed to truncate
>>>> backing file (Permission denied)"
>>>> This error message was introduced in changelist
>>>> http://hg.openjdk.java.net/jdk-updates/jdk15u/rev/556d5070c458
>>>> Cf source code :
>>>> http://hg.openjdk.java.net/jdk-updates/jdk15u/file/556d5070c458/src/hotspot/os/linux/gc/z/zPhysicalMemoryBacking_linux.cpp
>>>> The issue was seen in Prod, not in Test, nor in labs, I suspect
>>>> that Heap Backing Filesystem is not correctly set to tmpfs in Prod
>>>> or a permission issue on a directory/file owned by root.
>>>> $ /opt/3rd_party/jdk_installed/jdk-15.0.1/bin/java -version
>>>> openjdk version "15.0.1" 2020-10-20
>>>> OpenJDK Runtime Environment (build 15.0.1+9-18)
>>>> OpenJDK 64-Bit Server VM (build 15.0.1+9-18, mixed mode, sharing)
>>>> $ /opt/3rd_party/jdk_installed/jdk-15.0.1/bin/java -XX:+UseZGC
>>>> -version
>>>> [0.009s][error][gc] Failed to truncate backing file (Permission
>>>> denied)
>>>> Error: Could not create the Java Virtual Machine.
>>>> Error: A fatal exception has occurred. Program will exit.
>>>> $ /opt/mycom/3rd_party/jdk_installed/jdk-14.0.2/bin/java
>>>> -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -version
>>>> openjdk version "14.0.2" 2020-07-14
>>>> OpenJDK Runtime Environment (build 14.0.2+12-46)
>>>> OpenJDK 64-Bit Server VM (build 14.0.2+12-46, mixed mode)
>>>> Should I fill a bug at https://bugreport.java.com/bugreport/ ?
>>>> What additional information should I provide ?
>>>> Any workaround ?
>>>
>>> It's hard to tell exactly why this fails. One difference between JDK
>>> 14 and 15 is that the backing file will be truncated to the max heap
>>> size up-front when the JVM starts (but it will not be backed by any
>>> memory until we actually start to expand the heap), instead of
>>> gradually grown as the heap expands. So, one theory is that you could
>>> have had this problem even with 14, but the heap might not have grown
>>> large enough for the problem to be exposed. To test this, run with JDK
>>> 14 and use "-Xms270g -Xmx270g". This will force the backing file to be
>>> truncated up-front, similar to what happens in JDK 15.
>>>
>>> Also running with "-Xlog:gc*=debug" will print a few more things that
>>> might be interesting.
>>>
>>> The fact that ftruncate returns EACCESS suggests that there's some
>>> kind of environment problem that is blocking the file from growing. In
>>> the production system, is this process running in some constrained
>>> environment/container/cgroup?
>>>
>>> cheers,
>>> Per
More information about the zgc-dev
mailing list