Massive slowdown in regex processing while GC log fills with ICBufferFull messages
David Holmes
david.holmes at oracle.com
Mon Aug 1 00:58:53 UTC 2022
Hi Paul,
Redirecting to hotspot-dev as this issue is not likely to get much
visibility on jdk-updates-dev.
Probably worth filing an issue in bugs.openjdk.org if you can.
Cheers,
David
On 1/08/2022 10:37 am, Paul Wagland wrote:
> Hi all,
>
> We have been running into a problem with Java 17.0.1, 17.0.2, and now
> 17.0.3.
>
> # java --version
> openjdk 17.0.3 2022-04-19
> OpenJDK Runtime Environment Temurin-17.0.3+7 (build 17.0.3+7)
> OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (build 17.0.3+7, mixed mode,
> sharing)
>
>
> The issue that we are seeing is very similar in behaviour to
> https://bugs.openjdk.org/browse/JDK-8277212, with two notable
> differences:
> 1. We are seeing this issue in 17.0.3, whereas JDK-8277212 was resolved
> in either 17.0.2 or 17.0.3-oracle.
> 2. We are using G1GC, not ZGC
>
> Unfortunately, we don’t know how to reproduce this issue; we see it
> happen about once a week across our fleet of about 300 Java instances
> running a variety of workloads.
>
> What happens is that on an affected machine, after some period of time,
> CPU usage suddenly spikes and the GC logs start getting flooded with
> ICBufferFull messages, at a rate of about 430 per second. Thread dumps
> taken at that time show most threads deep in a regex-processing stack;
> unfortunately one of these regexes is fairly central to the product, so
> this ends up affecting most threads. Almost everything that isn’t regex
> processing seems to operate at, or near, full speed. This is one of the
> reasons this feels similar to JDK-8277212: that issue was mashing the
> vtables, and because of the way regexes are implemented they would
> appear to use a lot of vtable lookups, so any slowdown in that area is
> likely to affect them greatly. Note that this is pure supposition, and I
> would love to know whether there is any way we can prove or disprove it,
> either via testing or additional logging.
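>
> One idea we have for trying to prove or disprove the vtable theory is a
> small megamorphic-call probe that we could run inside an affected JVM
> during the slowdown. A rough sketch (the class names are illustrative
> only, not our product code):
>
> // Times a deliberately megamorphic interface call, so the call site has
> // to go through the itable/inline-cache machinery instead of being
> // inlined monomorphically. If this degrades together with the regexes,
> // it would point at the call-site machinery rather than the regex code.
> interface Op { int apply(int x); }
>
> final class Add implements Op { public int apply(int x) { return x + 1; } }
> final class Mul implements Op { public int apply(int x) { return x * 3; } }
> final class Xor implements Op { public int apply(int x) { return x ^ 7; } }
> final class Shl implements Op { public int apply(int x) { return x << 1; } }
>
> public class MegamorphicProbe {
>     public static void main(String[] args) {
>         Op[] ops = { new Add(), new Mul(), new Xor(), new Shl() };
>         int acc = 0;
>         // Warm up so the call site is compiled and goes megamorphic.
>         for (int i = 0; i < 1_000_000; i++) acc += ops[i & 3].apply(i);
>         int calls = 50_000_000;
>         long start = System.nanoTime();
>         for (int i = 0; i < calls; i++) acc += ops[i & 3].apply(i);
>         long elapsed = System.nanoTime() - start;
>         System.out.println("megamorphic call: " + (elapsed / calls)
>                 + "ns/call (acc=" + acc + ")");
>     }
> }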
>
> Unfortunately, as mentioned above, we are not able to reproduce this in
> our test environments; we only see it in customer environments, and then
> without any clearly identifiable trigger. We have seen the issue about a
> dozen times across 17.0.1, 17.0.2, and, most recently, 17.0.3.
>
> We have done some performance testing, and during the slowdown our
> regex-processing test reports:
>
> 10 iterations of testString 1 took 10700ms (1070004460ns/match)
> (matched=false)
> 10 iterations of testString 2 took 8965ms (896560879ns/match) (matched=true)
>
> Normal timings look like:
> 18503 iterations of testString 1 took 1000ms (54047ns/match) (matched=false)
> 40954 iterations of testString 2 took 1000ms (24417ns/match) (matched=true)
>
> So that represents a slowdown of roughly 20,000 times!
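>
> For anyone who wants to reproduce the style of measurement, a minimal
> harness along the following lines (with a placeholder pattern and test
> strings, not our real ones) produces the same kind of output:
>
> import java.util.regex.Pattern;
>
> public class RegexProbe {
>     public static void main(String[] args) {
>         // Placeholder pattern and inputs; the real pattern and test
>         // strings come from the product and are not reproduced here.
>         Pattern pattern = Pattern.compile("[a-z]+-[0-9]+");
>         String[] testStrings = { "alpha-beta", "alpha-42" };
>         for (int t = 0; t < testStrings.length; t++) {
>             long budget = 1_000_000_000L; // run for roughly one second
>             long start = System.nanoTime();
>             long elapsed;
>             int iterations = 0;
>             boolean matched;
>             do {
>                 matched = pattern.matcher(testStrings[t]).matches();
>                 iterations++;
>                 elapsed = System.nanoTime() - start;
>             } while (iterations < 10 || elapsed < budget);
>             System.out.println(iterations + " iterations of testString "
>                     + (t + 1) + " took " + (elapsed / 1_000_000) + "ms ("
>                     + (elapsed / iterations) + "ns/match) (matched="
>                     + matched + ")");
>         }
>     }
> }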
>
> We have another test that is mostly map-lookup heavy; during the
> high-CPU period it reports:
> doMapOperations:2:31076ns
> doMapOperations:3:16951ns
> doMapOperations:4:14192ns
>
> While normally it reports:
> doMapOperations:2:27392ns
> doMapOperations:3:7987ns
> doMapOperations:4:6079ns
>
> So a slowdown of maybe a factor of 2, which is perfectly explainable by
> the very high CPU/system load.
>
> Our JVM startup arguments are:
>
> -javaagent:/opt/jmx-exporter/jmx-exporter.jar=9404:/opt/jmx-exporter/config.yaml
> --add-opens=java.base/java.lang=ALL-UNNAMED
> --add-opens=java.base/java.io=ALL-UNNAMED
> --add-opens=java.base/java.util=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
> --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
> -Dnop
> -Djava.security.egd=file:/dev/./urandom
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/opt/…/logs
> -Djava.awt.headless=true
> -XX:+PrintFlagsFinal
> -DRP_CONTEXTPATH=
> -DRP_UNPACKWAR=true
> -XX:FlightRecorderOptions=stackdepth=256
> -XX:+UnlockDiagnosticVMOptions
> -XX:+DebugNonSafepoints
> -Xlog:gc*,safepoint=info:file=/opt/…/logs/verbosegc.log:time,uptimemillis:filecount=10,filesize=50M
> -Xlog:class*=info:file=/opt/…/logs/classload.log:time,uptimemillis,level,tags:filecount=10,filesize=50M
> -XX:+UseG1GC
> -XX:G1PeriodicGCInterval=10000
> -XX:+ParallelRefProcEnabled
> -XX:MaxDirectMemorySize=25M
> -XX:MaxMetaspaceSize=300M
> -XX:MetaspaceSize=100M
> -Xss1024K
> -XX:ReservedCodeCacheSize=150M
> -XX:+SegmentedCodeCache
> -Xmx862m
> -XX:MaxMetaspaceFreeRatio=20
> -XX:MinMetaspaceFreeRatio=5
> -XX:MaxHeapFreeRatio=40
> -XX:MinHeapFreeRatio=5
> -Djdk.nio.maxCachedBufferSize=8192
> -XX:SoftMaxHeapSize=646m
> -Dlog4j2.disableJmx=true
> -Djdk.tls.ephemeralDHKeySize=2048
> -Djava.protocol.handler.pkgs=org.apache.catalina.webresources
> -Dorg.apache.catalina.security.SecurityListener.UMASK=0027
> -Dignore.endorsed.dirs=
> -Dcatalina.base=/opt/…/server1
> -Dcatalina.home=/opt/…/tomcat
> -Djava.io.tmpdir=/opt/…/temp
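>
> If it would help to correlate the slowdown with code cache behaviour
> (we do set -XX:ReservedCodeCacheSize=150M and -XX:+SegmentedCodeCache),
> we could also periodically log code-heap occupancy from inside the JVM.
> A rough sketch using the standard memory pool MXBeans (the class name is
> illustrative):
>
> import java.lang.management.ManagementFactory;
> import java.lang.management.MemoryPoolMXBean;
>
> public class CodeHeapLogger {
>     // With -XX:+SegmentedCodeCache the pools are named
>     // "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'" and
>     // "CodeHeap 'non-profiled nmethods'".
>     public static void logCodeHeaps() {
>         for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
>             if (pool.getName().startsWith("CodeHeap")) {
>                 long used = pool.getUsage().getUsed();
>                 long max = pool.getUsage().getMax();
>                 System.out.printf("%s: %d/%d KB used%n",
>                         pool.getName(), used / 1024, max / 1024);
>             }
>         }
>     }
>
>     public static void main(String[] args) {
>         logCodeHeaps();
>     }
> }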
>
> Can anyone think of anything we can do to ameliorate this problem,
> apart from not using regexes? We are moving away from them anyway, but I
> think they are the victim here rather than the cause, since in normal
> use they are “fast enough”.
>
> If there is any information that I can add, please let me know!
>
> Cheers,
> Paul