Massive slowdown in regex processing while GC log fills with ICBufferFull messages
David Holmes
david.holmes at oracle.com
Mon Aug 1 00:58:53 UTC 2022
Hi Paul,
Redirecting to hotspot-dev as this issue is not likely to get much
visibility on jdk-updates-dev.
Probably worth filing an issue in bugs.openjdk.org if you can.
Cheers,
David
On 1/08/2022 10:37 am, Paul Wagland wrote:
> Hi all,
>
> We have been running into a problem with Java 17.0.1, 17.0.2, and now
> 17.0.3.
>
> # java --version
> openjdk 17.0.3 2022-04-19
> OpenJDK Runtime Environment Temurin-17.0.3+7 (build 17.0.3+7)
> OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (build 17.0.3+7, mixed mode,
> sharing)
>
>
> The issue that we are seeing is very similar in behaviour to
> https://bugs.openjdk.org/browse/JDK-8277212, with two notable
> differences:
> 1. We are seeing this issue in 17.0.3, whereas JDK-8277212 was resolved
> in either 17.0.2 or 17.0.3-oracle.
> 2. We are using G1GC, not ZGC
>
> Unfortunately, we don’t know how to reproduce this issue; we see it
> happen about once a week across our fleet of about 300 Java instances
> running a variety of workloads.
>
> What happens is that on an affected machine, after some period of time,
> CPU usage suddenly spikes and the GC logs start getting flooded with
> ICBufferFull messages, at a rate of about 430 per second. Thread dumps
> taken at that time show most threads deep in a regex-processing stack;
> unfortunately one of these regexes is fairly central to the product, so
> this ends up affecting most threads. Almost everything that isn’t regex
> processing seems to operate at, or near, full speed. This is one of the
> reasons this feels similar to JDK-8277212: that issue was mashing the
> vtables, and because of the way regexes are implemented they would
> appear to use a lot of vtable lookups, so any slowdown in that area is
> likely to affect them greatly. Note that this is pure supposition, and I
> would love to know whether there is any way we can prove or disprove it,
> either via testing or additional logging.
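>
> One idea we have for trying to prove or disprove the vtable theory is a
> small megamorphic-call probe that we could run inside an affected JVM
> during the slowdown. A rough sketch (the class names are illustrative
> only, not our product code):
>
> // Times a deliberately megamorphic interface call, so the call site has
> // to go through the itable/inline-cache machinery instead of being
> // inlined monomorphically. If this degrades together with the regexes,
> // it would point at the call-site machinery rather than the regex code.
> interface Op { int apply(int x); }
>
> final class Add implements Op { public int apply(int x) { return x + 1; } }
> final class Mul implements Op { public int apply(int x) { return x * 3; } }
> final class Xor implements Op { public int apply(int x) { return x ^ 7; } }
> final class Shl implements Op { public int apply(int x) { return x << 1; } }
>
> public class MegamorphicProbe {
>     public static void main(String[] args) {
>         Op[] ops = { new Add(), new Mul(), new Xor(), new Shl() };
>         int acc = 0;
>         // Warm up so the call site is compiled and goes megamorphic.
>         for (int i = 0; i < 1_000_000; i++) acc += ops[i & 3].apply(i);
>         int calls = 50_000_000;
>         long start = System.nanoTime();
>         for (int i = 0; i < calls; i++) acc += ops[i & 3].apply(i);
>         long elapsed = System.nanoTime() - start;
>         System.out.println("megamorphic call: " + (elapsed / calls)
>                 + "ns/call (acc=" + acc + ")");
>     }
> }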
>
> Unfortunately, as mentioned above, we are not able to reproduce this in
> our test environments; we only see it in customer environments, and then
> without any clearly identifiable trigger. We have seen the issue about a
> dozen times across 17.0.1, 17.0.2, and, most recently, 17.0.3.
>
> We have done some performance testing, and during the slowdown our
> regex-processing test reports:
>
> 10 iterations of testString 1 took 10700ms (1070004460ns/match)
> (matched=false)
> 10 iterations of testString 2 took 8965ms (896560879ns/match) (matched=true)
>
> Normal timings look like:
> 18503 iterations of testString 1 took 1000ms (54047ns/match) (matched=false)
> 40954 iterations of testString 2 took 1000ms (24417ns/match) (matched=true)
>
> So that represents a slowdown of roughly 20,000 times!
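>
> For anyone who wants to reproduce the style of measurement, a minimal
> harness along the following lines (with a placeholder pattern and test
> strings, not our real ones) produces the same kind of output:
>
> import java.util.regex.Pattern;
>
> public class RegexProbe {
>     public static void main(String[] args) {
>         // Placeholder pattern and inputs; the real pattern and test
>         // strings come from the product and are not reproduced here.
>         Pattern pattern = Pattern.compile("[a-z]+-[0-9]+");
>         String[] testStrings = { "alpha-beta", "alpha-42" };
>         for (int t = 0; t < testStrings.length; t++) {
>             long budget = 1_000_000_000L; // run for roughly one second
>             long start = System.nanoTime();
>             long elapsed;
>             int iterations = 0;
>             boolean matched;
>             do {
>                 matched = pattern.matcher(testStrings[t]).matches();
>                 iterations++;
>                 elapsed = System.nanoTime() - start;
>             } while (iterations < 10 || elapsed < budget);
>             System.out.println(iterations + " iterations of testString "
>                     + (t + 1) + " took " + (elapsed / 1_000_000) + "ms ("
>                     + (elapsed / iterations) + "ns/match) (matched="
>                     + matched + ")");
>         }
>     }
> }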
>
> We have another test that is mostly map-lookup heavy; during the
> high-CPU period it reports:
> doMapOperations:2:31076ns
> doMapOperations:3:16951ns
> doMapOperations:4:14192ns
>
> While normally it reports:
> doMapOperations:2:27392ns
> doMapOperations:3:7987ns
> doMapOperations:4:6079ns
>
> So a slowdown of maybe a factor of 2, which is perfectly explainable by
> the very high CPU/system load.
>
> Our JVM startup arguments are:
>
> -javaagent:/opt/jmx-exporter/jmx-exporter.jar=9404:/opt/jmx-exporter/config.yaml
> --add-opens=java.base/java.lang=ALL-UNNAMED
> --add-opens=java.base/java.io=ALL-UNNAMED
> --add-opens=java.base/java.util=ALL-UNNAMED
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
> --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
> -Dnop
> -Djava.security.egd=file:/dev/./urandom
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/opt/…/logs
> -Djava.awt.headless=true
> -XX:+PrintFlagsFinal
> -DRP_CONTEXTPATH=
> -DRP_UNPACKWAR=true
> -XX:FlightRecorderOptions=stackdepth=256
> -XX:+UnlockDiagnosticVMOptions
> -XX:+DebugNonSafepoints
> -Xlog:gc*,safepoint=info:file=/opt/…/logs/verbosegc.log:time,uptimemillis:filecount=10,filesize=50M
> -Xlog:class*=info:file=/opt/…/logs/classload.log:time,uptimemillis,level,tags:filecount=10,filesize=50M
> -XX:+UseG1GC
> -XX:G1PeriodicGCInterval=10000
> -XX:+ParallelRefProcEnabled
> -XX:MaxDirectMemorySize=25M
> -XX:MaxMetaspaceSize=300M
> -XX:MetaspaceSize=100M
> -Xss1024K
> -XX:ReservedCodeCacheSize=150M
> -XX:+SegmentedCodeCache
> -Xmx862m
> -XX:MaxMetaspaceFreeRatio=20
> -XX:MinMetaspaceFreeRatio=5
> -XX:MaxHeapFreeRatio=40
> -XX:MinHeapFreeRatio=5
> -Djdk.nio.maxCachedBufferSize=8192
> -XX:SoftMaxHeapSize=646m
> -Dlog4j2.disableJmx=true
> -Djdk.tls.ephemeralDHKeySize=2048
> -Djava.protocol.handler.pkgs=org.apache.catalina.webresources
> -Dorg.apache.catalina.security.SecurityListener.UMASK=0027
> -Dignore.endorsed.dirs=
> -Dcatalina.base=/opt/…/server1
> -Dcatalina.home=/opt/…/tomcat
> -Djava.io.tmpdir=/opt/…/temp
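>
> If it would help to correlate the slowdown with code cache behaviour
> (we do set -XX:ReservedCodeCacheSize=150M and -XX:+SegmentedCodeCache),
> we could also periodically log code-heap occupancy from inside the JVM.
> A rough sketch using the standard memory pool MXBeans (the class name is
> illustrative):
>
> import java.lang.management.ManagementFactory;
> import java.lang.management.MemoryPoolMXBean;
>
> public class CodeHeapLogger {
>     // With -XX:+SegmentedCodeCache the pools are named
>     // "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'" and
>     // "CodeHeap 'non-profiled nmethods'".
>     public static void logCodeHeaps() {
>         for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
>             if (pool.getName().startsWith("CodeHeap")) {
>                 long used = pool.getUsage().getUsed();
>                 long max = pool.getUsage().getMax();
>                 System.out.printf("%s: %d/%d KB used%n",
>                         pool.getName(), used / 1024, max / 1024);
>             }
>         }
>     }
>
>     public static void main(String[] args) {
>         logCodeHeaps();
>     }
> }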
>
> Can anyone think of anything we can do to ameliorate this problem,
> apart from not using regexes? We are moving away from them anyway, but I
> think they are the victim here rather than the cause, since in normal
> use they are “fast enough”.
>
> If there is any information that I can add, please let me know!
>
> Cheers,
> Paul