Thread-Local Handshakes slowdown on single-cpu machines
Karakó Miklós
palacsint at palacsint.hu
Wed Apr 29 03:26:43 UTC 2020
Hi Patricio,
Yes, it seems like a scheduling issue for us too. One of my colleagues
spotted the following sentence in Robbin Ehn's Handshaking HotSpot
presentation (https://youtu.be/VBCOfAJ409s?t=2222): ~"if the VM thread
finds that all threads are unsafe, it keeps spinning"
Could it be that actually the VM thread gets more CPU time and actually
blocks JavaThread?
I did not except that but I could reproduce the issue with a simple
hello world:
public class Main {
public static void main(String[] args) {
System.out.println("Hello world");
}
}
A few measurements on Windows 10 x64 in VirtualBox:
PS Z:\main> Measure-Command { .\main-1cpu.bat }
Minutes : 0
Seconds : 7
Milliseconds : 20
Ticks : 70203455
TotalMinutes : 0.117005758333333
TotalSeconds : 7.0203455
TotalMilliseconds : 7020.3455
PS Z:\main> Measure-Command { .\main-full.bat }
Minutes : 0
Seconds : 1
Milliseconds : 575
Ticks : 15751352
TotalSeconds : 1.5751352
TotalMilliseconds : 1575.1352
PS Z:\main> Measure-Command { .\main-full.bat }
Minutes : 0
Seconds : 1
Milliseconds : 558
Ticks : 15585384
TotalSeconds : 1.5585384
TotalMilliseconds : 1558.5384
PS Z:\main> Measure-Command { .\main-1cpu.bat }
Minutes : 0
Seconds : 23
Milliseconds : 648
Ticks : 236482685
TotalSeconds : 23.6482685
TotalMilliseconds : 23648.2685
main-1cpu.bat:
c:\windows\system32\cmd.exe /C start /affinity 1 \jdk-14\bin\java
-Xlog:all=trace:file=app-1cpu.txt:uptime,tid,level,tags:filecount=50 Main
main-full.bat:
c:\windows\system32\cmd.exe /C start /affinity FF \jdk-14\bin\java
-Xlog:all=trace:file=app-full.txt:uptime,tid,level,tags:filecount=50 Main
The files are also available here:
https://palacsint.hu/etc/jdk/jdk14-single-cpu-slowdown-reprod-poc/
Or in a zip:
https://palacsint.hu/etc/jdk/jdk14-single-cpu-slowdown-reprod-poc.zip
You can find trace level logs of three testruns in the run1-2-3
directories as well.
Best,
Miklos
On 4/29/20 2:18 AM, Patricio Chilano wrote:
> Hi Miklos,
>
> On 4/28/20 8:04 AM, Karakó Miklós wrote:
>> Hello David,
>>
>> I've tried with OpenJDK 15 EA Build 20. Unfortunately it's slow as
>> with OpenJDK14.
>> A colleague of mine set CPU affinity with taskset to only one CPU on
>> Linux and did not experience the slowdown. So the issue seems
>> Windows-related.
> I wouldn’t rule out this being a scheduling issue. With thread-local
> handshakes some operations like the ones you see in your logs
> (deoptimization, biasedlocking revocation) do not need all JavaThreads
> to be stopped at the same time in order to be executed. If you only
> have one CPU but many JavaThreads running, depending on the scheduler,
> maybe it could happen that the JavaThread you need to handshake is
> constantly being left behind in the scheduler queue and so the
> operation takes longer to execute. When you add
> -XX:-ThreadLocalHandshakes, those operations will be executed using
> safepoints, which means all JavaThreads are stopped. So even if the
> scheduler decides to give priority to other JavaThreads those will
> block and free the CPU. Also when adding more CPU that will increase
> the likelihood of a JavaThread getting scheduled and might explain why
> you see it gets fixed.
>
> Alternatively the JavaThread that needs to be handshaked is being
> scheduled but is not polling for the handshake. But I think in that
> the case you should still have the same issue with
> -XX:-ThreadLocalHandshakes because the polling mechanism is the same,
> although it's true that the misbehaving JavaThread will have more CPU
> time to complete whatever is doing while others are stopped when
> safepointing.
>
> One thing that makes me question the scheduler theory is the fact that
> you see such improvements when only disabling biased-locking in 13.0.2
> because that was still using safepoints back then. Can you check the
> logs to see in which operations you see the pauses when running with
> default options compared to -XX:-UseBiasedLocking?
>
> I’m also surprised you see those long logging pauses because all the
> other JavaThreads were already processed in the handshake so something
> must be executing. Maybe you can switch the logging to trace instead
> of debug. Also maybe you can add logging at Java code so you can
> follow what is actually running? You could send it to the same app.txt
> and flush every time you write to see it ordered in the log. You can
> also get the uptime you see in the UL logs with
> ManagementFactory.getRuntimeMXBean().getUptime(). That logging will
> affect timing but might give some insights.
>
> Thanks,
> Patricio
>> Miklos
>>
>> On 4/28/20 1:02 AM, David Holmes wrote:
>>> Hi Miklos,
>>>
>>> On 28/04/2020 7:26 am, Karakó Miklós wrote:
>>>> Hello,
>>>>
>>>> We bumped into a possible Thread-Local Handshakes issue with
>>>> multiple apps. It seems that enabled TLH slows down applications
>>>> running on (although rare) single CPU boxes. I would be grateful if
>>>> you could confirm that this is a known trade-off for these rare
>>>> setups or a possible JVM bug. That would save us at least a few
>>>> hours of debugging.
>>>>
>>>> TL;DR: Both tested apps stop frequently around
>>>> "HandshakeOneThread", "HandshakeAllThreads" and "Revoked bias of
>>>> currently-unlocked object" running with
>>>> OpenJDK12/OpenJDK13/OpenJDK14. OpenJDK 13 with
>>>> -XX:-ThreadLocalHandshakes fixes the issue. Adding a second CPU to
>>>> the virtual machine fixes the issue. Enabling hyper-threading fixes
>>>> the issue.
>>>
>>> Can you try with latest JDK 15 build, just to see if this may be
>>> something already addressed?
>>>
>>> Thanks,
>>> David
>>>
>>>> More details are available at StackOverflow:
>>>> https://stackoverflow.com/questions/61375565/slow-application-frequent-jvm-hangs-with-single-cpu-setups-and-java-12
>>>>
>>>>
>>>> All thoughts are welcome.
>>>>
>>>> Best,
>>>> Miklos
>>>>
>>
>
More information about the hotspot-dev
mailing list