Thread-Local Handshakes slowdown on single-cpu machines
Robbin Ehn
robbin.ehn at oracle.com
Wed Apr 29 13:32:18 UTC 2020
Hi David,
>> I have tested at least three different approaches trying to ease off the
>> CPU; all of them suffered from performance regressions on MP machines.
>> We have internally seen this on very rare occasions on MP (we do not
>> have UPs). Since this has been so rare, is not trivial to fix without a
>> performance regression, and I was under the assumption that UP was
>> basically dead :), it has not been prioritized.
>
> So I have to wonder why we did not use the same VMThread back_off()
> strategy with handshakes?
That is one of the variants I have tested.
I think the main problem is that we do not consider whether we made any
progress. Another big difference is that if the VM thread backs off the
CPU during a handshake, a thread that has already completed the handshake
may run instead of the VM thread.
I'm coding a draft of an algorithm that takes into account whether we
made any progress. That involves knowing, for each thread, why the
operation has not yet been executed. So any backport will not be
trivial.
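For illustration, the progress-aware idea could look roughly like the
following minimal, self-contained Java simulation (this is not the HotSpot
code; the worker setup and the awaitAll helper are made up). The
coordinator cedes the CPU only when a full pass over the targets shows no
new completions, i.e. when no progress was made:

```java
public class BackoffSketch {
    // Wait until every `completed` flag is set, yielding the CPU only
    // when a pass over the flags shows no new completions (no progress).
    // Returns how many times we yielded.
    static int awaitAll(boolean[] completed) {
        int known = 0;   // completions seen on the previous pass
        int yields = 0;
        while (known < completed.length) {
            int seen = 0;
            synchronized (completed) {
                for (boolean c : completed) if (c) seen++;
            }
            if (seen == known) {
                // No progress in this pass: back off so the pending
                // threads (or anything else) can run, instead of spinning.
                Thread.yield();
                yields++;
            }
            known = seen;
        }
        return yields;
    }

    public static void main(String[] args) {
        final boolean[] completed = new boolean[4];
        for (int i = 0; i < completed.length; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                // Simulate each target thread executing its handshake
                // after some delay.
                try { Thread.sleep(5L * (id + 1)); } catch (InterruptedException ignored) {}
                synchronized (completed) { completed[id] = true; }
            });
            t.setDaemon(true);
            t.start();
        }
        awaitAll(completed);
        System.out.println("all handshakes completed");
    }
}
```

On a UP machine this degenerates to yielding on almost every pass (the
coordinator cannot make progress while it holds the only CPU), while on MP
it keeps spinning as long as completions keep arriving.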
From what I recall, the performance regression was mostly in startup.
During startup, +-BiasedLocking and early deoptimization play a major
role in whether, and how many, handshakes we need to perform.
I'll create a jira issue for this.
What do you think about fixing just the UP case first with something simple like the code below?
Btw, I think the code should be:
if (!os::is_MP() && !_op->is_completed()) {
  os::naked_yield();
}
And it's also needed in the single-target-thread case.
Thanks, Robbin
>
> Cheers,
> David
> -----
>
>> To just fix the UP case we could add something like:
>> diff -r f7919010c7c2 src/hotspot/share/runtime/handshake.cpp
>> --- a/src/hotspot/share/runtime/handshake.cpp Tue Apr 28 09:14:17 2020 +0200
>> +++ b/src/hotspot/share/runtime/handshake.cpp Wed Apr 29 09:36:42 2020 +0200
>> @@ -200,2 +200,5 @@
>> }
>> + if (!os::is_MP()) {
>> + os::naked_yield();
>> + }
>> } while (!_op->is_completed());
>>
>> Since no other thread can make progress (assuming that one iteration is
>> much shorter than a time-slice; anything else is extremely unlikely),
>> there is no need for a second iteration until we have executed something else.
>>
>> And should be trivial to backport to JDK 11.
>>
>> Thanks, Robbin
>>
>>
>>
>>> The "spinning" is intended to be done in a scheduler-friendly way via:
>>>
>>> void os::naked_short_nanosleep(jlong ns) {
>>>   assert(ns > -1 && ns < NANOUNITS, "Un-interruptable sleep, short
>>>          time use only");
>>>
>>>   int64_t start = os::javaTimeNanos();
>>>   do {
>>>     if (SwitchToThread() == 0) {
>>>       // Nothing else is ready to run on this cpu, spin a little
>>>       SpinPause();
>>>     }
>>>   } while (os::javaTimeNanos() - start < ns);
>>> }
>>>
>>> but this won't be that scheduler-friendly because it places the
>>> VMThread at the head of the priority queue, so it will cede up-to a
>>> timeslice to another thread, then run another iteration of the loop
>>> and hence alternate between a JavaThread and the VMThread until they
>>> are all blocked at the safepoint. What is really wanted is for the
>>> VMThread to go to the back of the scheduling queue.
>>>
>>> David
>>> -----
>>>
>>>> I did not expect that, but I could reproduce the issue with a simple
>>>> hello world:
>>>>
>>>> public class Main {
>>>>     public static void main(String[] args) {
>>>>         System.out.println("Hello world");
>>>>     }
>>>> }
>>>>
>>>> A few measurements on Windows 10 x64 in VirtualBox:
>>>>
>>>>
>>>> PS Z:\main> Measure-Command { .\main-1cpu.bat }
>>>>
>>>> Minutes : 0
>>>> Seconds : 7
>>>> Milliseconds : 20
>>>> Ticks : 70203455
>>>> TotalMinutes : 0.117005758333333
>>>> TotalSeconds : 7.0203455
>>>> TotalMilliseconds : 7020.3455
>>>>
>>>>
>>>>
>>>> PS Z:\main> Measure-Command { .\main-full.bat }
>>>>
>>>> Minutes : 0
>>>> Seconds : 1
>>>> Milliseconds : 575
>>>> Ticks : 15751352
>>>> TotalSeconds : 1.5751352
>>>> TotalMilliseconds : 1575.1352
>>>>
>>>>
>>>>
>>>> PS Z:\main> Measure-Command { .\main-full.bat }
>>>>
>>>> Minutes : 0
>>>> Seconds : 1
>>>> Milliseconds : 558
>>>> Ticks : 15585384
>>>> TotalSeconds : 1.5585384
>>>> TotalMilliseconds : 1558.5384
>>>>
>>>>
>>>>
>>>> PS Z:\main> Measure-Command { .\main-1cpu.bat }
>>>>
>>>> Minutes : 0
>>>> Seconds : 23
>>>> Milliseconds : 648
>>>> Ticks : 236482685
>>>> TotalSeconds : 23.6482685
>>>> TotalMilliseconds : 23648.2685
>>>>
>>>>
>>>> main-1cpu.bat:
>>>>
>>>> c:\windows\system32\cmd.exe /C start /affinity 1 \jdk-14\bin\java
>>>> -Xlog:all=trace:file=app-1cpu.txt:uptime,tid,level,tags:filecount=50
>>>> Main
>>>>
>>>> main-full.bat:
>>>>
>>>> c:\windows\system32\cmd.exe /C start /affinity FF \jdk-14\bin\java
>>>> -Xlog:all=trace:file=app-full.txt:uptime,tid,level,tags:filecount=50
>>>> Main
>>>>
>>>> The files are also available here:
>>>> https://palacsint.hu/etc/jdk/jdk14-single-cpu-slowdown-reprod-poc/
>>>> Or in a zip:
>>>> https://palacsint.hu/etc/jdk/jdk14-single-cpu-slowdown-reprod-poc.zip
>>>>
>>>> You can find trace level logs of three testruns in the run1-2-3
>>>> directories as well.
>>>>
>>>> Best,
>>>> Miklos
>>>>
>>>>
>>>> On 4/29/20 2:18 AM, Patricio Chilano wrote:
>>>>> Hi Miklos,
>>>>>
>>>>> On 4/28/20 8:04 AM, Karakó Miklós wrote:
>>>>>> Hello David,
>>>>>>
>>>>>> I've tried with OpenJDK 15 EA Build 20. Unfortunately it's slow as
>>>>>> with OpenJDK14.
>>>>>> A colleague of mine set CPU affinity with taskset to only one CPU
>>>>>> on Linux and did not experience the slowdown. So the issue seems
>>>>>> Windows-related.
>>>>> I wouldn’t rule out this being a scheduling issue. With
>>>>> thread-local handshakes, some operations like the ones you see in
>>>>> your logs (deoptimization, biased-locking revocation) do not need
>>>>> all JavaThreads to be stopped at the same time in order to be
>>>>> executed. If you only have one CPU but many JavaThreads running,
>>>>> then, depending on the scheduler, the JavaThread you need to
>>>>> handshake may be constantly left behind in the scheduler queue, so
>>>>> the operation takes longer to execute.
>>>>> When you add -XX:-ThreadLocalHandshakes, those operations are
>>>>> executed using safepoints, which means all JavaThreads are stopped.
>>>>> So even if the scheduler decides to give priority to other
>>>>> JavaThreads, those will block and free the CPU. Also, adding more
>>>>> CPUs increases the likelihood of the JavaThread getting scheduled,
>>>>> which might explain why you see the issue go away.
>>>>>
>>>>> Alternatively, the JavaThread that needs to be handshaked is being
>>>>> scheduled but is not polling for the handshake. But I think in that
>>>>> case you would still have the same issue with
>>>>> -XX:-ThreadLocalHandshakes, because the polling mechanism is the
>>>>> same, although it's true that the misbehaving JavaThread will have
>>>>> more CPU time to complete whatever it is doing while the others are
>>>>> stopped at the safepoint.
>>>>>
>>>>> One thing that makes me question the scheduler theory is that you
>>>>> see such improvements when only disabling biased locking in 13.0.2,
>>>>> because that version was still using safepoints. Can you check the
>>>>> logs to see in which operations you see the pauses when running
>>>>> with default options compared to -XX:-UseBiasedLocking?
>>>>>
>>>>> I’m also surprised you see those long logging pauses, because all
>>>>> the other JavaThreads were already processed in the handshake, so
>>>>> something must be executing. Maybe you can switch the logging to
>>>>> trace instead of debug. Also, maybe you can add logging in the Java
>>>>> code so you can follow what is actually running? You could send it
>>>>> to the same app.txt and flush on every write to see it ordered in
>>>>> the log. You can also get the uptime you see in the UL logs with
>>>>> ManagementFactory.getRuntimeMXBean().getUptime(). That logging will
>>>>> affect timing but might give some insights.
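That last suggestion could be done with something like the following
(a rough sketch of the earlier Main repro, not Patricio's exact proposal;
the file name app.txt and the message format are placeholders):

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

public class Main {
    public static void main(String[] args) throws Exception {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        // Append to the same file the -Xlog output goes to, with
        // autoflush enabled so every printf is flushed immediately and
        // the Java-side events interleave with the UL entries in order.
        try (PrintWriter log = new PrintWriter(new FileWriter("app.txt", true), true)) {
            // getUptime() returns milliseconds since VM start, matching
            // the uptime decorator in the unified logging output.
            log.printf("[%.3fs] before println%n", runtime.getUptime() / 1000.0);
            System.out.println("Hello world");
            log.printf("[%.3fs] after println%n", runtime.getUptime() / 1000.0);
        }
    }
}
```

Timestamps written this way can then be matched against the uptime
decorator of the surrounding UL entries to see what the application was
doing during a pause.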
>>>>>
>>>>> Thanks,
>>>>> Patricio
>>>>>> Miklos
>>>>>>
>>>>>> On 4/28/20 1:02 AM, David Holmes wrote:
>>>>>>> Hi Miklos,
>>>>>>>
>>>>>>> On 28/04/2020 7:26 am, Karakó Miklós wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We bumped into a possible Thread-Local Handshakes issue with
>>>>>>>> multiple apps. It seems that TLH, when enabled, slows down
>>>>>>>> applications running on (admittedly rare) single-CPU boxes. I
>>>>>>>> would be grateful if you could confirm that this is a known
>>>>>>>> trade-off for these rare setups or a possible JVM bug. That
>>>>>>>> would save us at least a few hours of debugging.
>>>>>>>>
>>>>>>>> TL;DR: Both tested apps stop frequently around
>>>>>>>> "HandshakeOneThread", "HandshakeAllThreads" and "Revoked bias of
>>>>>>>> currently-unlocked object" running with
>>>>>>>> OpenJDK12/OpenJDK13/OpenJDK14. OpenJDK 13 with
>>>>>>>> -XX:-ThreadLocalHandshakes fixes the issue. Adding a second CPU
>>>>>>>> to the virtual machine fixes the issue. Enabling hyper-threading
>>>>>>>> fixes the issue.
>>>>>>>
>>>>>>> Can you try with latest JDK 15 build, just to see if this may be
>>>>>>> something already addressed?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> More details are available at StackOverflow:
>>>>>>>> https://stackoverflow.com/questions/61375565/slow-application-frequent-jvm-hangs-with-single-cpu-setups-and-java-12
>>>>>>>>
>>>>>>>>
>>>>>>>> All thoughts are welcome.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Miklos
>>>>>>>>
>>>>>>
>>>>>
>>>>
More information about the hotspot-dev mailing list