Thread-Local Handshakes slowdown on single-cpu machines

David Holmes david.holmes at oracle.com
Wed Apr 29 13:04:11 UTC 2020


Hi Robbin,

On 29/04/2020 5:45 pm, Robbin Ehn wrote:
> Hi,
> 
> On 2020-04-29 07:31, David Holmes wrote:
>> On 29/04/2020 1:26 pm, Karakó Miklós wrote:
>>> Hi Patricio,
>>>
>>> Yes, it seems like a scheduling issue for us too. One of my 
>>> colleagues spotted the following sentence in Robbin Ehn's Handshaking 
>>> HotSpot presentation (https://youtu.be/VBCOfAJ409s?t=2222): ~"if the 
>>> VM thread finds that all threads are unsafe, it keeps spinning"
>>> Could it be that actually the VM thread gets more CPU time and 
>>> actually blocks JavaThread?
>>
> 
> At the moment handshakes do not use yield or sleep.
> http://hg.openjdk.java.net/jdk/jdk/file/ddb12e77e8dd/src/hotspot/share/runtime/handshake.cpp 
> 
> Do while loop L184 -> L201.
> 
> I have tested at least three different approaches trying to ease off the
> CPU; all of them suffered from performance regressions on MP machines.
> We have internally seen this on very rare occasions on MP (we do not
> have UPs). Since this has been so rare and non-trivial to fix without a
> performance regression, and I was under the assumption that UPs were
> basically dead :), it has not been prioritized.

So I have to wonder why we did not use the same VMThread back_off() 
strategy with handshakes?

Cheers,
David
-----

> To just fix the UP case we could add something like:
> diff -r f7919010c7c2 src/hotspot/share/runtime/handshake.cpp
> --- a/src/hotspot/share/runtime/handshake.cpp   Tue Apr 28 09:14:17 2020 
> +0200
> +++ b/src/hotspot/share/runtime/handshake.cpp   Wed Apr 29 09:36:42 2020 
> +0200
> @@ -200,2 +200,5 @@
>         }
> +      if (!os::is_MP()) {
> +        os::naked_yield();
> +      }
>       } while (!_op->is_completed());
> 
> Since no thread can progress, there is no need for a second iteration
> until we have executed something else (and assuming one iteration is
> much shorter than a time-slice, needing a second iteration is extremely
> unlikely anyway).
> 
> And it should be trivial to backport to JDK 11.
> 
> Thanks, Robbin
> 
> 
> 
>> The "spinning" is intended to be done in a scheduler-friendly way via:
>>
>> void os::naked_short_nanosleep(jlong ns) {
>>    assert(ns > -1 && ns < NANOUNITS, "Un-interruptable sleep, short 
>> time use only");
>>
>>    int64_t start = os::javaTimeNanos();
>>    do {
>>      if (SwitchToThread() == 0) {
>>        // Nothing else is ready to run on this cpu, spin a little
>>        SpinPause();
>>      }
>>    } while (os::javaTimeNanos() - start < ns);
>> }
>>
>> but this won't be that scheduler-friendly because it places the 
>> VMThread at the head of the priority queue, so it will cede up-to a 
>> timeslice to another thread, then run another iteration of the loop 
>> and hence alternate between a JavaThread and the VMThread until they 
>> are all blocked at the safepoint. What is really wanted is for the 
>> VMThread to go to the back of the scheduling queue.
>>
>> David
>> -----
>>
>>> I did not expect that, but I could reproduce the issue with a simple 
>>> hello world:
>>>
>>> public class Main {
>>>      public static void main(String[] args) {
>>>          System.out.println("Hello world");
>>>      }
>>> }
>>>
>>> A few measurements on Windows 10 x64 in VirtualBox:
>>>
>>>
>>> PS Z:\main> Measure-Command { .\main-1cpu.bat }
>>>
>>> Minutes           : 0
>>> Seconds           : 7
>>> Milliseconds      : 20
>>> Ticks             : 70203455
>>> TotalMinutes      : 0.117005758333333
>>> TotalSeconds      : 7.0203455
>>> TotalMilliseconds : 7020.3455
>>>
>>>
>>>
>>> PS Z:\main> Measure-Command { .\main-full.bat }
>>>
>>> Minutes           : 0
>>> Seconds           : 1
>>> Milliseconds      : 575
>>> Ticks             : 15751352
>>> TotalSeconds      : 1.5751352
>>> TotalMilliseconds : 1575.1352
>>>
>>>
>>>
>>> PS Z:\main> Measure-Command { .\main-full.bat }
>>>
>>> Minutes           : 0
>>> Seconds           : 1
>>> Milliseconds      : 558
>>> Ticks             : 15585384
>>> TotalSeconds      : 1.5585384
>>> TotalMilliseconds : 1558.5384
>>>
>>>
>>>
>>> PS Z:\main> Measure-Command { .\main-1cpu.bat }
>>>
>>> Minutes           : 0
>>> Seconds           : 23
>>> Milliseconds      : 648
>>> Ticks             : 236482685
>>> TotalSeconds      : 23.6482685
>>> TotalMilliseconds : 23648.2685
>>>
>>>
>>> main-1cpu.bat:
>>>
>>> c:\windows\system32\cmd.exe /C start /affinity 1 \jdk-14\bin\java 
>>> -Xlog:all=trace:file=app-1cpu.txt:uptime,tid,level,tags:filecount=50 
>>> Main
>>>
>>> main-full.bat:
>>>
>>> c:\windows\system32\cmd.exe /C start /affinity FF \jdk-14\bin\java 
>>> -Xlog:all=trace:file=app-full.txt:uptime,tid,level,tags:filecount=50 
>>> Main
>>>
>>> The files are also available here:
>>> https://palacsint.hu/etc/jdk/jdk14-single-cpu-slowdown-reprod-poc/
>>> Or in a zip:
>>> https://palacsint.hu/etc/jdk/jdk14-single-cpu-slowdown-reprod-poc.zip
>>>
>>> You can find trace level logs of three testruns in the run1-2-3 
>>> directories as well.
>>>
>>> Best,
>>> Miklos
>>>
>>>
>>> On 4/29/20 2:18 AM, Patricio Chilano wrote:
>>>> Hi Miklos,
>>>>
>>>> On 4/28/20 8:04 AM, Karakó Miklós wrote:
>>>>> Hello David,
>>>>>
>>>>> I've tried with OpenJDK 15 EA Build 20. Unfortunately it's as slow 
>>>>> as with OpenJDK 14.
>>>>> A colleague of mine set CPU affinity with taskset to only one CPU 
>>>>> on Linux and did not experience the slowdown. So the issue seems 
>>>>> Windows-related.
>>>> I wouldn’t rule out this being a scheduling issue. With thread-local 
>>>> handshakes, some operations like the ones you see in your logs 
>>>> (deoptimization, biased-locking revocation) do not need all 
>>>> JavaThreads to be stopped at the same time in order to be executed. 
>>>> If you only have one CPU but many JavaThreads running, depending on 
>>>> the scheduler, the JavaThread you need to handshake may constantly 
>>>> be left behind in the scheduler queue, so the operation takes longer 
>>>> to execute. When you add -XX:-ThreadLocalHandshakes, those operations 
>>>> are executed using safepoints, which means all JavaThreads are 
>>>> stopped; even if the scheduler decides to give priority to other 
>>>> JavaThreads, those will block and free the CPU. Also, adding more 
>>>> CPUs increases the likelihood of a JavaThread getting scheduled, 
>>>> which might explain why you see the issue disappear.
>>>>
>>>> Alternatively, the JavaThread that needs to be handshaked is being 
>>>> scheduled but is not polling for the handshake. But I think in that 
>>>> case you should still have the same issue with 
>>>> -XX:-ThreadLocalHandshakes, because the polling mechanism is the 
>>>> same, although it's true that the misbehaving JavaThread will have 
>>>> more CPU time to complete whatever it is doing while the others are 
>>>> stopped at the safepoint.
>>>>
>>>> One thing that makes me question the scheduler theory is the fact 
>>>> that you see such improvements when only disabling biased-locking in 
>>>> 13.0.2 because that was still using safepoints back then. Can you 
>>>> check the logs to see in which operations you see the pauses when 
>>>> running with default options compared to -XX:-UseBiasedLocking?
>>>>
>>>> I’m also surprised you see those long logging pauses, because all the 
>>>> other JavaThreads were already processed in the handshake, so 
>>>> something must be executing. Maybe you can switch the logging to 
>>>> trace instead of debug. Also, maybe you can add logging in the Java 
>>>> code so you can follow what is actually running? You could send it to 
>>>> the same app.txt and flush every time you write, to see it ordered in 
>>>> the log. You can also get the uptime you see in the UL logs with 
>>>> ManagementFactory.getRuntimeMXBean().getUptime(). That logging will 
>>>> affect timing but might give some insights.
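This suggestion could look like the following in application code (a minimal sketch; the class and method names are mine, only `getUptime()` comes from the thread):

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper: prefix application log lines with the JVM uptime
// (the same clock shown in -Xlog/UL output) so they can be interleaved
// with the VM's own log, and flush immediately to preserve ordering.
public class UptimeLog {
    static String stamp(String msg) {
        // Milliseconds since JVM start, as reported by the runtime MXBean.
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return String.format("[%.3fs] %s", uptimeMs / 1000.0, msg);
    }

    public static void main(String[] args) {
        System.out.println(stamp("about to take lock"));
        // Flush after each line so the shared log file stays ordered.
        System.out.flush();
    }
}
```

Writing these lines into the same file as the UL output makes it possible to see which Java code was running around each handshake entry.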
>>>>
>>>> Thanks,
>>>> Patricio
>>>>> Miklos
>>>>>
>>>>> On 4/28/20 1:02 AM, David Holmes wrote:
>>>>>> Hi Miklos,
>>>>>>
>>>>>> On 28/04/2020 7:26 am, Karakó Miklós wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> We bumped into a possible Thread-Local Handshakes issue with 
>>>>>>> multiple apps. It seems that enabling TLH slows down applications 
>>>>>>> running on (admittedly rare) single-CPU boxes. I would be grateful 
>>>>>>> if you could confirm whether this is a known trade-off for these 
>>>>>>> rare setups or a possible JVM bug. That would save us at least a 
>>>>>>> few hours of debugging.
>>>>>>>
>>>>>>> TL;DR: Both tested apps stop frequently around 
>>>>>>> "HandshakeOneThread", "HandshakeAllThreads" and "Revoked bias of 
>>>>>>> currently-unlocked object" running with 
>>>>>>> OpenJDK12/OpenJDK13/OpenJDK14. OpenJDK 13 with 
>>>>>>> -XX:-ThreadLocalHandshakes fixes the issue. Adding a second CPU 
>>>>>>> to the virtual machine fixes the issue. Enabling hyper-threading 
>>>>>>> fixes the issue.
>>>>>>
>>>>>> Can you try with latest JDK 15 build, just to see if this may be 
>>>>>> something already addressed?
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>> More details are available at StackOverflow: 
>>>>>>> https://stackoverflow.com/questions/61375565/slow-application-frequent-jvm-hangs-with-single-cpu-setups-and-java-12 
>>>>>>>
>>>>>>>
>>>>>>> All thoughts are welcome.
>>>>>>>
>>>>>>> Best,
>>>>>>> Miklos
>>>>>>>
>>>>>
>>>>
>>>
