RFR(M) 8150353: PPC64LE: Support RTM on linux

Tue Mar 8 11:33:01 UTC 2016

Hi Vladimir,

thanks for the explanation and for sponsoring the change.

We have noticed that jbb2005 benefits both from RTM stack locking (with RTM deopt) and from Biased Locking, so we thought it would be nice to have both. But UseRTMForStackLocks is still experimental. It seems there are plenty of things people could experiment with in the future.

Best regards,
Martin

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] 
Sent: Monday, March 7, 2016 19:18
To: Doerr, Martin <martin.doerr at sap.com>; Gustavo Romero <gromero at linux.vnet.ibm.com>; hotspot-dev at openjdk.java.net
Cc: brenohl at br.ibm.com
Subject: Re: RFR(M) 8150353: PPC64LE: Support RTM on linux

RTM's assumption is: "RTM locking is most useful when there is high lock 
contention and low data contention. With high lock contention the lock 
is usually inflated and biased locking is not suitable for that case 
anyway." It is not the case with jbb2005. And that is why RTM is off by 
default.

The first RTM implementation used the BiasedLocking bits in the object's 
mark word. Later it was implemented differently, but there was still a 
concern that making the two work together might need more changes. We 
thought we would do that as a separate project later, but currently we 
have no plan for doing so.

Regards,
Vladimir

On 3/7/16 2:29 AM, Doerr, Martin wrote:
> Hi Vladimir,
>
> thank you very much for the detailed analysis.
> I hope an #ifdef PPC64 is ok in the shared code?
>
> I had written something to Gustavo about the performance problem we have with RTM in SPEC jbb2005:
>
>> The following issue is important for performance work:
>> RTM does not work with BiasedLocking. The latter gets switched off when RTM is activated, which has a large performance impact (especially in jbb2005).
>> I would disable it for a reference measurement:
>> -XX:-UseBiasedLocking
>>
>> Unfortunately, RTM was slower than BiasedLocking but faster than the reference (with neither enabled), which tells me that there's room for improvement.
>> There are basically 3 classes of locks:
>> 1. no contention
>> 2. contention on lock, low contention on data
>> 3. high contention on data
>>
>> I believe the optimal treatment for the cases would be:
>> 1. Biased Locking
>> 2. Transactional Memory
>> 3. classical locking with lock inflating
>>
>> I think it would be good if the JVM could optimize for all these cases in the future. But that would add additional complexity and code size.
>
> Do you think this is something which should be improved in the future?
> We could try, e.g., the following approach:
> - try biased
> - deoptimize if it doesn't work well, try transactional
> - deoptimize if it doesn't work well, use classical locking (with inflating)
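
The fallback chain proposed above (biased, then transactional, then classical locking) can be sketched as a small decision function. This is only an illustration under stated assumptions: none of the names below exist in HotSpot, the "failure" counter stands for whatever signal would trigger a deoptimization (bias revocations or transaction aborts), and the default thresholds are arbitrary placeholders.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this is not HotSpot code.
enum class LockStrategy { Biased, Transactional, Classical };

struct LockProfile {
  std::uint64_t attempts;  // lock acquisitions seen under the current strategy
  std::uint64_t failures;  // bias revocations or transaction aborts
};

// Next step in the chain Biased -> Transactional -> Classical.
// Classical locking (with inflation) is the end of the chain.
constexpr LockStrategy next_strategy(LockStrategy s) {
  return s == LockStrategy::Biased ? LockStrategy::Transactional
                                   : LockStrategy::Classical;
}

// Stay on the current strategy until enough samples exist, then fall back
// one step if the failure ratio reaches the limit. Both thresholds are
// assumed tuning knobs with placeholder defaults.
inline LockStrategy choose_strategy(LockStrategy current,
                                    const LockProfile& p,
                                    std::uint64_t min_samples = 1000,
                                    std::uint64_t max_failure_percent = 50) {
  assert(max_failure_percent <= 100);
  if (current == LockStrategy::Classical) return current;
  if (p.attempts < min_samples) return current;  // not enough data yet
  const bool failing = p.failures * 100 >= p.attempts * max_failure_percent;
  return failing ? next_strategy(current) : current;
}
```

A lock with 2000 attempts and 1500 failures would move from Biased to Transactional on the first evaluation and, if the aborts persist, from Transactional to Classical on a later one. In a real JVM each step would presumably be a deoptimization followed by recompilation, which this sketch does not model.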
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Friday, February 26, 2016 03:24
> To: Doerr, Martin <martin.doerr at sap.com>; Gustavo Romero <gromero at linux.vnet.ibm.com>; hotspot-dev at openjdk.java.net
> Cc: brenohl at br.ibm.com
> Subject: Re: RFR(M) 8150353: PPC64LE: Support RTM on linux
>
> The problem with increasing the ScratchBufferBlob size is that with Tiered
> compilation we scale the number of compiler threads based on CPU count and
> increase the space in the CodeCache accordingly:
>
>     code_buffers_size += c2_count * C2Compiler::initial_code_buffer_size();
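
To make the scaling concern concrete, here is a minimal model of the reservation quoted above. It assumes only what the quoted line states, namely that each C2 compiler thread reserves one initial code buffer. The function and parameter names are hypothetical, and the byte values in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical model, not HotSpot code: total space reserved when every
// C2 compiler thread gets one initial code buffer of buffer_size bytes.
inline std::size_t reserved_for_c2(std::size_t c2_count,
                                   std::size_t buffer_size) {
  // Guard against overflow in the multiplication.
  assert(buffer_size == 0 ||
         c2_count <= std::numeric_limits<std::size_t>::max() / buffer_size);
  return c2_count * buffer_size;
}

// Additional reservation caused by growing each buffer from old_size to
// new_size bytes.
inline std::size_t extra_reservation(std::size_t c2_count,
                                     std::size_t old_size,
                                     std::size_t new_size) {
  assert(new_size >= old_size);
  return reserved_for_c2(c2_count, new_size) -
         reserved_for_c2(c2_count, old_size);
}
```

With 8 C2 threads, for example, growing each buffer by 1024 bytes adds 8 * 1024 = 8192 bytes to the reservation, so the cost grows linearly with the number of compiler threads.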
>
> I ran an experiment on Intel, turning ON all RTM flags that can increase
> the size of the lock code:
>
> $ java -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions
> -XX:+UseRTMLocking -XX:+UseRTMDeopt -XX:+UseRTMForStackLocks
> -XX:+PrintPreciseRTMLockingStatistics -XX:+PrintFlagsFinal -version
> |grep RTM
> Java HotSpot(TM) 64-Bit Server VM warning: UseRTMLocking is only
> available as experimental option on this platform.
>        bool PrintPreciseRTMLockingStatistics         := true     {C2 diagnostic}
>        intx RTMAbortRatio                             = 50       {ARCH experimental}
>        intx RTMAbortThreshold                         = 1000     {ARCH experimental}
>        intx RTMLockingCalculationDelay                = 0        {ARCH experimental}
>        intx RTMLockingThreshold                       = 10000    {ARCH experimental}
>       uintx RTMRetryCount                             = 5        {ARCH product}
>        intx RTMSpinLoopCount                          = 100      {ARCH experimental}
>        intx RTMTotalCountIncrRate                     = 64       {ARCH experimental}
>        bool UseRTMDeopt                              := true     {ARCH product}
>        bool UseRTMForStackLocks                      := true     {ARCH experimental}
>        bool UseRTMLocking                            := true     {ARCH product}
>        bool UseRTMXendForLockBusy                     = true     {ARCH experimental}
>
>
> I added the following lines to the end of the Compile::scratch_emit_size() method:
>
>     if (n->is_Mach() && n->as_Mach()->ideal_Opcode() == Op_FastLock) {
>       tty->print_cr("======== FastLock size:  %d  ==========",
>                     buf.total_content_size());
>     }
>     if (n->is_Mach() && n->as_Mach()->ideal_Opcode() == Op_FastUnlock) {
>       tty->print_cr("======== FastUnlock size:  %d  ==========",
>                     buf.total_content_size());
>     }
>
> and got:
>
> ======== FastLock size:  657  ==========
> ======== FastUnlock size:  175  ==========
>
> Thanks,
> Vladimir
>
> On 2/25/16 3:43 AM, Doerr, Martin wrote:
>> Hi Vladimir,
>>
>> thanks for taking a look.
>>
>> About version values:
>> We are using a similar scheme for version checks on AIX, where we know that the version values are less than 256.
>> It makes comparisons much more convenient.
>> But I agree that we should double-check whether this is guaranteed for Linux as well (and possibly add an assertion).
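
As an illustration of the scheme discussed here, the following sketch packs x.y.z into one integer with 8 bits per component and adds the suggested assertion. It is not the code from the webrev: the helper names are hypothetical, and the version numbers in the usage note are example values only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs a version x.y.z into a single integer,
// 8 bits per component. Only valid if every component is below 256.
constexpr std::uint32_t encode_version(std::uint32_t major,
                                       std::uint32_t minor,
                                       std::uint32_t patch) {
  return (major << 16) | (minor << 8) | patch;
}

// Checked variant with the assertion suggested in the discussion: a
// component of 256 or more would spill into the neighboring field.
inline std::uint32_t encode_version_checked(std::uint32_t major,
                                            std::uint32_t minor,
                                            std::uint32_t patch) {
  assert(major < 256 && minor < 256 && patch < 256 &&
         "version component does not fit into 8 bits");
  return encode_version(major, minor, patch);
}

// With the packed form, "at least x.y.z" is one integer comparison.
inline bool version_at_least(std::uint32_t running, std::uint32_t required) {
  return running >= required;
}
```

For example, version_at_least(encode_version_checked(4, 4, 1), encode_version(4, 2, 0)) holds, while the same check for 3.19.8 does not. Keeping three separate int values avoids the range restriction, at the cost of a three-way comparison.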
>>
>> About scratch buffer size:
>> We only noticed that the scratch buffer was too small when we enabled all RTM features:
>> -XX:+UnlockExperimentalVMOptions -XX:+UseRTMLocking -XX:+UseRTMForStackLocks -XX:+UseRTMDeopt
>> We have only tried on PPC64, but I wonder if the current size is sufficient for x86. I currently don't have access to a Skylake machine.
>>
>> I think adding 1024 bytes to the scratch buffer doesn't hurt.
>> (It may also lead to larger CodeBuffers in output.cpp but I don't think this is problematic as long as the real content gets copied to nmethods.)
>> Would you agree?
>>
>> Best regards,
>> Martin
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Thursday, February 25, 2016 00:54
>> To: Gustavo Romero <gromero at linux.vnet.ibm.com>; Doerr, Martin <martin.doerr at sap.com>; hotspot-dev at openjdk.java.net
>> Cc: brenohl at br.ibm.com
>> Subject: Re: RFR(M) 8150353: PPC64LE: Support RTM on linux
>>
>> My concern (but I am not an expert) is the Linux version encoding. Is it true
>> that each value in x.y.z is less than 256? Why not keep them as separate
>> int values?
>> I also thought we had OS versions in the make files, but we check only the gcc
>> version there.
>>
>> Do you have the problem with ScratchBufferBlob only on PPC or on some other
>> platforms too? Maybe we should make MAX_inst_size a platform-specific
>> value.
>>
>> Thanks,
>> Vladimir
>>
>> On 2/24/16 11:50 AM, Gustavo Romero wrote:
>>> Hi Martin,
>>>
>>> Both the little-endian and big-endian Linux kernels contain the syscall change, so
>>> I did not include:
>>>
>>> #if defined(COMPILER2) && (defined(AIX) || defined(VM_LITTLE_ENDIAN))
>>>
>>> in globalDefinitions_ppc.hpp.
>>>
>>> Please, could you review the following change?
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8150353
>>> Webrev (hotspot): http://81.de.7a9f.ip4.static.sl-reverse.com/webrev/
>>>
>>> Summary:
>>>
>>> * Enable RTM support for Linux on PPC64 (LE and BE).
>>> * Fix C2 compiler buffer size issue.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Gustavo
>>>