[aarch64-port-dev ] RFR: 8216350: AArch64: monitor unlock fast path not called

Derek White derekw at marvell.com
Wed Jan 9 02:19:19 UTC 2019


Hi Nick,

Very nice find!

My only comments are to fix up some comments (pre-existing), and some trivial cleanups of pre-existing code. These are judgement calls, and it would be good to get the approval of at least one Andrew.

Comments:
1) The TODO comment before aarch64_enc_fast_unlock() has been done since 2014, so it can go away.

2) In aarch64_enc_fast_lock() and aarch64_enc_fast_unlock(), there are three comment blocks referring to old code using cmpxchgptr. At this point in time I find the new code clearer, and these comments don't add much?

Cleanup suggestions (untested!):
3) In aarch64_enc_fast_lock():
    // we can use AArch64's bit test and branch here but
    // markoopDesc does not define a bit index just the bit value
    // so assert in case the bit pos changes
#   define __monitor_value_log2 1
    assert(markOopDesc::monitor_value == (1 << __monitor_value_log2), "incorrect bit position");
    __ tbnz(disp_hdr, __monitor_value_log2, object_has_monitor);
#   undef __monitor_value_log2

Can be replaced with:
     __ tbnz(disp_hdr, exact_log2(markOopDesc::monitor_value), object_has_monitor);
It looks like this was fixed in several places a long time ago, but this one got missed.

4)  Slightly better comment for last instruction of fast_unlock (and explicitly use zr).
    __ stlr(zr, tmp); // set unowned

- Derek


--------------------- Patch on original code (not your patch, sorry!) -----------------------------
--- src/hotspot/cpu/aarch64/aarch64.ad
+++ src/hotspot/cpu/aarch64/aarch64.ad
@@ -3418,13 +3418,7 @@
     }
 
     // Handle existing monitor
-    // we can use AArch64's bit test and branch here but
-    // markoopDesc does not define a bit index just the bit value
-    // so assert in case the bit pos changes
-#   define __monitor_value_log2 1
-    assert(markOopDesc::monitor_value == (1 << __monitor_value_log2), "incorrect bit position");
-    __ tbnz(disp_hdr, __monitor_value_log2, object_has_monitor);
-#   undef __monitor_value_log2
+    __ tbnz(disp_hdr, exact_log2(markOopDesc::monitor_value), object_has_monitor);
 
     // Set displaced_header to be (markOop of object | UNLOCK_VALUE).
     __ orr(disp_hdr, disp_hdr, markOopDesc::unlocked_value);
@@ -3455,14 +3449,6 @@
       __ b(retry_load);
     }
 
-    // Formerly:
-    // __ cmpxchgptr(/*oldv=*/disp_hdr,
-    //               /*newv=*/box,
-    //               /*addr=*/oop,
-    //               /*tmp=*/tmp,
-    //               cont,
-    //               /*fail*/NULL);
-
     assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0");
 
     // If the compare-and-exchange succeeded, then we found an unlocked
@@ -3511,15 +3497,6 @@
       __ bind(fail);
     }
 
-    // Label next;
-    // __ cmpxchgptr(/*oldv=*/disp_hdr,
-    //               /*newv=*/rthread,
-    //               /*addr=*/tmp,
-    //               /*tmp=*/rscratch1,
-    //               /*succeed*/next,
-    //               /*fail*/NULL);
-    // __ bind(next);
-
     // store a non-null value into the box.
     __ str(box, Address(box, BasicLock::displaced_header_offset_in_bytes()));
 
@@ -3544,9 +3521,6 @@
 
   %}
 
-  // TODO
-  // reimplement this with custom cmpxchgptr code
-  // which avoids some of the unnecessary branching
   enc_class aarch64_enc_fast_unlock(iRegP object, iRegP box, iRegP tmp, iRegP tmp2) %{
     MacroAssembler _masm(&cbuf);
     Register oop = as_Register($object$$reg);
@@ -3597,12 +3571,6 @@
         __ b(retry_load);
       }
 
-    // __ cmpxchgptr(/*compare_value=*/box,
-    //               /*exchange_value=*/disp_hdr,
-    //               /*where=*/oop,
-    //               /*result=*/tmp,
-    //               cont,
-    //               /*cas_failed*/NULL);
     assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0");
 
     __ bind(cas_failed);
@@ -3626,7 +3594,7 @@
     __ cbnz(rscratch1, cont);
     // need a release store here
     __ lea(tmp, Address(tmp, ObjectMonitor::owner_offset_in_bytes()));
-    __ stlr(rscratch1, tmp); // rscratch1 is zero
+    __ stlr(zr, tmp); // set unowned
 
     __ bind(cont);
     // flag == EQ indicates success


> -----Original Message-----
> From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On
> Behalf Of Nick Gasson (Arm Technology China)
> Sent: Tuesday, January 08, 2019 3:04 AM
> To: hotspot-compiler-dev at openjdk.java.net compiler <hotspot-compiler-
> dev at openjdk.java.net>
> Cc: nd <nd at arm.com>; aarch64-port-dev at openjdk.java.net
> Subject: [EXT] [aarch64-port-dev ] RFR: 8216350: AArch64: monitor unlock
> fast path not called
> 
> ----------------------------------------------------------------------
> Hi,
> 
> While looking at the profiling output of some micro-benchmarks for locking
> on AArch64, I noticed that the monitor unlock fast-path in
> aarch64_enc_fast_unlock in aarch64.ad (under label `object_has_monitor') is
> almost never executed, even though the lock in the test is inflated.
> 
> In order to branch to this fast-path we check if bit #1 is set in the displaced
> header word on the stack:
> 
>    __ tbnz(disp_hdr, exact_log2(markOopDesc::monitor_value),
> object_has_monitor);
> 
> But in the common case the value in the dhw is set by the monitor locking
> fast-path in aarch64_enc_fast_lock, where we use the pointer to the dhw as
> an arbitrary non-null value. But the lower three bits of this pointer will
> always be zero, and so won't trigger the unlock fast-path which is looking for
> bit #1 set, and we will fall through to call the runtime to unlock the monitor.
> 
>    // store a non-null value into the box.
>    __ str(box, Address(box, BasicLock::displaced_header_offset_in_bytes()));
> 
> It seems that the unlock fast-path will only be executed when the monitor
> was originally locked by the runtime (e.g. when the lock was first inflated),
> because ObjectSynchronizer::slow_enter will store
> markOopDesc::unused_mark into the dhw, and this value has bit #1 set.
> 
> Can someone help me review this change to aarch64_enc_fast_lock to use
> markOopDesc::unused_mark as the arbitrary non-null value rather than `box'
> to match ObjectSynchronizer::slow_enter?
> 
> Webrev: http://cr.openjdk.java.net/~njian/8216350/webrev.0/
> Bug: https://bugs.openjdk.java.net/browse/JDK-8216350
> 
> Also removed an unnecessary double branch in the unlock code.
> 
> Ran jtreg + jcstress.
> 
> I also added a new micro-benchmark to
> test/micro/org/openjdk/bench/vm/lang/LockUnlock.java so you can see this
> behaviour:
> 
> Without patch:
> 
> Result "org.openjdk.bench.vm.lang.LockUnlock.testContendedLock":
>    597.855 ?(99.9%) 73.183 ns/op [Average]
>    (min, avg, max) = (438.862, 597.855, 861.028), stdev = 97.697
>    CI (99.9%): [524.672, 671.038] (assumes normal distribution)
> 
> With patch:
> 
> Result "org.openjdk.bench.vm.lang.LockUnlock.testContendedLock":
>    219.067 ?(99.9%) 21.146 ns/op [Average]
>    (min, avg, max) = (176.379, 219.067, 300.186), stdev = 28.229
>    CI (99.9%): [197.921, 240.212] (assumes normal distribution)
> 
> This is with -XX:+UseLSE, -UseLSE has a similar improvement.
> 
> Thanks,
> Nick


More information about the hotspot-compiler-dev mailing list