RFR #2 (M) 8148146: Integrate new internal Unsafe entry points, and basic intrinsic support for VarHandles
Andrew Dinn
adinn at redhat.com
Tue Feb 16 21:29:04 UTC 2016
Hi Aleksey,
On 15/02/16 14:13, Andrew Dinn wrote:
> It would be very good to run your Unsafe tests. However, they may
> still succeed but minus the desired optimizations. So, I'll apply the
> patch to my tree and check the generated code by eyeball.
I ran the tests with Roland's patch and yours combined and they all
passed, but I am afraid that's not the full story. The AArch64
optimizations to use stlr for a volatile put and ldar for a volatile get
were not performed (the CAS optimization is still working). What is
more, the change you have made is actually causing invalid code to be
generated for the get operation.
This does not relate to Roland's patch. The same result will happen
with the code as it was prior to Roland's patch.
>> The changes are supposed to generate the same code for old Unsafe
>> methods -- the refactoring shuffles the compiler code around, but
>> the sequence of accesses/barriers should stay the same. Eyeballing
>> x86_64 assembly indeed shows it is the same, but I haven't looked
>> beyond x86.
Unfortunately, the graphs are not quite the same, and that affects the
generated code on AArch64 even though it has no visible effect on x86.
The critical difference is that for volatile puts and gets you have
omitted to insert the MemBarRelease and MemBarAcquire nodes.
Note that, unlike x86, AArch64 relies on these barriers to control
whether or not memory barrier instructions are inserted. That's true
whether or not the optimization rules are used (by setting
-XX:+/-UseBarriersForVolatile). If the optimization to use stlr and ldar
is switched off (UseBarriersForVolatile=true) then MemBarRelease and
MemBarAcquire are translated to the dmb instructions which enforce
memory synchronization. If the optimization is switched on then they
must still be present in order to detect cases where the dmb can be
elided and an stlr or ldar used instead. So, unfortunately, your change
breaks AArch64 with or without the ldar/stlr optimization scheme.
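By way of illustration, the gating works roughly like this (a minimal
sketch only -- kept_dmb_for() and is_recognized_volatile_shape() are
hypothetical names, not the actual aarch64.ad predicates):

  // Minimal sketch; helper names are hypothetical stand-ins for the
  // real aarch64.ad predicates.
  bool kept_dmb_for(const MemBarNode* barrier) {
    if (UseBarriersForVolatile) {
      return true;  // always lower MemBarRelease/MemBarAcquire to dmb ish
    }
    // with the optimization enabled, the dmb is elided only when the
    // membar is recognized as part of a volatile put/get subgraph --
    // the matching load/store rule then emits ldar/stlr instead
    return !is_recognized_volatile_shape(barrier);
  }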
Details of what is wrong and a potential fix are included below.
regards,
Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in UK and Wales under Company Registration No. 3798903
Directors: Michael Cunningham (US), Michael O'Neill (Ireland), Paul
Argiry (US)
Your changes to the graph layout are as follows:
Ignoring GC barriers, the old code generated the following subgraph for
an inlined Unsafe.putObjectVolatile
A)
MemBarRelease
MemBarCPUOrder
StoreX[mo_release]
MemBarVolatile
Your new code generates this subgraph
B)
MemBarCPUOrder
StoreX[mo_release]
MemBarVolatile
Subgraph A is consistent with the following subgraph generated from a
Java source level assignment of a volatile field
C)
MemBarRelease
StoreX[mo_release]
MemBarVolatile
whereas subgraph B is not consistent with C.
Similarly, for an inlined volatile get the old code generated this subgraph
A)
MemBarCPUOrder
|| \\
MemBarAcquire LoadX[mo_acquire]
MemBarCPUOrder
(the double bars represent control and memory links)
Your new code generates the following subgraph
B)
MemBarCPUOrder
|| \\
MemBarCPUOrder LoadX[mo_acquire]
By contrast the subgraph generated from a Java source level read of a
volatile field is
C)
LoadX[mo_acquire]
MemBarAcquire
or the minor variant
D)
LoadX[mo_acquire]
DecodeN
MemBarAcquire
The first change explains why the put tests pass but the optimization
fails. The AArch64 predicates which optimize volatile puts look for a
StoreX bracketed by a MemBarRelease and MemBarVolatile (they ignore any
intervening MemBarCPUOrder). Without the optimization this gets
translated to
dmb ish {Release}
str
dmb ish {Volatile}
With the optimization, generation of both dmbs is inhibited and it gets
translated to
stlr
With your patch the optimization fails to apply and the result is
str
dmb ish {Volatile}
which just happens to work but only because the non-optimized code is
less than optimal -- in the absence of knowledge as to where the Release
or Volatile membars have come from it has to generate two dmb
instructions for a volatile put.
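To make that shape requirement concrete, the check amounts to something
like the following (a sketch only -- preceding_membar() and
following_membar() are hypothetical helpers standing in for the real
aarch64.ad graph walk):

  // Sketch of the volatile-put shape check (subgraph A above). The
  // helpers that walk to the bracketing membars, skipping any
  // intervening MemBarCPUOrder, are hypothetical stand-ins.
  bool is_volatile_put_shape(const StoreNode* st) {
    const MemBarNode* leading  = preceding_membar(st);  // skips CPUOrder
    const MemBarNode* trailing = following_membar(st);  // skips CPUOrder
    return st->is_release()
        && leading  != NULL && leading->Opcode()  == Op_MemBarRelease
        && trailing != NULL && trailing->Opcode() == Op_MemBarVolatile;
  }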
Now, I could probably tweak the put optimization to recognize your put
subgraph B. However, I'm not at all sure that would be valid. I believe
that the presence of the MemBarRelease or MemBarAcquire is important
because there may potentially be other circumstances where a subgraph of
type B might arise.
One might simply assume that put subgraph B is fine, i.e. that the
presence of a MemBarCPUOrder and MemBarVolatile bracketing a releasing
store is all that is needed to say "yes, this is a volatile put". Well,
the first gotcha is that the is_release() test will actually return true
for certain stores to non-volatile fields: puts in constructors can be
marked as releasing. The second gotcha is that a CPUOrder membar might
just happen to turn up as the tail of some other subgraph and a Volatile
membar might also appear as the head of a third subgraph. So, if they
both happened to appear either side of the store you would have a false
positive for a volatile put.
With the get subgraph the problem is more serious. I'm not sure why the
tests pass, since a required dmb instruction is missing.
If my get optimization is switched off, both graphs A and C (also
variant D) get translated to
ldr
dmb ish
The load is generated as a normal ldr and the Acquire membar is
converted to a dmb ish without attempting to see if it can be elided.
If the get optimization is enabled the predicate for the LoadX rules
detects subgraph A or C (or D) and generates ldar.
Similarly, the predicate for the MemBarAcquire rule detects these same three
subgraphs and elides the dmb. The problem with your subgraph B is that
it contains no MemBarAcquire. So the LoadX translates to ldr but nothing
triggers the MemBarAcquire rule and no dmb ish is emitted. We end up
with a plain
ldr
which does not enforce the required memory ordering.
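Put another way, the LoadX rule and the MemBarAcquire rule make paired
decisions, roughly as follows (again a sketch with hypothetical helper
names, not the real aarch64.ad code):

  // Sketch of the paired elision decisions for a volatile get
  // (hypothetical helpers). If the load is recognized it is emitted
  // as ldar and the paired dmb is elided; with your subgraph B there
  // is no MemBarAcquire at all, so the load falls back to a plain
  // ldr and nothing ever emits the dmb.
  bool load_emits_ldar(const LoadNode* ld) {
    return is_volatile_get_shape(ld);   // shapes A, C or D
  }
  bool membar_acquire_elides_dmb(const MemBarNode* mb) {
    return is_volatile_get_shape(paired_acquiring_load(mb));
  }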
If you think there is a good reason to change the graph shape the way
you have done then I'd be interested to understand your reasoning. It
seems reasonable to me for each of the missing memory barriers to be
present. I would have said that a volatile put should be releasing and a
volatile get should be acquiring so the generated subgraphs should
include a MemBarRelease and MemBarAcquire, respectively. If this was
just an oversight then the small patch below, when applied over your
posted patch, restores both missing barriers (I added 16 lines of
context to locate it for you) and also restores the AArch64 optimization.
n.b. luckily (or so it would seem given my current possibly misguided
understanding of your plans) it turns out that the leading MemBarRelease
is not critical as an element of the CAS signature. That's because a CAS
is only ever generated by inlining. So, we can be sure that any leading
CPUOrder membar (or indeed Release membar if that occurs instead) is
only there because it was put there by the inliner. I think that's
important for implementing the various flavours of CAS. If we want a CAS
which omits either the Release or Acquire semantics then we should be
able to omit the leading Release membar or trailing Acquire membar (so
long as we retain the CPUOrder membars). I'm hoping this fact is going
to allow us to implement the optimized CAS variants on AArch64 at least.
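To illustrate what I have in mind (a sketch only, not the posted patch
-- the wants_release/wants_acquire flags are hypothetical), the CAS
flavours could be generated along these lines:

  // Sketch of barrier insertion for the CAS flavours (illustration
  // only). The CPUOrder membars are always retained; the leading
  // Release and trailing Acquire membars are emitted only for the
  // flavours that carry those semantics.
  if (wants_release) insert_mem_bar(Op_MemBarRelease); // volatile or release CAS
  insert_mem_bar(Op_MemBarCPUOrder);
  // ... generate the CompareAndSwapX node here ...
  insert_mem_bar(Op_MemBarCPUOrder);
  if (wants_acquire) insert_mem_bar(Op_MemBarAcquire); // volatile or acquire CAS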
----- 8< -------- 8< -------- 8< -------- 8< -------- 8< -------- 8< ---
diff -r e51ab854285b src/share/vm/opto/library_call.cpp
--- a/src/share/vm/opto/library_call.cpp Tue Feb 16 06:47:18 2016 -0500
+++ b/src/share/vm/opto/library_call.cpp Tue Feb 16 12:46:37 2016 -0500
@@ -2498,32 +2498,35 @@
// We need to emit leading and trailing CPU membars (see below) in
// addition to memory membars for special access modes. This is a little
// too strong, but avoids the need to insert per-alias-type
// volatile membars (for stores; compare Parse::do_put_xxx), which
// we cannot do effectively here because we probably only have a
// rough approximation of type.
switch(kind) {
case Relaxed:
case Opaque:
case Acquire:
break;
case Release:
insert_mem_bar(Op_MemBarRelease);
break;
case Volatile:
+ if (is_store) {
+ insert_mem_bar(Op_MemBarRelease);
+ }
if (!is_store && support_IRIW_for_not_multiple_copy_atomic_cpu) {
insert_mem_bar(Op_MemBarVolatile);
}
break;
default:
ShouldNotReachHere();
}
// Memory barrier to prevent normal and 'unsafe' accesses from
// bypassing each other. Happens after null checks, so the
// exception paths do not take memory state from the memory barrier,
// so there's no problems making a strong assert about mixing users
// of safe & unsafe memory.
if (need_mem_bar) insert_mem_bar(Op_MemBarCPUOrder);
assert(alias_type->adr_type() == TypeRawPtr::BOTTOM ||
alias_type->adr_type() == TypeOopPtr::BOTTOM ||
@@ -2636,32 +2639,35 @@
// Final sync IdealKit and GraphKit.
final_sync(ideal);
#undef __
}
}
}
switch(kind) {
case Relaxed:
case Opaque:
case Release:
break;
case Acquire:
insert_mem_bar(Op_MemBarAcquire);
break;
case Volatile:
+ if (!is_store) {
+ insert_mem_bar(Op_MemBarAcquire);
+ }
if (is_store && !support_IRIW_for_not_multiple_copy_atomic_cpu) {
insert_mem_bar(Op_MemBarVolatile);
}
break;
default:
ShouldNotReachHere();
}
if (need_mem_bar) insert_mem_bar(Op_MemBarCPUOrder);
return true;
}
//----------------------------inline_unsafe_load_store----------------------------
// This method serves a couple of different customers (depending on LoadStoreKind):
//