webrev to extend hsail allocation to allow gpu to refill tlab

Deneau, Tom tom.deneau at amd.com
Tue Jun 3 21:27:22 UTC 2014


I have placed a webrev up at 
  http://cr.openjdk.java.net/~tdeneau/graal-webrevs/webrev-hsail-refill-tlab-gpu 
which we would like to get checked into the graal trunk.

This webrev extends the existing hsail heap allocation logic.  In the
existing logic, when a workitem cannot allocate from the current tlab,
it just deoptimizes.  In this webrev, we add logic to allocate a new
tlab from the gpu.

The algorithm must deal with the fact that multiple hsa workitems can
share a single tlab, so multiple workitems can "overflow" at once.  A
workitem can tell whether it is the "first overflower"; the first
overflower is charged with getting a new tlab while the other
workitems wait for the new tlab to be announced.
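
Just to illustrate (this is not code from the webrev; the real
gpu-side code is generated from the snippets), one way the detection
can work, assuming the fast path bumps the shared top with an atomic
add, is sketched below.  Exactly one workitem sees its old top at or
below end while its new top is past end; that workitem is the first
overflower.

   #include <atomic>
   #include <cstddef>
   #include <cstdint>

   // Illustrative model of the shared per-tlab state; the real layout
   // on the gpu side is the HSAILTlabInfo struct shown further below.
   struct TlabInfoModel {
      std::atomic<uintptr_t> top;            // bumped atomically by every workitem
      uintptr_t              end;
      uintptr_t              last_good_top;  // written only by the first overflower
   };

   // Returns the allocated address, or 0 if this workitem overflowed and
   // must take the slow path (either refilling or waiting for a refill).
   uintptr_t try_fast_allocate(TlabInfoModel* info, size_t size,
                               bool* is_first_overflower) {
      uintptr_t old_top = info->top.fetch_add(size);
      uintptr_t new_top = old_top + size;
      if (new_top <= info->end) {
         *is_first_overflower = false;
         return old_top;                     // fits in the current tlab
      }
      *is_first_overflower = (old_top <= info->end);
      if (*is_first_overflower) {
         info->last_good_top = old_top;      // remember where usable space ends
      }
      return 0;                              // overflowed: go to the slow path
   }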

Workitems access a tlab through a fixed register (sort of like a
thread register).  Instead of pointing to a donor thread, this
register now points to an HSAILTlabInfo structure, which is roughly a
subset of a full tlab struct, containing just the fields that we
would actually use on the gpu.

   struct HSAILTlabInfo {
      HeapWord *  _start;                 // normal vm tlab fields, start, top, end, etc.
      HeapWord *  _top;
      HeapWord *  _end;
      // additional data not in a normal tlab
      HeapWord * _lastGoodTop;            // first overflower records this
      JavaThread * _donor_thread;         // donor thread associated with this tlabInfo
   };

The first overflower grabs a new tlabInfo structure, allocates a new
tlab (using edenAllocate), and "publishes" the new tlabInfo for the
other workitems to start using.  See the routine called
allocateFromTlabSlowPath in HSAILNewObjectSnippets.
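
Conceptually, and only as a sketch (eden_allocate_bytes and
g_current_tlab_info below are stand-ins, not names from the webrev;
the actual logic is the allocateFromTlabSlowPath snippet), the slow
path does something like the following.  The std::atomic
acquire/release operations here play the role that the hsail
load_acquire and store_release instructions play on the gpu.

   #include <atomic>
   #include <cstddef>
   #include <cstdint>

   struct TlabInfoSketch {
      uintptr_t start;
      uintptr_t top;
      uintptr_t end;
   };

   // The location every overflowing workitem re-reads to pick up the
   // newly published tlabInfo.
   std::atomic<TlabInfoSketch*> g_current_tlab_info;

   // Stand-in for allocating a new tlab out of eden (edenAllocate in
   // the webrev); just bumps a static buffer so the sketch is
   // self-contained.
   uintptr_t eden_allocate_bytes(size_t size) {
      static char backing[1 << 20];
      static std::atomic<size_t> used{0};
      return (uintptr_t)&backing[used.fetch_add(size)];
   }

   // First overflower: fill a fresh tlabInfo from a new eden allocation,
   // then publish it so the waiting workitems can start allocating from it.
   void refill_and_publish(TlabInfoSketch* fresh, size_t tlab_bytes) {
      fresh->start = eden_allocate_bytes(tlab_bytes);
      fresh->top   = fresh->start;
      fresh->end   = fresh->start + tlab_bytes;
      g_current_tlab_info.store(fresh, std::memory_order_release);  // announce it
   }

   // Every other overflowing workitem: spin until a tlabInfo other than
   // the one it overflowed on shows up, then retry its allocation there.
   TlabInfoSketch* wait_for_new_tlab_info(TlabInfoSketch* overflowed) {
      TlabInfoSketch* info;
      do {
         info = g_current_tlab_info.load(std::memory_order_acquire);
      } while (info == overflowed);
      return info;
   }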

Eventually, when hsail function calls are supported, this slow path
will not be inlined but will instead be called as a stub.

Other changes:

   * the allocation-related logic was moved from gpu_hsail.cpp into
     gpu_hsail_tlab.hpp.  The HSAILHotSpotNmethod now keeps track of
     whether a kernel uses allocation and skips this logic if it does
     not.

      * Before the kernel runs, the donor thread tlabs are used to set
        up the initial tlabInfo records, and a tlab allocation is done
        at this point if a donor thread's tlab is empty.

      * When the kernel has finished running, the cpu side sees a list
        of one or more HSAILTlabInfos and postprocesses them: fixing
        up any overflows, making the tlabs parsable, and copying them
        back to the appropriate donor thread as needed.  (A conceptual
        sketch of this step follows after this list.)

   * the inter-workitem communication required the use of the hsail
     load_acquire and store_release instructions from the snippets.
     The HSAILDirectLoadAcquireNode and HSAILDirectStoreReleaseNode,
     with associated NodeIntrinsics, were created to handle this.  A
     node for emitting a workitemabsid instruction was also created;
     it is not used in the algorithm itself but was useful for adding
     debug traces.

   * In HSAILHotSpotBackend, the logic that decides whether a kernel
     uses allocation was made more precise.  (This flag is also made
     available at execute time.)  Several atomic_add-related tests
     were being falsely marked as requiring HSAILAllocation (and thus
     HSAILDeoptimization support); this marking was removed.
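
As a rough, hypothetical sketch of the cpu-side postprocessing
mentioned above (the real code is in gpu_hsail_tlab.hpp; the
fill_with_dead_object and copy-back helpers here are just stand-ins
for the corresponding VM services), the handling of one HSAILTlabInfo
might look like this:

   #include <cstdint>

   typedef uintptr_t HeapWordAddr;     // stand-in for HeapWord*

   // Mirrors the HSAILTlabInfo fields the postprocess cares about.
   struct TlabInfoResult {
      HeapWordAddr start;
      HeapWordAddr top;                // may have been bumped past 'end'
      HeapWordAddr end;
      HeapWordAddr last_good_top;      // non-zero only if an overflow happened
   };

   // Stand-ins for VM services: format the unused tail as a filler
   // object so the heap stays parsable, and hand the tlab range back
   // to the donor thread.
   void fill_with_dead_object(HeapWordAddr from, HeapWordAddr to);
   void copy_back_to_donor_thread(void* donor_thread,
                                  const TlabInfoResult& info);

   // Conceptual postprocess for one HSAILTlabInfo after the kernel
   // finishes.
   void postprocess_tlab_info(void* donor_thread, TlabInfoResult* info) {
      if (info->last_good_top != 0) {
         // Overflowing workitems bumped 'top' past 'end'; the last
         // usable word is what the first overflower recorded.
         info->top = info->last_good_top;
      }
      // Make the unused remainder of the tlab walkable, then give the
      // (now parsable) tlab back to the thread that donated it.
      fill_with_dead_object(info->top, info->end);
      copy_back_to_donor_thread(donor_thread, *info);
   }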

-- Tom Deneau


