From tom.deneau at amd.com  Mon Sep 16 10:01:34 2013
From: tom.deneau at amd.com (Deneau, Tom)
Date: Mon, 16 Sep 2013 17:01:34 +0000
Subject: Question about mark words in allocated objects
Message-ID: <BC97738F8E7C8742BABED7F06FB9DF9149A7BDB2@SATLEXDAG01.amd.com>

We are experimenting with an HSA device doing object allocation.

In our prototype, one or more idle Java threads act as "donor threads", in that each donates its TLAB.
In the simple case below, there is only one donor thread.

Since there can be more HSA device workitems than there are donor threads, the workitems use atomic operations to bump the tlab.top pointer of a donor thread. The usual Graal code then takes over to fill in the contents of the allocated item, including the mark word, class pointer, etc.
In our usual JUnit test cases, we run the "kernel function" sequentially on the CPU and then as an HSA kernel on the GPU and compare results. This is a "--vm server" run, so the CPU side is not going through Graal.
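
As a rough illustration of the atomic bump described above (a sketch only, not the prototype's code): the class DonorTlabSketch and its methods are hypothetical, plain offsets stand in for heap addresses, and a compare-and-swap loop is assumed where the real workitems may use a different atomic operation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a donor thread's TLAB; offsets stand in for real heap addresses.
final class DonorTlabSketch {
    private final AtomicLong top; // next free offset, shared by all workitems
    private final long end;       // first offset past the end of the buffer

    DonorTlabSketch(long start, long end) {
        this.top = new AtomicLong(start);
        this.end = end;
    }

    /** Returns the start offset of the newly claimed block, or -1 if it does not fit. */
    long allocate(long sizeInBytes) {
        while (true) {
            long oldTop = top.get();
            long newTop = oldTop + sizeInBytes;
            if (newTop > end) {
                return -1; // buffer exhausted; a real allocator would need a slow path here
            }
            if (top.compareAndSet(oldTop, newTop)) {
                return oldTop; // this caller now owns [oldTop, newTop)
            }
            // Another workitem moved top first; reload and retry.
        }
    }

    long top() {
        return top.get();
    }

    public static void main(String[] args) throws InterruptedException {
        DonorTlabSketch tlab = new DonorTlabSketch(0, 1024);
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 4; j++) {
                    tlab.allocate(16);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("top after all allocations: " + tlab.top());
    }
}
```

Each successful caller gets a disjoint block, so the header and field initialization that follows can proceed without further synchronization on the buffer.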

In my test case, I am using a JNI function to print out the full object contents including the header after the "kernel" completes.  The atomic pointer bumps seem to work correctly but I've noticed a slight difference in the header contents.

Since we are running with the default UseBiasedLocking enabled, I can see that when the object is initialized from the "prototype mark word", it is initialized with the value 5 (I assume this means anonymously biased?).

I have noticed that when we run normal sequential Java and print out the mark word after the kernel has run, the mark word has a value of 1 (unlocked). But if the kernel has run on the GPU and we print out the mark word, it is still 5 in each object. The rest of the object contents match between the CPU and GPU runs.

Where does the mark word get changed from 5 to 1 on the CPU side?

-- Tom


From doug.simon at oracle.com  Mon Sep 16 11:35:21 2013
From: doug.simon at oracle.com (Doug Simon)
Date: Mon, 16 Sep 2013 20:35:21 +0200
Subject: Question about mark words in allocated objects
In-Reply-To: <BC97738F8E7C8742BABED7F06FB9DF9149A7BDB2@SATLEXDAG01.amd.com>
References: <BC97738F8E7C8742BABED7F06FB9DF9149A7BDB2@SATLEXDAG01.amd.com>
Message-ID: <CEC6BEDC-DD2C-4B00-A709-1EDEB38702BB@oracle.com>


On Sep 16, 2013, at 7:01 PM, "Deneau, Tom" <tom.deneau at amd.com> wrote:

> We are experimenting with an HSA device doing object allocation.
> 
> In our prototype, one or more idle Java threads act as "donor threads", in that each donates its TLAB.
> In the simple case below, there is only one donor thread.
> 
> Since there can be more HSA device workitems than there are donor threads, the workitems use atomic operations to bump the tlab.top pointer of a donor thread. The usual Graal code then takes over to fill in the contents of the allocated item, including the mark word, class pointer, etc.
> In our usual JUnit test cases, we run the "kernel function" sequentially on the CPU and then as an HSA kernel on the GPU and compare results. This is a "--vm server" run, so the CPU side is not going through Graal.
> 
> In my test case, I am using a JNI function to print out the full object contents including the header after the "kernel" completes.  The atomic pointer bumps seem to work correctly but I've noticed a slight difference in the header contents.
> 
> Since we are running with the default UseBiasedLocking enabled, I can see that when the object is initialized from the "prototype mark word", it is initialized with the value 5 (I assume this means anonymously biased?).

It means biasable but unlocked (see HotSpotReplacementsUtil.biasedLockPattern() and its usage in MonitorSnippets).
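
For reference, the two values can be told apart by looking at the low three bits of the mark word. The sketch below is illustrative only: the constant values mirror the lock-bit encoding in HotSpot's markOop.hpp for this generation of the VM, while the class MarkWordBits and the method describe are made-up names.

```java
// Hypothetical helper that decodes the low bits of a HotSpot mark word.
final class MarkWordBits {
    static final long LOCK_MASK = 0b011;           // two lock bits
    static final long BIASED_LOCK_MASK = 0b111;    // lock bits plus the biased-lock bit
    static final long BIASED_LOCK_PATTERN = 0b101; // value 5: biasable

    static String describe(long markWord) {
        // Check the biased pattern first, since it shares its low two bits with "unlocked".
        if ((markWord & BIASED_LOCK_MASK) == BIASED_LOCK_PATTERN) {
            return "biasable";
        }
        switch ((int) (markWord & LOCK_MASK)) {
            case 0:
                return "lightweight locked";
            case 1:
                return "unlocked";
            case 2:
                return "inflated monitor";
            default:
                return "marked by GC";
        }
    }

    public static void main(String[] args) {
        System.out.println("5 -> " + describe(5));
        System.out.println("1 -> " + describe(1));
    }
}
```

A header value of 5 therefore differs from 1 only in the biased-lock bit.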

> I have noticed that when we run normal sequential Java and print out the mark word after the kernel has run, the mark word has a value of 1 (unlocked). But if the kernel has run on the GPU and we print out the mark word, it is still 5 in each object. The rest of the object contents match between the CPU and GPU runs.
> 
> Where does the mark word get changed from 5 to 1 on the CPU side?

This is almost certainly due to the delayed initialization of biased locking - see BiasedLocking::init (in biasedLocking.cpp). Try adding -XX:BiasedLockingStartupDelay=0 as a VM option.

-Doug

From tom.deneau at amd.com  Mon Sep 16 11:39:37 2013
From: tom.deneau at amd.com (Deneau, Tom)
Date: Mon, 16 Sep 2013 18:39:37 +0000
Subject: Question about mark words in allocated objects
In-Reply-To: <CEC6BEDC-DD2C-4B00-A709-1EDEB38702BB@oracle.com>
References: <BC97738F8E7C8742BABED7F06FB9DF9149A7BDB2@SATLEXDAG01.amd.com>
	<CEC6BEDC-DD2C-4B00-A709-1EDEB38702BB@oracle.com>
Message-ID: <BC97738F8E7C8742BABED7F06FB9DF9149A7BF22@SATLEXDAG01.amd.com>

Doug --

Yes, that was it.

-- Tom

