From tom.deneau at amd.com  Mon Sep 16 10:01:34 2013
From: tom.deneau at amd.com (Deneau, Tom)
Date: Mon, 16 Sep 2013 17:01:34 +0000
Subject: Question about mark words in allocated objects
Message-ID:

We are experimenting with an HSA device doing object allocation.

In our prototype, there are one or more idle Java threads acting as "donor threads", in that each donates its TLAB. In the simple case below, there is only one donor thread.

Since there can be more HSA device workitems than there are donor threads, the workitems use atomic operations to bump the tlab.top pointer of a donor thread. The usual Graal code then takes over to fill in the contents of the allocated item, including the mark word, class pointer, etc.

In our usual JUnit test cases, we run the "kernel function" sequentially on the CPU and then as an HSA kernel on the GPU and compare results. This is a "--vm server" run, so the CPU side is not going through Graal.

In my test case, I am using a JNI function to print out the full object contents, including the header, after the "kernel" completes. The atomic pointer bumps seem to work correctly, but I've noticed a slight difference in the header contents.

Since we are running with the default UseBiasedLocking enabled, I can see that when the object is initialized from the "prototype mark word", it is initialized with the value 5 (I assume this means anonymously biased?).

I have noticed that when we run normal sequential Java and print out the mark word after the kernel has run, the mark word has a value of 1 (unlocked). But if the kernel has run on the GPU and we print out the mark word, it is still 5 in each object. The rest of the object contents match between the CPU and GPU runs.

Where does the mark word get changed from 5 to 1 on the CPU side?

-- Tom

From doug.simon at oracle.com  Mon Sep 16 11:35:21 2013
From: doug.simon at oracle.com (Doug Simon)
Date: Mon, 16 Sep 2013 20:35:21 +0200
Subject: Question about mark words in allocated objects
In-Reply-To:
References:
Message-ID:

On Sep 16, 2013, at 7:01 PM, "Deneau, Tom" wrote:

> We are experimenting with an HSA device doing object allocation.
>
> In our prototype, there are one or more idle Java threads acting as "donor threads", in that each donates its TLAB.
> In the simple case below, there is only one donor thread.
>
> Since there can be more HSA device workitems than there are donor threads, the workitems use atomic operations to bump the tlab.top pointer of a donor thread. The usual Graal code then takes over to fill in the contents of the allocated item, including the mark word, class pointer, etc.
> In our usual JUnit test cases, we run the "kernel function" sequentially on the CPU and then as an HSA kernel on the GPU and compare results. This is a "--vm server" run, so the CPU side is not going through Graal.
>
> In my test case, I am using a JNI function to print out the full object contents, including the header, after the "kernel" completes. The atomic pointer bumps seem to work correctly, but I've noticed a slight difference in the header contents.
>
> Since we are running with the default UseBiasedLocking enabled, I can see that when the object is initialized from the "prototype mark word", it is initialized with the value 5 (I assume this means anonymously biased?).

It means biasable but unlocked (see HotSpotReplacementsUtil.biasedLockPattern() and its usage in MonitorSnippets).

> I have noticed that when we run normal sequential Java and print out the mark word after the kernel has run, the mark word has a value of 1 (unlocked). But if the kernel has run on the GPU and we print out the mark word, it is still 5 in each object. The rest of the object contents match between the CPU and GPU runs.
>
> Where does the mark word get changed from 5 to 1 on the CPU side?

This is almost certainly due to the delayed initialization of biased locking - see BiasedLocking::init (in biasedLocking.cpp). Try adding -XX:BiasedLockingStartupDelay=0 as a VM option.

-Doug

From tom.deneau at amd.com  Mon Sep 16 11:39:37 2013
From: tom.deneau at amd.com (Deneau, Tom)
Date: Mon, 16 Sep 2013 18:39:37 +0000
Subject: Question about mark words in allocated objects
In-Reply-To:
References:
Message-ID:

Doug --

Yes, that was it.

-- Tom

-----Original Message-----
From: Doug Simon [mailto:doug.simon at oracle.com]
Sent: Monday, September 16, 2013 1:35 PM
To: Deneau, Tom
Cc: graal-dev at openjdk.java.net; sumatra-dev at openjdk.java.net
Subject: Re: Question about mark words in allocated objects
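
For reference, the two header values discussed in this thread (5 and 1) differ only in the biased_lock bit. The following is a minimal sketch, not HotSpot or Graal code: the class name MarkWordBits and the method describe() are made up for illustration, and the constants assume the low-order bit layout described in HotSpot's markOop.hpp (two lock bits, then one biased_lock bit).

```java
final class MarkWordBits {
    // Assumed layout of the low-order mark word bits (per markOop.hpp):
    // bits 0-1 are the lock bits, bit 2 is the biased_lock bit.
    static final long LOCK_MASK = 0b011;
    static final long BIASED_LOCK_MASK = 0b111;

    static final long LOCKED_VALUE = 0b00;          // lightweight (stack) locked
    static final long UNLOCKED_VALUE = 0b01;        // 1: unlocked, not biasable
    static final long MONITOR_VALUE = 0b10;         // inflated monitor
    static final long MARKED_VALUE = 0b11;          // marked by the GC
    static final long BIASED_LOCK_PATTERN = 0b101;  // 5: biasable

    // Hypothetical helper: classify a mark word by its low three bits.
    static String describe(long markWord) {
        if ((markWord & BIASED_LOCK_MASK) == BIASED_LOCK_PATTERN) {
            return "biasable";
        }
        long lockBits = markWord & LOCK_MASK;
        if (lockBits == UNLOCKED_VALUE) {
            return "unlocked";
        }
        if (lockBits == LOCKED_VALUE) {
            return "locked";
        }
        if (lockBits == MONITOR_VALUE) {
            return "monitor";
        }
        return "marked";
    }

    public static void main(String[] args) {
        // The value seen in the GPU-allocated objects.
        System.out.println("5 -> " + describe(5L)); // prints "5 -> biasable"
        // The value seen in the CPU-allocated objects.
        System.out.println("1 -> " + describe(1L)); // prints "1 -> unlocked"
    }
}
```

Under these assumptions, a header of 1 on the CPU side and 5 on the GPU side is consistent with the CPU objects having been allocated before biased locking was switched on, which is what -XX:BiasedLockingStartupDelay=0 removes.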