Graal generated assembly much larger than C2

Tue Jun 4 19:18:03 UTC 2019

I think what you're seeing is the variable cost of the G1 post barrier,
triggered by the fact that the Graal compiler runs in the same Java heap
as your test case.  The G1 post barrier has a couple of tests that allow
it to skip the rest of the barrier, namely if both objects are in the
same region or if the object being stored into is in the new
generation.  These are very cheap tests.  However, if both fail then it
issues a STORE_LOAD memory barrier and checks whether the card is
already dirty.  This is much slower and seems to be what occurs here
with Graal.  This is likely because the benchmark state is allocated
once and the benchmark performs no allocation after that.  Graal
performs a lot of compilation, which allocates in that same heap and
probably pushes that object out of the new gen.
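
To make the fast and slow paths concrete, here is a toy Java model of
the checks described above.  It is only a sketch, not HotSpot code: the
class name, the region and card sizes, the card values and the
address-as-long representation are all assumptions for illustration,
and VarHandle.fullFence() merely stands in for the STORE_LOAD barrier.

```java
import java.lang.invoke.VarHandle;

/** Toy model of the G1 post-barrier filters; illustrative only, not HotSpot code. */
final class G1PostBarrierModel {
    // Assumed sizes and card values, chosen only for this sketch.
    static final int LOG_REGION_BYTES = 20; // 1 MiB regions
    static final int LOG_CARD_BYTES = 9;    // 512-byte cards
    static final byte CLEAN = 0, DIRTY = 1, YOUNG = 2;

    final byte[] cardTable;
    int slowPathCount;

    G1PostBarrierModel(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> LOG_CARD_BYTES)];
    }

    /** Cheap test 1: both addresses fall into the same region. */
    static boolean sameRegion(long fieldAddr, long newValueAddr) {
        return ((fieldAddr ^ newValueAddr) >>> LOG_REGION_BYTES) == 0;
    }

    /** Marks every card in [from, to) as belonging to the new generation. */
    void markYoung(long from, long to) {
        for (long a = from; a < to; a += 1L << LOG_CARD_BYTES) {
            cardTable[(int) (a >>> LOG_CARD_BYTES)] = YOUNG;
        }
    }

    /** Returns true if the store had to take the expensive path. */
    boolean postBarrier(long fieldAddr, long newValueAddr) {
        if (sameRegion(fieldAddr, newValueAddr)) {
            return false; // cheap exit: same region
        }
        int card = (int) (fieldAddr >>> LOG_CARD_BYTES);
        if (cardTable[card] == YOUNG) {
            return false; // cheap exit: object stored into is young
        }
        VarHandle.fullFence(); // stand-in for the STORE_LOAD barrier
        slowPathCount++;
        if (cardTable[card] != DIRTY) {
            cardTable[card] = DIRTY; // a real collector would also queue the card
        }
        return true;
    }
}
```

In this model, a store into an object that has left the new gen and
points into a different region always pays for the fence, which is the
situation the benchmark state seems to end up in under Graal.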

If you run with -XX:+UseParallelGC you'll see that Graal and C2 are
largely the same, since the standard card mark is constant cost.  You
should watch out for the GC selection: the default collector changed
from Parallel to G1 in JDK 9, so if you compare 8 and 9+ you're also
comparing different barriers.  Using libgraal will also make Graal
behave more like C2 since it's no longer using the Java heap, but we
don't have an officially supported build of libgraal for jdk13 yet.
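
One way to keep track of which collector a given run actually used is
to print the collector names from inside the JVM.  The snippet below is
a small sketch using the standard java.lang.management API; the class
name is made up for this example, and the exact name strings reported
vary by collector and JDK version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports the garbage collectors active in the running JVM. */
final class ActiveCollectors {
    /** Returns the names of the collector beans registered by this JVM. */
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Log this next to the benchmark results so runs can be compared fairly.
        System.out.println("Collectors in use: " + names());
    }
}
```

Running it once with and once without -XX:+UseParallelGC should show
different names, which is a quick sanity check that the flag took
effect.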

You could probably mitigate this by using the JMH Setup and Level
annotations to recreate the benchmark state on each iteration, so that
the object is freshly allocated when measurement starts.  Anyway,
doing micro level measurements of Object stores with G1 is going to be
very sensitive to actual object placement.
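
A rough sketch of that mitigation is below.  It assumes JMH is on the
classpath, and the class, fields and array size are hypothetical
stand-ins rather than the actual perfmark benchmark; the setOpaque
store on an array element is only there to give the barrier something
to do.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

/** Hypothetical benchmark whose state is reallocated before every iteration. */
@State(Scope.Thread)
public class FreshStateStoreBenchmark {
    private static final VarHandle ELEMENTS =
            MethodHandles.arrayElementVarHandle(Object[].class);

    Object[] slots; // stand-in for the real benchmark state
    Object value;
    int index;

    /** Runs before each iteration, so the state starts out freshly allocated. */
    @Setup(Level.Iteration)
    public void recreateState() {
        slots = new Object[1024]; // power of two, so the index can be masked
        value = new Object();
        index = 0;
    }

    @Benchmark
    public void storeOpaque() {
        // Reference store into the array; this is where the post barrier runs.
        ELEMENTS.setOpaque(slots, index, value);
        index = (index + 1) & (slots.length - 1);
    }
}
```

Whether the fresh objects stay in the new gen for the whole iteration
still depends on how much the compiler allocates in the meantime, so
this narrows the effect rather than removing it.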

tom

Carl M wrote on 5/27/19 12:45 PM:
> Hi Thomas,
> 
> I retried using 13-ea+22 (previously 11.0.2) and found that while the generated code is smaller, a lot of the flags I passed don't seem to work anymore.  The bytecode is not interleaved in the generated assembly any more (I don't know which flag sets this) and the -XX:PrintAssemblyOptions=intel flag no longer seems to work.  I also tried this on a non-Graal benchmark run, and it had the same problem.  Perhaps something is wrong in 13?
> 
> The code that was produced was a little smaller (364 -> 346 bytes).  The generated (un-annotated) assembly is here:
> https://gist.github.com/carl-mastrangelo/7f1def204ae5d6e76268cc42b8c1af45
> 
> I haven't tried limiting the benchmark to just setOpaque, since I can't see the bytecode interleaved (to confirm it's doing what I think it is).
> 
> 
>> On May 27, 2019 at 9:09 AM Thomas Wuerthinger <thomas.wuerthinger at oracle.com> wrote:
>>
>>
>> Hi Carl,
>>
>> Can you further reduce this test case to one VarHandle#setOpaque call?
>>
>> It is likely that there is a wrong intrinsification of either VarHandle#setOpaque, VarHandle#setRelease, or VarHandle#storeStoreFence. Not much to do for the compiler overall here as the code only consists of special case intrinsics.
>>
>> Also, does this also run on the latest JDK13 EA? The version of the compiler in JDK11 is not the latest one.
>>
>>
>> * thomas
>>
>>
>>> On 27 May 2019, at 00:21, Carl M <java at rkive.org> wrote:
>>>
>>> Hi,
>>>
>>> I am new to trying out Graal, and found that it was much slower for
>>> a trivial benchmark I wrote. I was surprised that it was about 40%
>>> slower, so I dumped the assembly and found it was much larger for
>>> the Graal version. I collected a dump of the assembly, but I am not
>>> much of a compiler person so I don't know what to do with it.
>>>
>>> Is this the right list to see if there is something wrong?
>>>
>>>
>>> Assembly of each compiler (and JVM flags):
>>> https://gist.github.com/carl-mastrangelo/5fe2f1f744c05ca20242cad9b5f7fb26
>>>
>>>
>>> Source code inner loop:
>>> https://github.com/carl-mastrangelo/perfmark/blob/b266d9494054b5a2d09bc62f7b234d693e2d5173/java9/src/main/java/io/perfmark/java9/VarHandleMarkHolder.java#L129