RFR (XS): Optimize branch frequency of G1's write post-barrier in C2

Thu Jul 11 09:38:30 UTC 2019

Hi,

On Thu, 2019-06-13 at 22:13 -0700, Man Cao wrote:
> Hi all,
> Can I have reviews for this improvement for G1's write post-barrier?  
> More details are stated on the JBS page. Highlight: it reduces CPU-
> cost-per-query by 1% for Google search frontend's production
> workload.
> Webrev: https://cr.openjdk.java.net/~manc/8225776/webrev.00/
> RFE: https://bugs.openjdk.java.net/browse/JDK-8225776
> 
> Some notes:
> The RFE could be a duplicate of JDK-8130918. However, this patch does
> not improve the performance of the microbenchmark in JDK-8130918. I'm
> not sure if this patch fully addresses JDK-8130918.
> Chuck Rasbold helped me to figure out the proper fix for the basic
> block ordering by looking at the CFG before and after C2's
> PhaseBlockLayout. The out-most if branch (xor) has to have a
> frequency greater than 0.5 to make the  BBs laid out correctly. It is
> also more conventional to use PROB_LIKELY_MAG than PROB_LIKELY in C2.

- the changes do what you would expect, laying out the xor/null check
in the fast path and the rest in the slow path. I also think the use of
PROB_LIKELY_MAG is okay and equivalent to the original code.

CC'ing compiler-dev to look over it again just in case.

- can you share the code changes to generate the statistics? It would
be nice to confirm these on a few more applications and play around
with them a bit :)

I would like to confirm some very old numbers we have for other older
benchmarks that this is indeed the best probabibility distribution.
Particularly I do not understand that from these numbers we did not
change the probabilities as you suggested :( There were other changes
mostly related to barrier elision in that time frame, but it seems
likelihood changes were not attempted.

- these numbers (and yours) also indicate that the not-young check is
very likely to be not taken (i.e. you jump over the storeload). Did you
also perform some experiments changing the order a bit?

It might be detrimental for this particular case where the StoreLoad is
expensive, and the xor/non-null filter out at least some additional of
those, but maybe

if (young) -> exit
if (different-region) -> exit
if (non-null) -> exit
StoreLoad
...

may be better to do? I am aware that the "young" check adds a load,
which is also expensive (but not as much as the StoreLoad), but it
seems to be an interesting case to look at.

In our old results (as far as I can interpret them) it did not seem to
have any advantage/disadvantage, so I am just curious whether you did
such tests and their conclusion.

- internal (quick) perf testing showed no overall score changes, except
that maxJOPS on SpecJBB2015 seemed to improve by ~1.2% (only had time
for very few experiments at this time, will rerun, so there is some
chance that this has been a fluke) which is definitely nice.

So, looks good to me :)

Thanks,
  Thomas