RFR: 8252505: C1/C2 compiler support for blackholes [v11]

Tue Dec 1 12:55:01 UTC 2020

On Tue, 1 Dec 2020 08:57:10 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> JMH uses the [`Blackhole::consume`](https://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l153) methods to avoid dead-code elimination of the code that produces benchmark values. It now relies on producing opaque side-effects and breaking inlining. While it was proved useful for many years, it unfortunately comes with several major drawbacks:
>>  
>>   1. Call costs dominate nanobenchmarks. On TR 3970X, the call cost is several nanoseconds.
>>   2. The work spent in Blackhole.consume dominates nanobenchmarks too. It takes about a nanosecond on TR 3970X.
>>   3. Argument preparation for call makes different argument types behave differently. This is prominent on architectures where calling conventions for passing e.g. floating-point arguments require elaborate dance.
>> 
>> Supporting this directly in compilers would improve nanobenchmark fidelity.
>> 
>> Instead of introducing public APIs or special-casing JMH methods in JVM, we can hook a new command to compiler control, and let JMH sign up its Blackhole methods for it with `-XX:CompileCommand=blackhole,org.openjdk.jmh.infra.Blackhole::consume`. This is being prototyped as [CODETOOLS-7902762](https://bugs.openjdk.java.net/browse/CODETOOLS-7902762). It makes Blackholes behave [substantially better](http://cr.openjdk.java.net/~shade/8252505/bh-old-vs-new.png).
>> 
>> Current prototype is for initial approach review and early testing. I am open for suggestions how to make it simpler; not that I haven't tried, but it is likely there is something I am overlooking here.
>> 
>> C1 code is platform-independent, and it adds the new node which is then lowered to nothing.
>> 
>> C2 code is more complicated. There were four attempts to implement this, and what you see in the PR is the fourth attempt.
>> 
>> [First attempt](http://cr.openjdk.java.net/~shade/8252505/webrev.01/) was to introduce fake store like `StoreV` ("store void"), and then lower them to nothing. It runs into a series of funky problems: you would like to have at least two shapes of the store to match the store type width not to confuse the optimizer, or even have the whole mirror of `Store*` hierarchy. Additionally, optimizer tweaks were needed. The awkward problem of GC barrier verification showed up: if `StoreV*` is a subclass of `Store*`, then verificators rightfully expect GC barriers before them. If we emit GC, then we need to handle walking over `StoreV*` nodes in optimizers. 
>> 
>> [Second attempt](http://cr.openjdk.java.net/~shade/8252505/webrev.04/) was to introduce the special `Blackhole` node that consumes the values -- basically like C1 implementation does it. Alas, the many attempts to make sure the new node is not DCE'd failed. Roland looked at it, and suggested that there seems to be no way to model the effects we are after: consume the value, but have no observable side effects. So, suggested we instead put the boolean flag onto `CallJavaNode`, and then match it to nothing in `.ad`. 
>> 
>> ...which is the essence of the third attempt. Drag the blackhole through C2 as if it has call-like side effects, and then emit nothing. Instead of boolean flag, the subsequent iteration introduced a full new `CallBlackhole` node, that is a call as far as optimizers are concerned, and then it is matched to nothing in `.ad`. This seems to require the least fiddling with C2 internals.
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Cleanup BlackholeCallGenerator

Yeah, it shapes really well! �� Looks much better now.

> The only suggestion I failed to follow up on is constructing `MachCallBlackhole` directly in `Matcher`, thus trying to avoid `.ad` changes. I could not find a good place in `Matcher` to do that. I think having a single match rule per `.ad` is still tolerable.

`Matcher::match_sfpt()` has call-specific logic for different shapes. Having a special case for Blackhole which turns `CallBlackhole` into `MachCallBlackhole` fits it well. 

Since every AD instruction has a dedicated class, you can get rid of some additional changes if you customize matching logic for blackhole.

Some minor comments follow.

src/hotspot/share/c1/c1_GraphBuilder.cpp line 3469:

> 3467:   Values* args = state()->pop_arguments(callee->arg_size());
> 3468: 
> 3469:   // Blackhole everything except the receiver itself

Does it make sense to limit blackhole candidates to static methods only? 
Otherwise, you end up with a null check in the generated code.

src/hotspot/share/c1/c1_InstructionPrinter.cpp line 865:

> 863: 
> 864: void InstructionPrinter::do_Blackhole(Blackhole* x) {
> 865:   if (x->v() != NULL) {

Should be an assert instead? (LIRGenerator::do_Blackhole doesn't expect NULL.)

src/hotspot/share/opto/doCall.cpp line 115:

> 113: 
> 114:   // Try blackholing a method
> 115:   if (callee->is_loaded() &&

The check can be turned into a property on `ciMethod` and the result cached there.

src/hotspot/cpu/aarch64/aarch64.ad line 1778:

> 1776: }
> 1777: 
> 1778: int MachCallBlackholeNode::ret_addr_offset() {

It can be moved into shared code (`machnode.cpp`).

src/hotspot/share/c1/c1_GraphBuilder.cpp line 2032:

> 2030:   if (target->is_loaded() && target->return_type()->basic_type() == T_VOID &&
> 2031:       compilation()->directive()->should_blackhole(target)) {
> 2032:     if (try_blackhole(target)) {

Since `try_blackhole` can't fail, why do the check at all?

src/hotspot/share/opto/machnode.cpp line 817:

> 815: void MachCallBlackholeNode::dump_spec(outputStream *st) const {
> 816:   st->print("Blackhole ");
> 817:   MachCallNode::dump_spec(st);

How does the output look like? It would be nice to have the info what are the values being consumed and where they are located.

src/hotspot/share/opto/callnode.hpp line 780:

> 778:   virtual uint size_of() const; // Size is bigger
> 779: public:
> 780:   CallBlackholeNode( const TypeFunc *tf , address addr) : CallNode(tf, addr, TypePtr::BOTTOM) {

`addr` can be omitted since it is always `NULL`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1203