RFR: 8250597: G1: Improve inlining around trim_queue

Wed Aug 5 00:26:30 UTC 2020

Please review this change to G1ParScanThreadState to improve inlining
management.  This change doesn't seem to make any measurable performance
difference by itself.  However, without this, other changes being developed
may change gcc's inlining decisions in ways that often hurt performance,
mostly by heuristically cutting off inlining.

The primary inlining boundary is now trim_queue_to_threshold, which is the
main driver loop for task processing.  Previously the boundary was at
copy_to_survivor_space, which is called per-task.

gcc was not performing some of the requested inlining either before or after
this refactoring.  In order to get the requested inlining we're now using
gcc's "flatten" function attribute (via the new ATTRIBUTE_FLATTEN macro),
applied to the functions on the desired inlining boundary.  (The flatten
attribute is also supported by clang.  Unfortunately, I didn't find a
similar feature for MSVC.  Perhaps #pragma inline_depth might help?)
This approach is preferable to marking all called functions FORCEINLINE, or
guessing at and maintaining changes to the inlining heuristic parameters.
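
As a rough illustration of the flatten approach (not the code in the webrev),
here is a minimal standalone sketch.  The macro definition shown is an
assumption about how ATTRIBUTE_FLATTEN might be spelled, and ToyScanState,
dispatch_task and processed_sum are hypothetical stand-ins, not HotSpot names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed spelling of the macro: expands to the flatten attribute on
// gcc/clang and to nothing elsewhere (e.g. MSVC).
#if defined(__GNUC__) || defined(__clang__)
#define ATTRIBUTE_FLATTEN __attribute__((flatten))
#else
#define ATTRIBUTE_FLATTEN
#endif

// Hypothetical toy stand-in for a per-thread scan state.
struct ToyScanState {
  std::vector<int> queue;
  long processed_sum = 0;

  // Small per-task helper that we want inlined into the driver loop.
  inline void dispatch_task(int task) { processed_sum += task; }

  // The inlining boundary: flatten asks the compiler to inline, where it
  // can, the calls made from this function's body.
  ATTRIBUTE_FLATTEN
  void trim_queue_to_threshold(std::size_t threshold) {
    while (queue.size() > threshold) {
      int task = queue.back();
      queue.pop_back();
      dispatch_task(task);
    }
    // The loop owns completion, so callers can assert rather than re-check.
    assert(queue.size() <= threshold);
  }
};
```

The attribute goes only on the boundary function; the helpers it calls need
no FORCEINLINE annotations of their own.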

This change also adds NOINLINE attributes to some slow-path functions
within that inlining boundary, to control code size expansion.
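
A small sketch of how NOINLINE can cut a slow path out of a flattened
function follows.  The macro definitions are assumptions for the sake of a
standalone example, and ToyBuffer, refill, allocate and allocate_all are
hypothetical names rather than anything in the change.

```cpp
#include <cassert>
#include <cstddef>

// Assumed macro spellings, for illustration only.
#if defined(__GNUC__) || defined(__clang__)
#define ATTRIBUTE_FLATTEN __attribute__((flatten))
#define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#define ATTRIBUTE_FLATTEN
#define NOINLINE __declspec(noinline)
#else
#define ATTRIBUTE_FLATTEN
#define NOINLINE
#endif

// Hypothetical toy allocation buffer with a fast path and a slow path.
struct ToyBuffer {
  static constexpr std::size_t kCapacity = 8;
  std::size_t used = 0;
  int refills = 0;

  // Slow path: rarely taken, so keep it out of line to limit code size.
  NOINLINE void refill() {
    used = 0;
    ++refills;
  }

  // Fast path: a compare and an add, cheap enough to inline everywhere.
  inline void allocate(std::size_t words) {
    assert(words <= kCapacity);
    if (used + words > kCapacity) {
      refill();
    }
    used += words;
  }

  // Inlining boundary: allocate() can be inlined here, while the noinline
  // attribute on refill() asks the compiler to keep it as a real call.
  ATTRIBUTE_FLATTEN
  void allocate_all(std::size_t count, std::size_t words) {
    for (std::size_t i = 0; i < count; ++i) {
      allocate(words);
    }
  }
};
```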

Because trim_queue_to_threshold has moved to the .cpp file, a number of
(inline) functions that were in the .inline.hpp file are now only reachable
from the .cpp file.  They have also been moved to the .cpp file, though they
are still declared inline.

Rather than moving the inline copy_to_survivor_space to the .inline.hpp
file, we now have a public not-inline version (for use by existing closure
clients) and a private inline version, used by the public function and used
internally by the queue processing.
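
The public/private split can be sketched roughly as below.  This is a
simplified single-file toy under my own assumptions: the int "addresses",
the drain function, the copied() accessor and the ToyScanState class are
hypothetical, and the real functions have different signatures.

```cpp
#include <cassert>

// Hypothetical toy stand-in for the scan state class.
class ToyScanState {
public:
  // Public, not-inline entry point for external (closure) clients.
  int copy_to_survivor_space(int old_addr);

  // Internal queue processing, which uses the inline version directly.
  int drain(const int* tasks, int count);

  int copied() const { return _copied; }

private:
  // Private inline worker shared by both callers above.
  inline int do_copy_to_survivor_space(int old_addr);

  int _copied = 0;
};

// Defined before its callers, in the same translation unit, so the
// compiler can inline it into both of them.
inline int ToyScanState::do_copy_to_survivor_space(int old_addr) {
  ++_copied;
  return old_addr + 1000;  // pretend forwarding address
}

// Thin out-of-line wrapper; external callers pay one call here.
int ToyScanState::copy_to_survivor_space(int old_addr) {
  return do_copy_to_survivor_space(old_addr);
}

// Processing loop calls the inline worker with no wrapper in between.
int ToyScanState::drain(const int* tasks, int count) {
  assert(count >= 0);
  int last = 0;
  for (int i = 0; i < count; ++i) {
    last = do_copy_to_survivor_space(tasks[i]);
  }
  return last;
}
```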

The combination of flatten and noinline attributes gets us the inlining we
are asking for.  With this change, once inside trim_queue_to_threshold, the
fast path for task processing is largely inlined.  And it should stay that
way even as we make other changes in this area.

To make the review easier, this change has been broken into two parts.

- open.00 mostly consists of moving code between files, with very few
  changes to the function bodies:
  -- do_oop_evac now calls the new inline do_copy_to_survivor_space.
  -- trim_queue_to_threshold contains the entire processing loop, rather
     than callers checking for completion after calling it.
  -- trim_queue and trim_queue_partially assert requested trimming is
     complete, rather than checking and looping.
  It also defines and uses ATTRIBUTE_FLATTEN.

- open.01.inc does some refactoring to cut off inlining on some slow paths.

- open.01 is the complete change.

CR:
https://bugs.openjdk.java.net/browse/JDK-8250597

Webrev:
https://cr.openjdk.java.net/~kbarrett/8250597/open.00/
https://cr.openjdk.java.net/~kbarrett/8250597/open.01/
https://cr.openjdk.java.net/~kbarrett/8250597/open.01.inc/

Testing:
mach5 tier1-5
various performance tests, finding no significant changes.