First cut at a card table for Shenandoah

Mon Jul 27 23:08:57 UTC 2020

Charlie,

This is highly appreciated. You pinpointed my mistake: I did not check all facets of inheritance here. And yes, Op_CastP2X is implicated in the superclass. Great progress. I'll check the other inheritance paths.

Many thanks!

Bernd

On 7/27/20, 2:58 PM, "Charlie Gracie" <Charlie.Gracie at microsoft.com> wrote:

    Hi Bernd,

    I applied your patch locally to experiment. With a release build I got
    wildly varying performance results that were not consistent from one run to the
    next. With a fastdebug build I hit this assertion 100% of the time when
    running some DaCapo benchmarks:

    #  Internal Error (../../src/hotspot/share/opto/node.cpp:268), pid=28283, tid=16131
    #  assert((int)num_edges > 0) failed: need non-zero edge count for loop progress

    When I ran with -XX:-EliminateAllocations, the assertion went away and, as you mentioned,
    performance stabilized. Looking at your code changes, I noticed you made
    ShenandoahBarrierSetC2 a subclass of CardTableBarrierSetC2. When an object is scalar
    replaced (-XX:+EliminateAllocations), the GC barriers that act directly on the object
    are removed by the `eliminate_gc_barrier` calls. ShenandoahBarrierSetC2 already had
    its own implementation of `eliminate_gc_barrier`, so the superclass implementation in
    CardTableBarrierSetC2 was never being called. I modified the Shenandoah implementation
    as follows, which resolved the performance and assertion issues for me.

    void ShenandoahBarrierSetC2::eliminate_gc_barrier(PhaseMacroExpand* macro, Node* n) const {
      if (is_shenandoah_wb_pre_call(n)) {
        shenandoah_eliminate_wb_pre(n, &macro->igvn());
      }
      if (n->Opcode() == Op_CastP2X) {
        CardTableBarrierSetC2::eliminate_gc_barrier(macro, n);
      }
    }

    I believe a few other APIs would also need to delegate to the superclass implementation,
    but this was the only change I needed for my runs to complete successfully.
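
For illustration, the override pitfall described above can be reduced to a few lines of standalone C++. This is only a sketch: `Node`, `CardTableBarriers`, and `ShenandoahLikeBarriers` are hypothetical stand-ins, not the HotSpot classes, and the string-based node kinds are an assumption made to keep the example self-contained. It shows that a subclass override replaces the base behaviour entirely unless it delegates explicitly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a C2 IR node; not a HotSpot type.
struct Node {
  std::string kind;
};

// Hypothetical base class playing the role of the card-table barrier set.
class CardTableBarriers {
public:
  virtual ~CardTableBarriers() = default;

  // Records that the card-table barrier attached to `n` was removed.
  virtual void eliminate_gc_barrier(const Node& n,
                                    std::vector<std::string>& log) const {
    log.push_back("card-table barrier removed for " + n.kind);
  }
};

// Hypothetical subclass with its own override. Without the explicit
// delegation in the second branch, the base implementation would never
// run for objects of this type.
class ShenandoahLikeBarriers : public CardTableBarriers {
public:
  void eliminate_gc_barrier(const Node& n,
                            std::vector<std::string>& log) const override {
    if (n.kind == "pre_call") {
      log.push_back("pre-barrier removed for " + n.kind);
    }
    if (n.kind == "cast_p2x") {
      // Explicitly hand card-table nodes to the base class.
      CardTableBarriers::eliminate_gc_barrier(n, log);
    }
  }
};
```

Any other virtual function that both classes implement would presumably need the same kind of explicit delegation, which matches the caveat about further APIs.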

    Cheers,
    Charlie Gracie

    On 2020-07-27, 2:02 PM, "shenandoah-dev on behalf of Mathiske, Bernd" <shenandoah-dev-retn at openjdk.java.net on behalf of mathiske at amazon.com> wrote:

        Aditya, Thomas, Roman,

        Thank you for providing these hints, which were helpful to rule out possible root causes!
        Looking at all this and at some initial profiling results,
        Volker Simonis suggested that -XX:-EliminateAllocations might help. And it does!
        When I use this flag, performance is "back to normal" in the short benchmark runs I have conducted so far.
        I'll run some more extensive tests, with repetitions, and report some numbers, soon.

        Bernd

        On 7/23/20, 4:32 AM, "shenandoah-dev on behalf of Roman Kennke" <shenandoah-dev-retn at openjdk.java.net on behalf of rkennke at redhat.com> wrote:

            On Thu, 2020-07-23 at 12:31 +0200, Thomas Schatzl wrote:
            > Hi,
            >
            > On 22.07.20 22:59, Roman Kennke wrote:
            > > I am not very familiar with all this stuff.
            > >
            > > You should check if the C2 optimizations for card-table-barriers kick
            > > in. IIRC, there was something that elides those barriers on stores into
            > > new objects altogether, which make up the majority of stores.
            > >
            >
            >    if you are talking about eliding write barriers for new objects
            > because they are "always" allocated in young gen, and no generational
            > collector is interested in young->old references, there is no such thing
            > afaik.
            >
            > No collector guarantees this "always" property: e.g. CMS may directly
            > decide to put new objects into old gen for a few reasons, and for
            > parallel (and g1) it e.g. can happen that a gc right after allocating
            > that object (when e.g. transitioning from native slow-path code) will
            > move that object into old gen. Or simply when the object is large.
            >
            > See e.g. https://bugs.openjdk.java.net/browse/JDK-8191342
            >
            > That would still require the compiler to only apply that optimization if
            > it can prove that the object is "small enough" to fit into young gen in
            > any case (it is probably easy to get conservative enough values for that
            > from somewhere).
            >
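
To make concrete what such an elision would skip, here is a toy card-marking post-write barrier in standalone C++. It is a sketch under stated assumptions, not HotSpot code: `ToyHeap`, the 512-byte card size, and the clean/dirty byte values are all chosen for illustration. Each store marks the card covering the written field, so a later collection can scan only the dirty cards. If the barrier were dropped for an object that ends up in old gen after all, that card would stay clean.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only; names and constants are assumptions, not HotSpot values.
constexpr std::size_t kCardShift = 9;   // assumed 512-byte cards
constexpr std::uint8_t kClean = 0;      // assumed encoding of a clean card
constexpr std::uint8_t kDirty = 1;      // assumed encoding of a dirty card

struct ToyHeap {
  std::size_t size_bytes;
  std::vector<std::uint8_t> cards;  // one byte per card

  explicit ToyHeap(std::size_t bytes)
      : size_bytes(bytes),
        cards((bytes + (std::size_t{1} << kCardShift) - 1) >> kCardShift,
              kClean) {}

  // Post-write barrier: mark the card that covers the updated field.
  void post_write_barrier(std::size_t field_offset) {
    assert(field_offset < size_bytes);
    cards[field_offset >> kCardShift] = kDirty;
  }

  // Number of cards a collector would have to scan.
  std::size_t dirty_count() const {
    std::size_t n = 0;
    for (std::uint8_t c : cards) {
      if (c == kDirty) ++n;
    }
    return n;
  }
};
```

Two stores that land in the same 512-byte card dirty it only once, which is why the table stays small relative to the heap.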

            Thanks Thomas for clarification! :-)

            Roman