RFR (S): Single-element buffer in thread-local taskqueues

Tue Nov 8 22:01:07 UTC 2016

Hi,

In our work-stealing strategy, we are falling victim to yet another
non-optimality in division strategy. The best way to do parallel work is
to fork out the tasks, submit everything but one task to the queue,
recursively execute the remaining task. Pushing the last task is
redundant, because we will pop it out on the next cycle. (Actually, as
array stride example taught us, this can even set us up for thrashing
the queue, if somebody stole the task under our feet).

Now, rewriting the closure-heavy code to this pattern would be not easy,
but we can solve this in TaskQueues themselves, enter here:
  http://cr.openjdk.java.net/~shade/shenandoah/taskqueue-buff/webrev.01
 (includes the clean_queues fix, to be pushed separately)

It does the single-entry buffer before the queue, which acts almost like
the local variable for us to pull from. It does not make sense to buffer
more than one task, because it may impede work stealing. Buffering a
single task is completely safe, because we are definitely are going back
for it, and nobody else should steal.

Testing: hotspot_gc_shenandoah, jcstress-all (quick), microbenchmarks

Trims down the mark times for full GC (again, conc GC mark would also
benefit from this, once we solve other bottlenecks there).

== Baseline

Full GC Times         = (avg =  7344.99 ms)
  Mark                = (avg =  2802.24 ms)
    Drain Queues      = (avg =  2800.25 ms)
    Weak References   = (avg =     0.27 ms)
    Class Unloading   = (avg =     0.82 ms)
  Calculate Addresses = (avg =   824.15 ms)
  Adjust Pointers     = (avg =  1813.09 ms)
  Copy Objects        = (avg =  1889.87 ms)

== Patched

Full GC Times         = (avg =  6856.22 ms)
  Mark                = (avg =  2338.00 ms)
    Drain Queues      = (avg =  2336.00 ms)
    Weak References   = (avg =     0.27 ms)
    Class Unloading   = (avg =     0.80 ms)
  Calculate Addresses = (avg =   819.28 ms)
  Adjust Pointers     = (avg =  1806.19 ms)
  Copy Objects        = (avg =  1877.28 ms)

Good to go?

Thanks,
-Aleksey