RFR (M): Optimize object/array marking with bit-stealing task encoding

Mon Jan 16 15:46:20 UTC 2017

Hi,

Our mark stack contains ObjArrayFromToTask instances, which are <oop, from, to>
tuples. For arrays, from/to describe the chunk to process. For objects, from is
always -1, indicating that no chunk is expected.

Since the HS taskqueue uses copy constructors to poll/push tasks from/to the
queue, we always copy the from/to fields, and the queue footprint always
includes them as well. This is excessive for the prevailing case of regular oop
marking. This patch is an attempt to improve the case for regular oops without
regressing parallel array processing:
  http://cr.openjdk.java.net/~shade/shenandoah/mark-objtask-regular/webrev.02/
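
To illustrate the bit-stealing idea from the subject line, here is a minimal
standalone sketch. It is not the code from the webrev: PackedTask, its 48/16
bit split, and the "chunk 0 means regular object" convention are all
assumptions made for illustration, and it assumes a platform where object
addresses fit in the low 48 bits. TupleTask only mirrors the <oop, from, to>
layout described above.

```cpp
#include <cassert>
#include <cstdint>

// Baseline layout as described above: <oop, from, to>.
// For non-array objects, from is -1 and both ints are dead weight.
struct TupleTask {
  void* obj;
  int   from;
  int   to;
};

// Hypothetical packed task (illustration only, not the webrev's class).
// Assumption: addresses fit in the low 48 bits, so the high 16 bits can be
// "stolen" for a chunk index. Chunk 0 is reserved to mean "regular object".
class PackedTask {
  static constexpr int      PTR_BITS = 48;
  static constexpr uint64_t PTR_MASK = (uint64_t(1) << PTR_BITS) - 1;

  uint64_t _bits;

public:
  static constexpr uint32_t MAX_CHUNK = (1u << 16) - 1;

  explicit PackedTask(void* p, uint32_t chunk = 0) {
    uint64_t raw = reinterpret_cast<uintptr_t>(p);
    assert((raw & ~PTR_MASK) == 0 && "pointer must fit in 48 bits");
    assert(chunk <= MAX_CHUNK && "chunk index must fit in 16 bits");
    _bits = raw | (uint64_t(chunk) << PTR_BITS);
  }

  // Recover the object pointer by masking off the stolen bits.
  void* obj() const {
    return reinterpret_cast<void*>(static_cast<uintptr_t>(_bits & PTR_MASK));
  }

  // Recover the chunk index from the stolen high bits.
  uint32_t chunk() const { return uint32_t(_bits >> PTR_BITS); }

  bool is_array_chunk() const { return chunk() != 0; }
};
```

With a layout like this, a regular oop task is a single 64-bit word to copy on
every push/poll, while array chunks still carry enough information to be
processed in parallel.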

This patch improves concurrent mark times for regular oops by about 6% on
average:

retain.Tree -p size=50000000:

 Baseline: Concurrent Marking =    99.17 s (a =   826446 us) (n =   120)
             (lvls, us =   806641,   826172,   839844,   841797,   887344)

  Patched: Concurrent Marking =    93.77 s (a =   774975 us) (n =   121)
             (lvls, us =   753906,   771484,   785156,   787109,   837818)

...and also ever-so-slightly improves the average for object arrays (the total
is higher only because more cycles ran, n = 216 vs. 212):

retain.RefArray -p size=2000000000:

 Baseline: Concurrent Marking =   157.29 s (a =   741921 us) (n =   212)
             (lvls, us =   720703,   740234,   753906,   755859,   822552)

  Patched: Concurrent Marking =   158.64 s (a =   734448 us) (n =   216)
             (lvls, us =   720703,   734375,   744141,   746094,   764200)

Less targeted workloads also see improved concurrent mark times, e.g.
Compiler.compiler:

 Baseline: Concurrent Marking =     3.87 s (a =   168337 us) (n =    23)
             (lvls, us =    93750,   103516,   154297,   232422,   439476)

  Patched: Concurrent Marking =     2.53 s (a =   120386 us) (n =    21)
             (lvls, us =    76953,    93164,   103516,   125000,   400385)

Testing: hotspot_gc_shenandoah, jcstress tests-all.

Thanks,
-Aleksey