ObjectSynchronizer iterate only in-use monitors?

Wed May 10 20:41:33 UTC 2017

Hello,

I have a question related to ObjectSynchronizer. We (the Shenandoah GC
devs) found that for some programs, scanning ObjectSynchronizer roots
takes quite a long time. ObjectSynchronizer::oops_do() scans all the
blocks in gBlockList. As far as I understand, this list contains the
monitor blocks of all threads, both those currently in use and free ones.

If I understand it correctly, it would be sufficient to scan only the
in-use monitors. And since each thread has its own in-use list (at least
with MonitorInUseLists), it should be OK to scan that list as part of
each thread's root scan, plus one additional scan of gOmInUseList.
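
To make the idea concrete, here is a minimal sketch in plain C++. It is
not HotSpot code: Monitor, ThreadMonitors, scan_all_blocks and
scan_in_use are hypothetical stand-ins, and the assumption that a free
monitor has a null object pointer is mine. It only illustrates the
difference between walking every slot of every block and walking the
per-thread in-use lists plus one global in-use list.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an object monitor (not the HotSpot type).
struct Monitor {
    const void* object = nullptr;      // associated object; assumed null when free
    Monitor*    next_in_use = nullptr; // link within an in-use list
};

// Hypothetical stand-in for a thread that owns a private in-use list.
struct ThreadMonitors {
    Monitor* in_use_head = nullptr;
};

// Baseline idea: touch every slot in every block, skipping free slots.
// Returns the number of monitors touched.
template <typename Visitor>
std::size_t scan_all_blocks(const std::vector<std::vector<Monitor>>& blocks,
                            Visitor visit) {
    std::size_t touched = 0;
    for (const auto& block : blocks) {
        for (const auto& m : block) {
            ++touched;
            if (m.object != nullptr) visit(m.object);
        }
    }
    return touched;
}

// Walk a single in-use list.
template <typename Visitor>
std::size_t scan_in_use_list(const Monitor* head, Visitor visit) {
    std::size_t touched = 0;
    for (const Monitor* m = head; m != nullptr; m = m->next_in_use) {
        ++touched;
        if (m->object != nullptr) visit(m->object);
    }
    return touched;
}

// Proposed idea: each thread's in-use list, then the global in-use list.
template <typename Visitor>
std::size_t scan_in_use(const std::vector<ThreadMonitors>& threads,
                        const Monitor* global_in_use_head,
                        Visitor visit) {
    std::size_t touched = 0;
    for (const auto& t : threads) {
        touched += scan_in_use_list(t.in_use_head, visit);
    }
    touched += scan_in_use_list(global_in_use_head, visit);
    return touched;
}
```

With one block of 8 monitors of which 3 are in use, scan_all_blocks
touches 8 slots while scan_in_use touches 3; both visit the same 3
objects. The cost of the second variant scales with the number of
in-use monitors rather than with the number of allocated ones.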

I am writing here because I would like to get confirmation that what I'm
doing is sane, or to learn of any pitfalls that I'm not aware of. The
webrev in question (against shenandoah/jdk9) is this:

http://cr.openjdk.java.net/~rkennke/fastsyncroots/webrev.00/

I tested it by running SPECjvm2008 and jcstress and found no ill
effects.

Performance-wise it makes a very significant difference (running
gc-bench's roots.Sync test, which exaggerates synchronizer usage):

baseline:
[14,393s][info][gc,stats]     S: Thread Roots         =     0,34 s (a =    37748 us) (n =     9) (lvls, us =    36523,    36523,    36914,    37305,    42215)
[14,393s][info][gc,stats]     S: Synchronizer Roots   =     0,14 s (a =    15115 us) (n =     9) (lvls, us =     9746,    10938,    14258,    14648,    25847)
[14,393s][info][gc,stats]     UR: Thread Roots        =     0,22 s (a =    24967 us) (n =     9) (lvls, us =    12305,    24219,    25977,    27148,    27758)
[14,393s][info][gc,stats]     UR: Synchronizer Roots  =     0,11 s (a =    11906 us) (n =     9) (lvls, us =     8340,     9082,    12109,    12695,    13787)

patched:
[14,293s][info][gc,stats]     S: Thread Roots         =     0,36 s (a =    40365 us) (n =     9) (lvls, us =    32031,    32031,    34570,    37109,    67224)
[14,293s][info][gc,stats]     S: Synchronizer Roots   =     0,00 s (a =        0 us) (n =     9) (lvls, us =        0,        0,        0,        0,        0)
[14,294s][info][gc,stats]     UR: Thread Roots        =     0,22 s (a =    24459 us) (n =     9) (lvls, us =    15820,    20508,    22070,    26172,    32573)
[14,294s][info][gc,stats]     UR: Synchronizer Roots  =     0,00 s (a =        0 us) (n =     9) (lvls, us =        0,        0,        0,        0,        0)

Notice that thread root scanning goes up a little, but not nearly as
much as synchronizer root scanning goes down.
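
The net effect can be worked out from the per-phase averages (the "a ="
values) in the logs above. The snippet below is only that arithmetic
written as compile-time checks; the type and function names are made up
for illustration, and the "S" and "UR" labels are taken from the log
lines as they are.

```cpp
// Average time per phase in microseconds, copied from the "a =" fields.
struct PhaseAvg {
    long thread_roots_us;
    long sync_roots_us;
};

constexpr long total(PhaseAvg p) {
    return p.thread_roots_us + p.sync_roots_us;
}

// Positive result means the patched run spent less time on these roots.
constexpr long net_saving(PhaseAvg before, PhaseAvg after) {
    return total(before) - total(after);
}

constexpr PhaseAvg S_baseline{37748, 15115};
constexpr PhaseAvg S_patched{40365, 0};
constexpr PhaseAvg UR_baseline{24967, 11906};
constexpr PhaseAvg UR_patched{24459, 0};

// S:  thread roots +2617 us, synchronizer roots -15115 us.
static_assert(net_saving(S_baseline, S_patched) == 12498, "S phase");
// UR: thread roots -508 us, synchronizer roots -11906 us.
static_assert(net_saving(UR_baseline, UR_patched) == 12414, "UR phase");
```

So, on average, each of the two phases spends roughly 12 ms less on
thread plus synchronizer roots with the patch, in this benchmark and
with n = 9 samples.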

If you think what I'm doing is sane, this might even be useful for other
GCs (although they're probably not as much bound by roots scanning as
Shenandoah is).

Thanks, Roman