RFR(M/L): 7176479: G1: JVM crashes on T5-8 system with 1.5 TB heap

Thu Jan 24 00:36:53 UTC 2013

Hi Vitalty,

On 1/23/2013 4:19 PM, Vitaly Davidovich wrote:
>
> Hi John,
>
> Thanks for this explanation as well.  I see what you're saying about 
> the concurrency control, but what I don't understand is when this is 
> called:
>
> void reset_hot_cache() {
> 107     _hot_cache_idx = 0; _n_hot = 0;
> 108   }
>
> Since these are plain stores, what exactly ensures that they're 
> (promptly) visible to other GC threads? Is there some dependency here, 
> e.g. if you see _n_hot = 0 then _hot_cache_idx must also be zero? I 
> strongly suspect I missed the details in your response that explain 
> why this isn't a concern.  Is there only a particular type of thread 
> that can call reset_hot_cache and/or only at a certain point? It kind 
> of sounds like it so don't know if there's an assert that can be added 
> to verify that.
>

At the point where this routine is called the GC workers have finished 
the parallel phases of the GC and are idle. The thread that is running 
is the VM thread. The rest of the VM is at a safepoint so we are, in 
effect, single threaded. Yes there is an assert we can add here:

assert(SafepointSynchronize::is_at_safepoint() && 
Thread::current()->is_VM_thread(), "...");

JohnC

> Thanks
>
> Sent from my phone
>
> On Jan 23, 2013 5:51 PM, "John Cuthbertson" 
> <john.cuthbertson at oracle.com <mailto:john.cuthbertson at oracle.com>> wrote:
>
>     Hi Vitaly,
>
>     Thanks for looking over the code changes. I'll respond to your
>     other comments in a separate email. Detailed responses inline....
>
>     On 1/15/2013 4:57 PM, Vitaly Davidovich wrote:
>
>
>         Hi John,
>
>         Wow, that's a giant heap! :)
>
>         I think G1ConcRSLogCacheSize needs to be validated to make
>         sure it's <= 31; otherwise, I think you get undefined behavior
>         on left shifting with it.
>
>
>     Good catch. Done.
>
>         I don't think you need _def_use_cache -- can be replaced with
>         G1ConcRSLogCacheSize > 0?
>
>
>     Done. I've added a function that returns the result of the
>     comparison and I use that in place of G1ConcRSLogCacheSize.
>
>         I'm sure this is due to my lack of G1 knowledge, but the
>         concurrency control inside g1HotCardCache is a bit unclear.
>         There's a CAS to claim the region of cards, there's a HotCache
>         lock for inserting a card.  However, reset_hot_cache() does a
>         naked write of a few fields.  Are there any visibility and
>         ordering constraints that need to be enforced? Do some of the
>         stores need an OrderAccess barrier of some sort, depending on
>         what's required? Sorry if I'm just missing it ...
>
>
>     The drain routine is only called from within a GC pause but it is
>     called by multiple GC worker threads. Each worker will claim a
>     chunk of cards using the CAS and refine them. Resetting the
>     boundaries (the values reset by reset_hot_cache()) in the drain
>     routine would be a mistake since a worker thread could see the new
>     boundary values and return, potentially leaving some cards
>     unrefined and some missing entries in remembered sets. I can only
>     clear the fields when the last thread has finished draining the
>     cache. The best place to do this is just before the VM thread
>     re-enables the cache (we know the worker threads will have
>     finished at this point). Since the "drain" doesn't actually drain,
>     perhaps a better name might be refine_all()?
>
>     The HotCache lock is used when adding entries to the cache.
>     Entries are added by the refinement threads (and there will most
>     likely be more than one). Since the act of adding an entry can
>     also evict an entry we need the lock to guard against hitting the
>     ABA problem. This could result in skipping the refinement of a
>     card, which will lead to missing remembered set entries which are
>     not fun to track down.
>
>     Draining during the GC is immune from the ABA problem because
>     we're not actually removing entries from the cache. We would still
>     be immune, however, if we were removing entries since we would not
>     be adding entries at the same time.
>
>     Thanks,
>
>     JohnC
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20130123/ff49868d/attachment.htm>