RFR (M): 8077144: Concurrent mark initialization takes too long

Thomas Schatzl thomas.schatzl at oracle.com
Tue Mar 29 08:44:41 UTC 2016


Hi,

On Fri, 2016-03-25 at 21:38 -0400, Kim Barrett wrote:
> > On Mar 15, 2016, at 6:12 PM, Thomas Schatzl <
> > thomas.schatzl at oracle.com> wrote:
> > 
> > Hi Mikael,
> > 
> >  updated webrev at
> > http://cr.openjdk.java.net/~tschatzl/8077144/webrev.3/ (full)
> > http://cr.openjdk.java.net/~tschatzl/8077144/webrev.2_to_3/ (diff)
> > 
> > which implements the suggested changes.
> 
> I've run out of steam for looking at g1ConcurrentMark.cpp changes for
> now.  I may have more comments there later.  I like the new approach
> though. 
> 
> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.cpp
>   51 #include "logging/logTag.hpp"
> 
> Why add this #include?  log.hpp appears to be the "include all of
> logging" header, and it's already included.  And I didn't find
> anything specific to log tags in the changes.

Removed with JDK-8151386.

> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.cpp  
> 1142   inline void set_card_bitmap_range(BitMap* bm,
> 
> Is this *really* necessary?  I suspect set_range with a small_range
> hint would get much of the benefit (especially with the
> _last_marked_bit_idx optimization in the caller), with a *lot* less
> code.  E.g.
> 
>   bm.set_range(beg, end, BitMap::small_range);
> 
> And that's ignoring the question of where this should actually cross
> over between looping and calling set_range; do you have measurements
> to show that 8 is actually a good crossing point?

I do not know, this is just existing code. This would require
significant effort to (dis-)prove this, and I do not have time to look
into this at this point.

I believe it does not matter as long as the suspect common case, either
one or two bits filled are covered by extra handling.

Note that set_range(.., BitMap::small_range) is still pretty
heavyweight (and it does not cover the two-bit case), although in
general I think this will be kind of lost in the noise.

I filed JDK-8152932 to investigate this.

> Just using set_range would also make this assert superfluous:
> 1148     assert(end_idx <= bm->size(), "sanity");

Actually the assert itself is superfluous, as all BitMap methods check
ranges by themselves. 

Removed with JDK-8151386.

> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.cpp  
> 1151     // use par_at_put_range (if parallel). The range is made up
> of the
> 
> I see no parallelism here.  Maybe this comment has some left over cut
> and paste?

Already removed with JDK-8151386.

> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
> 1199     BitMap::idx_t card_num = (BitMap::idx_t)(uintptr_t(addr) >>
> CardTableModRefBS::card_shift);
> 
> I would like the cast to BitMap::idx_t removed; I don't think it
> serves any useful purpose because idx_t is size_t is big enough.  And
> if it did perform a narrowing, I'd rather not have that hidden
> if/when
> we someday turn on -Wconversion or the like.
> 
> Personally, I'd prefer an explict reinterpret_cast for the uintptr_t
> conversion, but I realize using C++ cast operators is still kind of 
> an unusual thing in this code base.

I would prefer to do this in a more global extra sweep if we consider
it appropriate to do in the future. However I removed both casts with
JDK-8151386.

> 1286       (BitMap::idx_t)(uintptr_t(G1CollectedHeap::heap()
> ->reserved_region().start()) >> CardTableModRefBS::card_shift);
> 
> Same thing here.
> 
> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.cpp
> 1260       int obj_sz = obj->size();
> ...
> 1268       marked_bytes += (size_t)obj_sz * HeapWordSize;
> 
> [Pre-existing, feel free to ignore.]
> 
> obj_sz could be declared size_t, since obj->size() is not going to
> 
> return a negative value, and then the cast could be removed.

Do you prefer the cast in 1260? Done.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp
> 1206   void set_bit_for_region(HeapRegion* hr) {
> 1207     BitMap::idx_t index = (BitMap::idx_t) hr->hrm_index();
> 1208     _region_bm->par_at_put(index, true);
> 1209   }
> 
> This is the only use of _region_bm in the G1LiveDataHelper class. Of
> the three places that make one of these, two provide a region bitmap
> and one passes in NULL, meaning for that one it can't actually call
> this function, which suggests this function shouldn't really be part
> of the helper class. (It could be a static member taking a bitmap
> argument, or a similar file scoped helper function.)

That code goes away as soon as the remembered set changes are in. As
mentioned in the review for JDK-8151386, I do not think bothering about
this now is worth the time.

> 
> [Pre-existing]
> The explicit cast of hrm_index to BitMap::idx_t serves no useful
> purpose, and if it were a narrowing conversion the cast would just
> mask that if we ever turn on -Wconversion or the like.

Already fixed in JDK-8151386.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp  
> 1281     //assert(region_bm != NULL, "");
> 1282     assert(card_bm != NULL, "");
> 
> Left-over debugging code?  Later: Oh, the commented out assert is
> related to the above discussion about set_bit_for_region.

Already fixed in JDK-8151386.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>  688 void G1ConcurrentMark::cleanup_for_next_mark() {
> ...
>  699   clear_bitmap(_nextMarkBitMap, _parallel_workers, true);
>  700 
>  701   // Clear the live count data. If the marking has been aborted,
> the abort()
>  702   // call already did that.
>  703   if (!has_aborted()) {
>  704     clear_all_live_data(_parallel_workers);
>  705     DEBUG_ONLY(verify_all_live_data());
>  706   }
> 
> [Pre-existing]
> abort() also clears the _nextMarkBitMap, but appears to do so in STW.
> Seems like one of those is superfluous.  And wouldn't it be better to
> not do any of these in (STW) abort() but rather do all of them here?

Abort() can only occur in full gc, so the extra card bitmap clear does
not matter much.

Also, the existing marking "state" machine assumes that after full gc
the bitmaps/card live data is clear, so this would need extensive
changes there.
Further, this would also mean that after a full gc G1 can't immediately
start marking, as it needs to clear this information first, possibly
leading to another full gc right away (in border conditions) because it
does take some time.

I filed JDK-

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp  
> 2496       _bitmap->clear_range(start, end);
> 
> If it weren't for the risk of tripping the (IMO bogus) assert on
> minimum size in BitMap::clear_large_range, I would suggest calling
> clear_range with a large_range hint.

Let's make the discussion about usefulness of the various BitMap
methods as another discussion. It does not matter.

> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
> 2475 class G1ClearAllLiveDataTask : public AbstractGangTask {
> ...
> 2494       BitMap::idx_t start = M * BitsPerByte * to_process;
> 2495       BitMap::idx_t end = MIN2(start + M * BitsPerByte, _bitmap
> ->size());
> ...
> 2501 void G1ConcurrentMark::clear_all_live_data(WorkGang* workers) {
> ...
> 2506   size_t const num_chunks =
> align_size_up(_card_live_bm.size_in_words() * HeapWordSize, M) / M;
> 
> These two snippets implicitly share knowledge that the chunking size
> for tasks is 1Mbyte.  One uses "chunk" and the other uses "task"
> terminology, which further obscures this relationship.  Consistent
> terminology and a static constant in G1ClearAllLiveDataTask that is
> used in both places would help.

Fixed this with JDK-8151386.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp
> 1130 class G1LiveDataHelper VALUE_OBJ_CLASS_SPEC {
> ...
> 1142   inline void set_card_bitmap_range(BitMap* bm,
> 
> This uses non-parallel set_bit and set_range. Parallel callers are
> operating on distinct heap regions. If that isn't guaranteed, then
> two
> parallel tasks could try to modify bits in the same BitMap word
> simultaneously. We're OK so long as G1HeapRegionSize is a multiple of
> 
>   BitsPerWord * (1 << CardTableModRefBs::card_shift)
> 
> (which is 2^15 with 512 byte cards on a 64bit machine), but
> G1HeapRegionSize is a product option with a relatively unconstrained
> value (only bounded between 1M and 32M, no alignment constraint).
> 
> Oh, but, it looks like HeapRegion::setup_heap_region_size ensures the
> region size is a power of 2 within the min/max bounds, rounding up an
> arbitrary command line value if necessary.
> 
> So we're OK, though it required some searching to convince myself of
> that.  An assertion somewhere in G1LiveDataHelper might be
> appropriate, since it doesn't ever use par_xxx bitmap functions,
> unlike the old code, which selected between them based on an argument
> that no longer exists.

The parallel version in the old code were superfluous and just slowing
down execution.

The check is in JDK-8151386, which as just saw is not complete. I fixed
it there.

> ---------------------------------------------------------------------
> --------- 
> src/share/vm/gc/g1/g1ConcurrentMark.hpp
>  307   // A set bit indicates that the given card contains a live
> object. 
> 
> "... contains a live object." is perhaps imprecise. The card might
> contain multiple objects, and might contain parts of one or two
> objects that cross either card boundary. Something like "... contains
> at least part of at least one live object." would be more precise.
> But maybe I'm being overly pedantic.

In JDK-8151386 I fixed this a little, but I changed it to your
suggestions there.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.hpp
>  266 class G1ConcurrentMark: public CHeapObj<mtGC> {
> ...
>  280 protected:
> 
> [Pre-existing, feel free to ignore.]
> Why is all this stuff protected rather than private?


> 
> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1ConcurrentMark.hpp
>  855   // Precondition: obj is below region's NTAMS.
> 
> There is no longer an argument called "region"; it's now the region
> containing obj. (Actually, it always ways, but there used to also be 
> a precondition that obj was in region.)

Fixed in JDK-8151386.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/gc/g1/g1OopClosures.inline.hpp
>  134     _cm->grayRoot(obj, hr);
> 
> With the removal of _worker_i as an argument here, the _worker_i
> member of G1RootRegionScanClosure appears to be unused.

Good catch, thanks. Fixed in JDK-8151386.

> ---------------------------------------------------------------------
> ---------
> src/share/vm/utilities/bitMap.cpp
>   72   os::pretouch_memory((char*)word_addr(0),
> (char*)word_addr(size()));
> 
> Casts to char* are no longer needed, since fix for JDK-8151414.
> 
> Also, copyright needs updating. 

Fixed in JDK-8151386.

> 
> ---------------------------------------------------------------------
> ---------
> src/share/vm/utilities/bitMap.hpp
>  147   static idx_t size_in_words(size_t size_in_bits) {
> 
> I find the name of this function confusing in conjunction with the
> no-arg ordinary member function; I'm not keen on overloads with with
> very different semantics and usage. Clearer (to me) here might be
> something like calc_size_in_words.

Fixed in JDK-8151386.

> Also, copyright needs updating. 

Already fixed in JDK-8151386.

Thanks for your very extensive review :) As you noticed I deferred the
actual fixes to JDK-8151386, as almost all of these suggestions here
would conflict with that change. If you think I should fix and rebase
everything in this CR, I propose we just merge JDK-8151386 into this
CR.

Typically I do tend to first do the cleanups and then fix, but
initially I did not actually intend to do more thorough cleanups in
this area, and just focus on the performance aspect. However at this
time it would be a non-trivial amount of work to change the order of
these fixes, hence my proposal to merge JDK-8151386, so I hope this
order is acceptable.

I will soon post a new webrev for JDK-8151386 incorporating your
suggestions.

Thanks,
  Thomas




More information about the hotspot-gc-dev mailing list