RFR: 8376839: GenShen: Improve performance of evacuations into the old generation

Kelvin Nilsen kdnilsen at openjdk.org
Sat Jan 31 01:26:43 UTC 2026


On Sat, 31 Jan 2026 00:11:07 GMT, William Kemper <wkemper at openjdk.org> wrote:

> When GenShen evacuates an object into the old generation, it also dirties the card for that object and updates the offsets of the first and last object in the card. In many cases, the same card may be dirtied repeatedly and the object starts are updated unnecessarily. We can reduce the total amount of work by moving these operations into a separate phase of the cycle which allows them to be batched.

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 41:

> 39:   ShenandoahScanRemembered*   const _scanner;
> 40: 
> 41: public:

I had not initially appreciated that we are investing in more precise dirtying of cards as part of this PR.  Please check my analysis of the tradeoffs here:

Option 1 (as currently implemented):
1. This will take longer to do entry_update_card_table() because we have to rescan every copied object.  This rescanning may also result in increased contention for cache lines and memory bus with mutator threads during this phase.  This also results in redundant dirtying of cards for any card that holds more than one "interesting pointer".
2. The benefit of this option is that our subsequent scan-remembered pass will have less work to do because potentially fewer cards will need to be scanned.

Option 2 (blindly dirty the entire range of copied objects):
1. This matches current implementation.  The existing design is based on the idea that it is "overall" more efficient to scan this data once rather than twice.  We'll scan the data once when we next scan remembered set.
2. The scan-once benefit applies only to cards that are dirty.  In option 1, we scan data corresponding to dirty cards twice.  In option 2, we scan data corresponding to dirty cards once.  Data corresponding to clean cards is scanned only once in either option, though the timing of when we scan that data is different.
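
To make the two options concrete, here is a toy model (not the PR's code). Every name in it (ToySlot, dirty_precisely, dirty_blindly, count_dirty) and the 64-word card size are assumptions chosen for illustration only. It counts card-table stores and the number of cards left dirty for the later scan-remembered pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: names and sizes are illustrative assumptions.
constexpr std::size_t kCardWords = 64;  // heap words covered by one card

struct ToySlot {
  std::size_t word_index;  // position of a reference slot within the copied range
  bool points_to_young;    // true for an "interesting" pointer
};

// Option 1: visit every slot, dirty only cards that hold an interesting pointer.
// Returns the number of card-table stores performed.
std::size_t dirty_precisely(const std::vector<ToySlot>& slots,
                            std::vector<std::uint8_t>& cards) {
  std::size_t stores = 0;
  for (const ToySlot& s : slots) {
    if (s.points_to_young) {
      const std::size_t card = s.word_index / kCardWords;
      assert(card < cards.size());
      cards[card] = 1;  // may repeat for a card with several interesting pointers
      ++stores;
    }
  }
  return stores;
}

// Option 2: dirty every card overlapping [first_word, end_word) without
// looking at any slot. Returns the number of card-table stores performed.
std::size_t dirty_blindly(std::size_t first_word, std::size_t end_word,
                          std::vector<std::uint8_t>& cards) {
  if (first_word >= end_word) return 0;
  const std::size_t first_card = first_word / kCardWords;
  const std::size_t last_card = (end_word - 1) / kCardWords;
  assert(last_card < cards.size());
  for (std::size_t c = first_card; c <= last_card; ++c) cards[c] = 1;
  return last_card - first_card + 1;
}

// Number of cards the next scan-remembered pass would have to visit.
std::size_t count_dirty(const std::vector<std::uint8_t>& cards) {
  std::size_t n = 0;
  for (std::uint8_t c : cards) n += (c != 0);
  return n;
}
```

With slots at words 10, 20, 70 and 200, where only the slot at word 70 is uninteresting, and a copied range of [0, 256), option 1 performs three stores and leaves two cards dirty, while option 2 performs four stores and leaves four cards dirty. Which side wins in practice depends on the density of interesting pointers, which is why measurements would help.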

I'm wondering if we've done any experiment to evaluate the tradeoffs of these alternative approaches on various workloads?

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 210:

> 208:   log_debug(gc, remset)("Update remembered set from " PTR_FORMAT ", to " PTR_FORMAT, p2i(start), p2i(end));
> 209: 
> 210:   while (address < end) {

I'm assuming there must be a preparatory pass over all cards to pre-initialize each one, denoting that each card does not hold the start of an object.  Then, this loop changes that state only for the cards that do hold the start of an object.

I haven't worked through all the details, so my intuition may be wrong here.  But it feels to me like we could skip the preparatory pass by making a small change to how this loop is structured.  The following is my "first" impulse for how I would write this loop.  I'm not sure it's better, but offer it for your consideration.

next_relevant_object = address
for each card_index in the range {
  if the next_relevant_object pertains to this card {
    set_first_start(card_index, offset_in_card(next_relevant_object));
    while (next_relevant_object + next_relevant_object->size() < addr_for_card_index(card_index + 1)) {
      next_relevant_object += next_relevant_object->size();
    }
    set_last_start(card_index, offset_in_card(next_relevant_object));
    next_relevant_object += next_relevant_object->size();
  } else {
    clear_card_status(card_index);  // no objects start in this card's range
  }
}
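
A self-contained toy version of the card-by-card loop, for checking the logic outside HotSpot. It is only a sketch under stated assumptions: objects are given as a list of sizes laid out contiguously from word 0, the range starts on a card boundary, and ToyCardInfo, register_objects and the 64-word card size are hypothetical names and values, not the real ShenandoahScanRemembered API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: names and sizes are illustrative assumptions.
constexpr std::size_t kCardWords = 64;  // heap words covered by one card

struct ToyCardInfo {
  bool has_object;          // does any object start in this card?
  std::size_t first_start;  // word offset of the first object start in the card
  std::size_t last_start;   // word offset of the last object start in the card
};

// sizes[i] is the size in words of the i-th object; objects are contiguous
// and the first one starts at word 0 (a card boundary).
// Every card in `cards` is written exactly once, so stale state is overwritten
// without a separate preparatory pass.
void register_objects(const std::vector<std::size_t>& sizes,
                      std::vector<ToyCardInfo>& cards) {
  std::size_t idx = 0;         // index of the next relevant object
  std::size_t next_start = 0;  // word at which that object starts
  for (std::size_t card = 0; card < cards.size(); ++card) {
    const std::size_t card_begin = card * kCardWords;
    const std::size_t card_end = card_begin + kCardWords;
    if (idx < sizes.size() && next_start < card_end) {
      // Earlier starts were consumed by earlier cards, so this one is in range.
      assert(next_start >= card_begin);
      cards[card].has_object = true;
      cards[card].first_start = next_start - card_begin;
      // Advance while the following object also starts inside this card.
      while (idx + 1 < sizes.size() && next_start + sizes[idx] < card_end) {
        next_start += sizes[idx];
        ++idx;
      }
      cards[card].last_start = next_start - card_begin;
      // Step past the last object that starts in this card.
      next_start += sizes[idx];
      ++idx;
    } else {
      cards[card] = ToyCardInfo{false, 0, 0};  // no object starts in this card
    }
  }
}
```

Unlike the pseudocode above, this version also stops when it runs out of objects, which the real loop would presumably express as a comparison against end.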

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 223:

> 221:       }
> 222: 
> 223:       current_card_index = object_card_index;

IIUC, current_card_index corresponds to previous_offset and previous_address in the next iteration of this loop.  For clarity in naming of variables, would it make sense to call this "previous_card_index"?

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 234:

> 232: 
> 233:     const oop obj = cast_to_oop(address);
> 234:     address += obj->oop_iterate_size(&make_cards_dirty);

It feels to me like this code will still redundantly mark a card dirty for as many objects as touch this card.  Wouldn't it be faster to have a single call outside this loop to mark all cards dirty in the range from address to end?
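
A rough illustration of the difference in card-table stores, again as a toy model rather than the PR's code. The helper names (dirty_per_object, dirty_range), the 512-byte card size and the clean/dirty byte values are assumptions made for this sketch, and the per-object variant models "one store per card each object touches" as described above, not what the closure in the PR actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model only: names and values are illustrative assumptions.
constexpr std::size_t kCardShift = 9;       // assumed 512-byte cards
constexpr std::uint8_t kCleanCard = 0xff;   // assumed clean value
constexpr std::uint8_t kDirtyCard = 0x00;   // assumed dirty value

// Walk contiguous objects starting at byte `start` and dirty every card each
// object touches. Returns the number of card-table stores performed.
std::size_t dirty_per_object(std::vector<std::uint8_t>& cards, std::size_t start,
                             const std::vector<std::size_t>& sizes) {
  std::size_t stores = 0;
  std::size_t address = start;
  for (std::size_t size : sizes) {
    assert(size > 0);
    const std::size_t first = address >> kCardShift;
    const std::size_t last = (address + size - 1) >> kCardShift;
    assert(last < cards.size());
    for (std::size_t c = first; c <= last; ++c) {
      cards[c] = kDirtyCard;  // repeated for every object sharing this card
      ++stores;
    }
    address += size;
  }
  return stores;
}

// Dirty every card overlapping the byte range [start, end) with one memset.
// Returns the number of cards written.
std::size_t dirty_range(std::vector<std::uint8_t>& cards, std::size_t start,
                        std::size_t end) {
  if (start >= end) return 0;
  const std::size_t first = start >> kCardShift;
  const std::size_t last = (end - 1) >> kCardShift;
  assert(last < cards.size());
  std::memset(cards.data() + first, kDirtyCard, last - first + 1);
  return last - first + 1;
}
```

For ten 100-byte objects starting at byte 0, the per-object walk performs 11 stores while the single range call writes 2 cards, and both leave the card table in the same state.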

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 238:

> 236: 
> 237:   // Register the last object seen in this range.
> 238:   set_last_start(current_card_index, previous_offset);

It seems this statement should only be executed if previous_address != nullptr.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29511#discussion_r2748636778
PR Review Comment: https://git.openjdk.org/jdk/pull/29511#discussion_r2748623406
PR Review Comment: https://git.openjdk.org/jdk/pull/29511#discussion_r2748607951
PR Review Comment: https://git.openjdk.org/jdk/pull/29511#discussion_r2748599979
PR Review Comment: https://git.openjdk.org/jdk/pull/29511#discussion_r2748610554

