RFC (S): Prefetching during mark scans

Wed Nov 2 13:49:15 UTC 2016

On 11/02/2016 02:12 PM, Roman Kennke wrote:
> - You said the users of the bitmap improve. You're prefetching the oop
> though. Would it be useful to prefetch the bitmap too?

I don't think so. At least the HWC profiling does not show the bitmap as
the source of cache misses during the scan. I think the reason is that
the bitmap read is linear, and thus well predicted/prefetched by the
hardware. What we prefetch instead is data that is potentially far
away: the oop contents.
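
To illustrate the idea, here is a minimal standalone sketch. It is not
the actual patch: FakeObject, sum_marked and the "distance" parameter
are hypothetical names, the bitmap is a plain std::vector<bool>, and
__builtin_prefetch assumes GCC or Clang. The bitmap itself is read
linearly with no software prefetch, while a lookahead cursor issues
read prefetches for the next few marked objects:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a heap object; not a HotSpot type.
struct FakeObject {
  std::uint64_t header;
  std::uint64_t payload;
};

// Scans the mark bitmap linearly and sums the payload of every marked
// object. A second cursor ("ahead") runs in front of the visiting cursor
// and prefetches up to `distance` upcoming marked objects for read.
std::uint64_t sum_marked(const std::vector<bool>& marks,
                         const std::vector<FakeObject>& heap,
                         std::size_t distance) {
  assert(marks.size() == heap.size());
  const std::size_t n = marks.size();
  std::size_t ahead = 0;      // lookahead cursor into the bitmap
  std::size_t in_flight = 0;  // prefetched marked objects strictly ahead of i
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (!marks[i]) continue;
    if (ahead > i) {
      --in_flight;            // object i was prefetched earlier; it is current now
    } else {
      ahead = i + 1;          // lookahead fell behind (start of scan, or distance == 0)
      in_flight = 0;
    }
    // Top up the prefetch window with the next marked objects.
    while (in_flight < distance && ahead < n) {
      if (marks[ahead]) {
        __builtin_prefetch(&heap[ahead], 0 /* read */, 3 /* keep in all cache levels */);
        ++in_flight;
      }
      ++ahead;
    }
    sum += heap[i].payload;   // the actual "visit" of the object
  }
  return sum;
}
```

The prefetch is only a hint, so the result is the same for any
distance; only the cache behavior changes.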

> - You're prefetching for read. However, most users also write. Maybe
> prefetch for write too? That would be 2 different writes though: either
> the copy location, and in another case the updating of refs.

Yes, I thought about it, but the Prefetch::write says...

inline void Prefetch::write(void *loc, intx interval) {
#ifdef AMD64

  // Do not use the 3dnow prefetchw instruction.  It isn't supported on em64t.
  //  __asm__ ("prefetchw (%0,%1,1)" : : "r" (loc), "r" (interval));
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));

#endif // AMD64
}

...so it issues the same instruction as Prefetch::read. I think that
part needs an overhaul in HotSpot!

We could specialize marked_object_iterate to accept, e.g., either an
ObjectReadClosure or an ObjectWriteClosure, and do the Prefetch::write
for the "write" one. Maybe there is a better way to communicate the
hint without messing up the hot loop...
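
One way this could look, as a rough sketch under stated assumptions:
the closure carries the hint as a nested type, so the choice is made
at compile time and the hot loop gets no extra branch. ReadHint,
WriteHint, SumClosure, ClearClosure and toy_marked_object_iterate are
all hypothetical names, the "objects" are plain ints, and
__builtin_prefetch assumes GCC or Clang:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hint tags; the rw argument of __builtin_prefetch must be a
// compile-time constant, so each tag hardcodes its own value.
struct ReadHint {
  static void prefetch(const void* p) { __builtin_prefetch(p, 0, 3); }
};
struct WriteHint {
  static void prefetch(const void* p) { __builtin_prefetch(p, 1, 3); }
};

// A closure that only reads the objects it visits.
struct SumClosure {
  typedef ReadHint Hint;
  long total;
  SumClosure() : total(0) {}
  void do_object(int* obj) { total += *obj; }
};

// A closure that writes to the objects it visits.
struct ClearClosure {
  typedef WriteHint Hint;
  void do_object(int* obj) { *obj = 0; }
};

// Toy iteration: visits every marked object, and prefetches the next slot
// (if marked) with whichever hint the closure type declares.
template <typename Closure>
void toy_marked_object_iterate(std::vector<int>& objects,
                               const std::vector<bool>& marks,
                               Closure& cl) {
  assert(objects.size() == marks.size());
  for (std::size_t i = 0; i < objects.size(); ++i) {
    if (i + 1 < objects.size() && marks[i + 1]) {
      Closure::Hint::prefetch(&objects[i + 1]);
    }
    if (marks[i]) {
      cl.do_object(&objects[i]);
    }
  }
}
```

Whether the write hint buys anything depends on the target actually
emitting a distinct instruction for it, which is exactly what
Prefetch::write does not do on AMD64 today.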

Thanks,
-Aleksey

> Roman
> 
> On Wednesday, 02.11.2016, at 13:33 +0100, Aleksey Shipilev wrote:
>> Hi,
>>
>> This describes the work in progress, but I would like early feedback,
>> because re-running perf experiments is tedious, and every little
>> change there affects performance.
>>
>> It is no surprise that our GC blows the CPU caches when walking the
>> heap. Within the mark phase, there is little we can do, because the
>> object graph is random in the worst case. But once we have marked, we
>> have the marked addresses bitmap in our hands, which we scan
>> *linearly*. This means that, knowing we will access oop fields,
>> headers, etc. while scanning that bitmap, we could prefetch oop
>> contents in advance, long before we actually reference them.
>>
>> This is the prototype patch that affects only mark-compact via
>> ShenandoahHeapRegion::marked_object_iterate:
>>   http://cr.openjdk.java.net/~shade/shenandoah/markscan-prefetch/webrev.00/
>>
>> It does improve Full GC times significantly, because the users of the
>> marked bitmap (Calculate Addresses, Adjust Pointers, Copy Objects)
>> improve:
>>   http://cr.openjdk.java.net/~shade/shenandoah/markscan-prefetch/prefetches
>>
>> Roman is exploring whether we can merge ShenandoahHeapRegion and
>> ShenandoahHeap versions of marked_object_iterate, and I would
>> forward-port the patch there. After that, the prefetching would also
>> affect our regular concurrent GC (e.g. the scan for concurrent
>> evacuation).
>>
>> Thanks,
>> -Aleksey
>>