RFC: Parallel deferred updates

Fri Aug 19 10:47:01 UTC 2022

Hi again,

On 19.08.22 12:42, Thomas Schatzl wrote:
> Hi,
> 
> On 12.08.22 17:42, Nick Gasson wrote:
>> Hi Thomas,
> 
>    apologies for the late replies (and potentially more to come in the 
> near future), as I'm on and off on vacation right now.
> 
>>
>> Thanks for the feedback.  I've created JDK-8292296 in the JBS for this.
>>
>> On 11/08/22 17:30 pm, Thomas Schatzl wrote:
>>>
>>> Sounds good, although I would make it just a little more complicated:
>>> maybe it's useful to actually know the number of valid RegionData,
>>> counting them as they are set, and use that to size the number of
>>> threads. And/or the number of object arrays (of a particular size,
>>> e.g. larger than the threshold to split them during marking?) as a
>>> proxy for "lots of work for that object crossing the region"/"worth
>>> spinning up a thread" (if that is possible at all).
>>
>> One thing we could do is add a shared counter that's incremented
>> whenever RegionData::set_deferred_obj_addr() is called with a non-NULL
>> pointer.  That could be used to decide how many threads to use for the
>> update processing, but it doesn't tell you how many of those have a
>> large number of embedded oops.  For that you'd need to examine the class
>> and I'm not sure that's safe to do inside PSParallelCompact::fill_region
>> - what if the class pointer was spilled into the next region?
> 
> During this phase the object's size can be inferred from the bitmaps; 
> actually, if you look e.g. at psParallelCompact.cpp:2900, it looks like 
> the object's size is always calculated before getting into the 
> "ParMarkBitMap::would_overflow" branch. (I.e. the object's end is always 
> searched for; only the final size calculation is conditional, but 
> that's just a pointer_delta().)
> 
> So it looks like we have the object sizes at our disposal after all.

I was wrong about that: the bitmap iteration call at line 2898 might 
directly return ParMarkBitMap::would_overflow, I guess, at least from a 
cursory browse of the code.

However, the ParMarkBitMapClosure::do_addr called by 
ParMarkBitMap::iterate already needs the size calculated, so maybe with 
some refactoring that value could be reused.
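
To make the idea a bit more concrete, here is a rough standalone sketch
of how the shared counter and the reused object size could be combined
to size the number of threads. This is not HotSpot code: the class
DeferredUpdateStats, its methods, the "large" threshold and the
one-worker-per-N-small-objects heuristic are all made up for
illustration, and it uses std::atomic rather than HotSpot's own Atomic
helpers.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of HotSpot: tracks how many deferred
// objects were recorded and how many of them are "large", i.e. likely
// to be worth a thread of their own.
class DeferredUpdateStats {
  std::atomic<std::size_t> _deferred{0};
  std::atomic<std::size_t> _large{0};
  const std::size_t _large_threshold_words;  // assumed tuning knob

public:
  explicit DeferredUpdateStats(std::size_t large_threshold_words)
    : _large_threshold_words(large_threshold_words) {}

  // Would be called wherever a non-NULL deferred object address is set,
  // passing the size in words that the bitmap iteration already knows.
  void record(std::size_t size_in_words) {
    _deferred.fetch_add(1, std::memory_order_relaxed);
    if (size_in_words >= _large_threshold_words) {
      _large.fetch_add(1, std::memory_order_relaxed);
    }
  }

  std::size_t deferred() const {
    return _deferred.load(std::memory_order_relaxed);
  }
  std::size_t large() const {
    return _large.load(std::memory_order_relaxed);
  }

  // Made-up heuristic, to be called once recording has finished:
  // one worker per large object plus one per `small_per_worker`
  // remaining objects, but never fewer than 1 or more than max_workers.
  std::size_t suggested_workers(std::size_t max_workers,
                                std::size_t small_per_worker) const {
    assert(small_per_worker > 0);
    const std::size_t n_large = large();
    const std::size_t n_small = deferred() - n_large;
    const std::size_t wanted =
        n_large + (n_small + small_per_worker - 1) / small_per_worker;
    return std::max<std::size_t>(1, std::min(wanted, max_workers));
  }
};
```

With something like this, a pause with only a handful of small deferred
objects would stay on one thread, while a SPECjbb-like pause with many
large object arrays would scale up to the configured maximum. Whether
the size is actually available at the point where the deferred address
is set depends on the refactoring mentioned above.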

> 
>>>
>>> While in SPECjbb according to your description it seems fairly clear
>>> that it is useful to parallelize every time with all resources as
>>> each/most of these objects cause lots of work, it seems to be
>>> disadvantageous to spin up lots of threads if they do not.
>>>
>>> I would be interested in the Deferred Updates timing changes for the
>>> other benchmarks. Maybe there is nothing to see here, but idk whether
>>> you looked only at overall scores for them or did some more detailed
>>> analysis.
>>>
>>
>> OK sure, I can collect more data for benchmarks other than SPECjbb.
> 
> Thanks, would be nice.

Thomas