RFR: ZGC: Add support for JFR leak profiler

Tue Dec 10 11:46:28 UTC 2019

Looks good.

I think it would be nice to retain/add a comment that we don't follow 
code cache roots:
https://cr.openjdk.java.net/~eosterlund/8235174/webrev.02/src/hotspot/share/jfr/leakprofiler/chains/rootSetClosure.cpp.udiff.html

We could probably include the misaligned code cache oops by creating 
temporary, aligned "surrogate" oop*s in, for example, a growable array. 
Maybe create a bug/rfe for that?

Thanks,
StefanK

On 2019-12-09 10:16, Erik Österlund wrote:
> Hi,
> 
> I renamed some functions from oops_do to weak_oops_do to follow our 
> naming conventions, and changed the filtering
> in the (now) weak_oops_do function to check if the raw oop is not null, 
> instead of is_dead() which now has a new
> meaning since last webrev.
> 
> Incremental:
> http://cr.openjdk.java.net/~eosterlund/8235174/webrev.01_02/
> 
> Full:
> http://cr.openjdk.java.net/~eosterlund/8235174/webrev.02/
> 
> This version has passed tier1-5.
> 
> Thanks,
> /Erik
> 
> On 2019-12-04 17:16, Erik Österlund wrote:
>> Hi Stefan,
>>
>> Thanks for the review.
>>
>> On 2019-12-04 14:55, Stefan Karlsson wrote:
>>> Hi Erik,
>>>
>>> This looks good to me!
>>
>> Thanks!
>>
>>> I think it would be good for the code to limit the number of 
>>> UnifedOopRef::addr calls, since those are unsafe ways into the the 
>>> addr and _could_ be abused to go around the UnifiedOopRef 
>>> encapsulation. Maybe something for the future.
>>
>> Yeah. I wrote that function because I noticed that the address was 
>> used with literally every single
>> address looking type I could imagine, in different places (uintptr_t, 
>> address, char*, oop*, void*,
>> you name it). So I just wrote the addr<T> function to be able to 
>> perform this change as mechanically
>> as possible.
>>
>>> Just a small nit:
>>>
>>> https://cr.openjdk.java.net/~eosterlund/8235174/webrev.00/src/hotspot/share/jfr/leakprofiler/checkpoint/eventEmitter.cpp.frames.html 
>>>
>>>
>>> Now that you've limited the use of object_addr, maybe you could also 
>>> limit its scope by moving
>>>
>>> 113   const oop* object_addr = sample->object_addr();
>>>
>>> to around:
>>>
>>> 123 edge = 
>>> edge_store->put(UnifiedOopRef::encode_in_native(object_addr));
>>
>> Absolutely.
>>
>> Here is a new webrev with your proposed change, and with some changes 
>> from Markus (big thanks to Markus for looking at this in great detail):
>> http://cr.openjdk.java.net/~eosterlund/8235174/webrev.01/
>>
>> Incremental:
>> http://cr.openjdk.java.net/~eosterlund/8235174/webrev.00_01/
>>
>> His changes mostly massage some types and order some includes. For 
>> example Markus prefers passing around UnifiedOopRef
>> as non-const (now that it is a POD, it's like passing around an integer).
>>
>> The most important change though, was fixing a bug in my patch by 
>> making sure that ObjectSampler::weak_oops_do
>> actually clears the oops that are dead. I thought it did, like the 
>> other weak_oops_do functions. Turns out
>> that it has not been needed until now. The oops were not cleared, yet 
>> the ObjectSample is reused as some
>> kind of cached object (presumably to call malloc less). The 
>> consequence was that when the ObjectSample gets subsequently reused 
>> with a dead oop in it, set_object() now using 
>> NativeAccess<ON_PHANTOM_OOP_REF>::oop_store
>> with my patch crashes things, because the G1 barrier backend SATB 
>> enqueues what was there in memory before
>> (which is dead garbage memory).
>>
>> Now one might argue that it's really silly of G1 barriers to SATB 
>> enqueue on non-strong stores (because the very
>> snapshot that SATB tracks is the strong graph, which is the graph 
>> being marked concurrently, nothing else). But I
>> would like to change that heuristic weirdness in a separate change. In 
>> this patch I am happy with the object simply
>> being cleared in weak_oops_do, like it is in all other weak_oops_do 
>> functions. That fixes the issue.
>>
>> Before there was a separate _dead boolean to track that the 
>> ObjectSample is dead. Now that the oop is cleared
>> by the GC when it dies, the boolean is no longer needed, because the 
>> is_dead() function can simply look at the oop.
>>
>> Thanks,
>> /Erik
>>
>>> Thanks,
>>> StefanK
>>>
>>> On 2019-12-02 11:09, erik.osterlund at oracle.com wrote:
>>>> Hi,
>>>>
>>>> The JFR leak profiler is currently not supported on ZGC, because it 
>>>> has not been accessorized appropriately.
>>>> This patch adds the required Access API sprinkling necessary, to 
>>>> enable this for ZGC. I did the following:
>>>>
>>>> 1) Renamed UnifiedOop to UnifiedOopRef, because it describes the 
>>>> reference to an oop, not the oop itself.
>>>> 2) Changed UnifiedOopRef to be a POD, instead of an AllStatic that 
>>>> passes around its encoded data as
>>>>    const oop* (even though it might be a tagged pointer to a narrow 
>>>> oop), improving type safty.
>>>> 3) Added another tag bit to UnifiedOopRef, allowing it to keep track 
>>>> of whether the described location is
>>>>    IN_NATIVE or IN_HEAP. The dereference function performs the 
>>>> appropriate NativeAccess/HeapAccess with the
>>>>    AS_NO_KEEPALIVE decotator, which is fine because the leak 
>>>> profiler does not create any edges to oops
>>>>    found during traversal.
>>>> 4) Removed code blob oop iteration, because it can't be supported 
>>>> appropriately by the current UnifiedOopRef
>>>>    mechanism (because oops are misaligned, and UnifiedOopRef 
>>>> requires tag bits - I believe this has never
>>>>    worked properly)
>>>> 6) Accessorized loads on strong roots to 
>>>> NativeAccess<AS_NO_KEEPALIVE>::oop_load. JFR leak profiler never
>>>>    creates new references to oops that are visited, so it does not 
>>>> need to keep things alive
>>>> 7) Explicitly told the oop iteration closures to not visit 
>>>> referents. The default referent strategy used
>>>>    before is DO_DISCOVERY, which doesn't seem right.
>>>> 8) Fixed a pre-existing issue where the array index reported in the 
>>>> leak chain was calculated incorrectly
>>>>    as the byte offset from the array base. It should be scaled by 
>>>> the element length of the array.
>>>>
>>>> Ran some JFR leak profiler tests, and some ad-hoc leak profiling in 
>>>> various applications. Also checked that
>>>> there were no observable perf regressions in the iterations, and 
>>>> there were not.
>>>>
>>>> Bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-8235174
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~eosterlund/8235174/webrev.00/
>>>>
>>>> Thanks,
>>>> /Erik
>>>
>>
>