[foreign] RFR 8224481: Optimize struct getter and field getter paths.

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Wed May 22 22:28:43 UTC 2019


I did some more analysis on the ClassValue issue and I'm now convinced 
that what we are doing is _not_ problematic.

What we really care about here is that, if we create a 
Reference.OfStruct for class Foo, we don't want the ClassValue we're 
using to cache that reference to prevent unloading of Foo. That is 
slightly different problem than the one described in [1]. There, the 
issue is that the storage associated with ClassValue (which lives inside 
Class objects) keeps growing indefinitively, in case where the computed 
values keep strong references to the ClassValue itself. This is due to 
the way in which ClassValue behaves.

A ClassValue is not an ordinary map - rather, when you call 'get' on a 
ClassValue with a given class C, you really ask the Class object for C 
for its ClassValue storage (a so called ClassValueMap abstraction). This 
map is essentially a WeakHashMap<Identity, Entry>, where Identity is a 
field of the ClassValue uniquely identifying it, whereas Entry contains 
the computed value associated with that class and ClassValue instance 
(the Entry class has a lot of extra complexity to deal with versioning, 
which is irrelevant here).

So, if the ClassValue instance goes away, the fact that we're using a 
WeakHashMap here, allows the map to shrink in size. Of course, for this 
to happen, you don't want to have a strong reference from Entry (that 
is, from the computed value) back to the ClassValue instance - as in 
doing so you will prevent collection of the WeakHashMap entries.

The bug in [1] shows that, when that happens, it is essentially possible 
to grow the WeakHashMap attached to a class object at will, until an 
OOME is produced.

But in our case we're not concerned with the fact that we keep adding 
multiple ClassValue to the _same_ class object; it's actually the 
opposite: we have a single ClassValue (in References.OfStruct), and many 
classes. In such a case, when the class goes away (because its 
classloader goes), it will just go away; there will be nothing 
preventing the collection of that class.

Attached is a test (with two files, Test.java and Dummy.java) - Test 
creates a new class loader, loads Dummy in it, and then stash a value 
for the Dummy class into a shared ClassValue. To make things as nasty as 
possible, the value we're storing has strong references to both the 
Dummy class and the ClassValue itself. But, as soon as the loader is 
closed, the finalizer is run as expected and memory usage remains under 
control.


So, popping back to our enhancements, I think what the patch does is 
legit. In terms of the code, I don't like how the code made OfStruct 
_not_ a Reference, and is instead using OfStruct as a holder for some 
helper functionalities, plus the cache, whereas the real reference is an 
anonymous class generated inside the computeValue() method.

It seems to me that we could have Reference.OfStruct keep being a 
Reference, have a constructor that takes a Class object, and then have a 
static ClassValue field in References which, upon computeValue creates a 
new instance of Reference.OfStruct for that class. I think the 
implementation would be a lot more linear that way (unless I'm missing 
something).

Cheers
Maurizio

[1] - https://bugs.openjdk.java.net/browse/JDK-8136353

On 22/05/2019 16:30, Maurizio Cimadamore wrote:
> Looks good - module pending questions on use of ClassValue.
>
> I think we should come up with some kind of test case that shows the 
> ClassValue issue and then test with different approaches
>
> Maurizio
>
> On 22/05/2019 16:09, Jorn Vernee wrote:
>> Coming back to this once more,
>>
>> I finally got my profiler working (after setting up a separate 
>> project) and saw a lot of time spent getting the field offset:
>>
>>  37.00%       c2, level 4 
>> jdk.internal.foreign.LayoutPaths$$Lambda$66.0x0000000800c05040::getAsLong, 
>> version 691
>>  30.19%       c2, level 4 
>> jdk.internal.foreign.RuntimeSupport::casterImpl, version 724
>>  22.12%       c2, level 4 
>> org.sample.generated.GetStruct_panama_get_fieldonly_jmhTest::panama_get_fieldonly_avgt_jmhStub, 
>> version 746
>>  ...
>>
>> i.e. the call to LayoutPath.offset() in RuntimeSupport::casterImpl 
>> can not be inlined, and we're re-computing the field offset over and 
>> over again.
>>
>> The fix for this is pretty simple; instead of passing the LayoutPath 
>> to the caster, we pre-compute the offset and then pass that. (This 
>> should be constant, right?).
>>
>> This yields some more speedup:
>>
>> Benchmark                        Mode  Cnt   Score   Error Units
>> GetStruct.jni_baseline           avgt   50  13.337 ▒ 0.251 ns/op
>> GetStruct.panama_get_both        avgt   50  17.026 ▒ 0.458 ns/op
>> GetStruct.panama_get_fieldonly   avgt   50   7.796 ▒ 0.166 ns/op
>> GetStruct.panama_get_structonly  avgt   50  11.863 ▒ 0.358 ns/op
>>
>> Putting us pretty much even with jni_baseline.
>>
>> Updated Webrev: 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.03/
>>
>> (Only changes are to RuntimeSupport)
>>
>> Cheers,
>> Jorn
>>
>> Jorn Vernee schreef op 2019-05-22 12:51:
>>> Ah, good point.
>>>
>>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>>>> ClassValue<MH>
>>>
>>> I don't think that last link is quite right though. The LayoutType
>>> references the anonymous Reference class, not References.OfStruct
>>> (which contains the ClassValue).
>>>
>>> I think it would be:
>>>
>>> User Code -> LayoutType -> anonymous Reference -> getter MH ->
>>> StructImpl -> LayoutType
>>>
>>> There could still be a cycle there, but the whole cycle can be GC'd
>>> once the reference from user code goes away.
>>>
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-05-22 12:37:
>>>> Looks good - yesterday I was looking at this discussion:
>>>>
>>>> http://mail.openjdk.java.net/pipermail/mlvm-dev/2016-January/006563.html 
>>>>
>>>>
>>>> I hope we don't run in the condition described there - e.g. that
>>>> there's no strong reachability from the MH we're caching back to the
>>>> static ClassValue instance - because, if that would be the case I
>>>> think that would prevent class unloading.
>>>>
>>>> The problem is that the MethodHandle we cache refers to the stuct impl
>>>> class, and I believe that class refers to some LayoutTypes on its own,
>>>> which have a Reference inside, so it would be:
>>>>
>>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>>>> ClassValue<MH>
>>>>
>>>> Sundar can you double check?
>>>>
>>>> Maurizio
>>>>
>>>> On 22/05/2019 10:56, Jorn Vernee wrote:
>>>>> Good suggestion! This solves the problem, is nice and simple, and 
>>>>> keeps the same times in the benchmark.
>>>>>
>>>>> Updated webrev: 
>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.02/
>>>>>
>>>>> (only changes to References.java)
>>>>>
>>>>> I've added a test for the failure. I think that can be included as 
>>>>> well? I re-ran the samples I have as well, and this time it's all 
>>>>> green.
>>>>>
>>>>> Thanks,
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-05-22 01:15:
>>>>>> On 21/05/2019 20:16, Jorn Vernee wrote:
>>>>>>> Although, now that you bring it up, I tried re-running some of 
>>>>>>> the samples (hadn't done that yet), and I'm seeing some infinite 
>>>>>>> recursion. This is seemingly caused by a circular type reference 
>>>>>>> (e.g. linked list). i.e. to spin the impl of an accessor we need 
>>>>>>> the LayoutType of the struct itself, which then tries to spin 
>>>>>>> the impl again, and so on. I guess this isn't a test case in our 
>>>>>>> suite yet...
>>>>>>>
>>>>>>> I'll look into this.
>>>>>>
>>>>>> Good detective work! I guess it would make sense to try and 
>>>>>> reduce it
>>>>>> down to a simpler test, and push the test first.
>>>>>>
>>>>>> Where I was going with this is - your patch effectively made the 
>>>>>> lazy
>>>>>> resolution inside StructImplGenerator useless. If we really want to
>>>>>> explore that option, then we should, I think, remove all lazy
>>>>>> resolution sites and see what happens. It is possible that we don't
>>>>>> rely so much on laziness as we did in the past (we did some fixes 
>>>>>> few
>>>>>> months ago which stabilized resolution quite a bit) - in which 
>>>>>> case we
>>>>>> can remove the resolution requests, although - I have to admit - 
>>>>>> I'm a
>>>>>> bit skeptical. After all all you need it something like this (as you
>>>>>> say):
>>>>>>
>>>>>> struct foo {
>>>>>>     struct foo *next;
>>>>>> }
>>>>>>
>>>>>> Which is kind of the killer app for unresolved layouts in the 
>>>>>> first place.
>>>>>>
>>>>>> This is translated into a struct interface which has a getter of
>>>>>> Pointer<foo>. To generate the getter you need to compute its
>>>>>> LayoutType which is a pointer LayoutType, so you have to compute the
>>>>>> pointee LayoutType which brings you back where you started (the 
>>>>>> whole
>>>>>> 'foo' LayoutType). In other words, since now the creation of
>>>>>> LayoutType<foo> requires the generation of the struct impl for 'foo'
>>>>>> and since that depends (indirectly, through the pointer getter) on
>>>>>> being able to produce a LayoutType<foo>, you get a circularity.
>>>>>>
>>>>>> One thing we could try is - instead of eagerly creating the struct
>>>>>> impl, why don't we let the Reference.OfStruct having some mutable
>>>>>> state in it? That is, we could start off with Reference getter which
>>>>>> does the expensive refelective lookup - but then, once it has
>>>>>> discovered the constructor MH, it can stash it in some field 
>>>>>> (which is
>>>>>> private to that reference object) and use it later if the getter is
>>>>>> used again. Then, you probably still need a ClassValue to stash a
>>>>>> mapping between a Class and its Reference.OfStruct; but it seems 
>>>>>> like
>>>>>> this could fit in more naturally?
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jorn
>>>>>>>
>>>>>>> Jorn Vernee schreef op 2019-05-21 21:06:
>>>>>>>> Since we have the resolution context for NativeHeader, AFAIK 
>>>>>>>> there is
>>>>>>>> no more difference between the resolution call done by
>>>>>>>> StructImpleGenerator, and the one done by 
>>>>>>>> LayoutTypeImpl.ofStruct. So
>>>>>>>> I don't think there are any more cases where we would have 
>>>>>>>> succeeded
>>>>>>>> to resolve the Struct layout be delaying spinning the impl. At 
>>>>>>>> least
>>>>>>>> the tests haven't caught such a case.
>>>>>>>>
>>>>>>>> The other thing is that the partial layout for the getter is 
>>>>>>>> caught in
>>>>>>>> StructImplGenerator, but for the setter it's caught when calling
>>>>>>>> bitSize on Unresolved. Saying layouts should be able to be 
>>>>>>>> resolved
>>>>>>>> when calling LayoutType.ofStruct means we can use 
>>>>>>>> References.OfGrumpy,
>>>>>>>> which makes the two more uniform.
>>>>>>>>
>>>>>>>> I have some ideas for keeping the lazy init semantics, but it's 
>>>>>>>> a bit
>>>>>>>> more complex (using a MutableCallSite to mimic indy), and I'm 
>>>>>>>> not sure
>>>>>>>> it will work as well.
>>>>>>>>
>>>>>>>> And, well, there was some talk about eagerly spinning the
>>>>>>>> implementations any ways :)
>>>>>>>>
>>>>>>>> Jorn
>>>>>>>>
>>>>>>>> Maurizio Cimadamore schreef op 2019-05-21 20:09:
>>>>>>>>> Looks good, although I'm a bit worried about the change in 
>>>>>>>>> semantics
>>>>>>>>> w.r.t. eager instantiation. The binder will create a lot of
>>>>>>>>> LayoutTypes when generating the implementation - I wonder 
>>>>>>>>> there were
>>>>>>>>> cases before where we created a partial layout type, which 
>>>>>>>>> then got
>>>>>>>>> resolved correctly by the time it was dereferenced (since we do
>>>>>>>>> another resolve lazily in StructImplGenerator [1]).
>>>>>>>>>
>>>>>>>>> [1] -
>>>>>>>>> http://hg.openjdk.java.net/panama/dev/file/5ea3089be5ac/src/java.base/share/classes/jdk/internal/foreign/StructImplGenerator.java#l52 
>>>>>>>>> On 21/05/2019 14:41, Jorn Vernee wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> After the recent string of benchmarking [1], I've arrived at 
>>>>>>>>>> 2 optimizations to improve the speed of the measured code path.
>>>>>>>>>>
>>>>>>>>>> 1.) Specialization of Struct getter MethodHandles per struct 
>>>>>>>>>> class.
>>>>>>>>>> 2.) Implementation of RuntimeSupport::casterImpl that does a 
>>>>>>>>>> fused cast and offset operation, to avoid creating multiple 
>>>>>>>>>> Pointer objects.
>>>>>>>>>>
>>>>>>>>>> The benchmark: 
>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/bench/webrev.00/
>>>>>>>>>> The optimizations: 
>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.00/
>>>>>>>>>>
>>>>>>>>>> I've split these into 2 so that it's easier to run the 
>>>>>>>>>> benchmarks with and without the optimizations. (benchmark 
>>>>>>>>>> uses the OpenJDK's builtin framework [2]).
>>>>>>>>>>
>>>>>>>>>> Since we're now more eagerly instantiating the struct impl 
>>>>>>>>>> class I had to work around partial struct types, since 
>>>>>>>>>> spinning the impl requires a non-partial type and now we're 
>>>>>>>>>> spinning the impl when creating the LayouType for the struct, 
>>>>>>>>>> as opposed to on the first dereference. To do this I'm 
>>>>>>>>>> detecting whether the struct is partial in 
>>>>>>>>>> LayoutType.ofStruct, and using a Reference.OfGrumpy in the 
>>>>>>>>>> case where it can not be resolved. Tbh, I think this makes 
>>>>>>>>>> things a little more clear as well as far as where/how the 
>>>>>>>>>> exception for deref of a partial type is thrown.
>>>>>>>>>>
>>>>>>>>>> Results on my machine before the optimization are:
>>>>>>>>>>
>>>>>>>>>> Benchmark                        Mode  Cnt Score Error Units
>>>>>>>>>> GetStruct.jni_baseline           avgt   50 14.204 ▒ 0.566 ns/op
>>>>>>>>>> GetStruct.panama_get_both        avgt   50 507.638 ▒ 19.462 
>>>>>>>>>> ns/op
>>>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50 90.236 ▒ 11.027 ns/op
>>>>>>>>>> GetStruct.panama_get_structonly  avgt   50 370.783 ▒ 13.744 
>>>>>>>>>> ns/op
>>>>>>>>>>
>>>>>>>>>> And after:
>>>>>>>>>>
>>>>>>>>>> Benchmark                        Mode  Cnt Score Error Units
>>>>>>>>>> GetStruct.jni_baseline           avgt   50 13.941 ▒ 0.485 ns/op
>>>>>>>>>> GetStruct.panama_get_both        avgt   50 41.199 ▒ 1.632 ns/op
>>>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50 33.432 ▒ 1.889 ns/op
>>>>>>>>>> GetStruct.panama_get_structonly  avgt   50 13.469 ▒ 0.781 ns/op
>>>>>>>>>>
>>>>>>>>>> Where panama_get_structonly corresponds to 1., and 
>>>>>>>>>> panama_get_fieldonly corresponds to 2. For a total of about 
>>>>>>>>>> 12x speedup.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Jorn
>>>>>>>>>>
>>>>>>>>>> [1] : 
>>>>>>>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-May/005469.html 
>>>>>>>>>> [2] : https://openjdk.java.net/jeps/230


More information about the panama-dev mailing list