[foreign] RFR 8224481: Optimize struct getter and field getter paths.
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu May 23 13:20:08 UTC 2019
I like the new webrev - only minor quibble on naming - that is, the
'getter' field and the 'makeGetter' method would probably be better
named as 'specializedGetter' and
'specializeGetter'/'makeSpecializedGetter' respectively, to carry more
meaning.
No need for another review if you decide to go for the name change.
Cheers
Maurizio
On 23/05/2019 13:38, Jorn Vernee wrote:
> Response inline....
>
> Maurizio Cimadamore schreef op 2019-05-23 00:28:
>> I did some more analysis on the ClassValue issue and I'm now convinced
>> that what we are doing is _not_ problematic.
>>
>> What we really care about here is that, if we create a
>> Reference.OfStruct for class Foo, we don't want the ClassValue we're
>> using to cache that reference to prevent unloading of Foo. That is
>> slightly different problem than the one described in [1]. There, the
>> issue is that the storage associated with ClassValue (which lives
>> inside Class objects) keeps growing indefinitively, in case where the
>> computed values keep strong references to the ClassValue itself. This
>> is due to the way in which ClassValue behaves.
>>
>> A ClassValue is not an ordinary map - rather, when you call 'get' on a
>> ClassValue with a given class C, you really ask the Class object for C
>> for its ClassValue storage (a so called ClassValueMap abstraction).
>> This map is essentially a WeakHashMap<Identity, Entry>, where Identity
>> is a field of the ClassValue uniquely identifying it, whereas Entry
>> contains the computed value associated with that class and ClassValue
>> instance (the Entry class has a lot of extra complexity to deal with
>> versioning, which is irrelevant here).
>>
>> So, if the ClassValue instance goes away, the fact that we're using a
>> WeakHashMap here, allows the map to shrink in size. Of course, for
>> this to happen, you don't want to have a strong reference from Entry
>> (that is, from the computed value) back to the ClassValue instance -
>> as in doing so you will prevent collection of the WeakHashMap entries.
>>
>> The bug in [1] shows that, when that happens, it is essentially
>> possible to grow the WeakHashMap attached to a class object at will,
>> until an OOME is produced.
>>
>> But in our case we're not concerned with the fact that we keep adding
>> multiple ClassValue to the _same_ class object; it's actually the
>> opposite: we have a single ClassValue (in References.OfStruct), and
>> many classes. In such a case, when the class goes away (because its
>> classloader goes), it will just go away; there will be nothing
>> preventing the collection of that class.
>>
>> Attached is a test (with two files, Test.java and Dummy.java) - Test
>> creates a new class loader, loads Dummy in it, and then stash a value
>> for the Dummy class into a shared ClassValue. To make things as nasty
>> as possible, the value we're storing has strong references to both the
>> Dummy class and the ClassValue itself. But, as soon as the loader is
>> closed, the finalizer is run as expected and memory usage remains
>> under control.
>
> Thanks for the extensive research, and for explaining it! It's good to
> hear that using ClassValue won't be an issue for us.
>
> I tried out the test, and I'm also seeing the finalizer being run.
>
>> So, popping back to our enhancements, I think what the patch does is
>> legit. In terms of the code, I don't like how the code made OfStruct
>> _not_ a Reference, and is instead using OfStruct as a holder for some
>> helper functionalities, plus the cache, whereas the real reference is
>> an anonymous class generated inside the computeValue() method.
>>
>> It seems to me that we could have Reference.OfStruct keep being a
>> Reference, have a constructor that takes a Class object, and then have
>> a static ClassValue field in References which, upon computeValue
>> creates a new instance of Reference.OfStruct for that class. I think
>> the implementation would be a lot more linear that way (unless I'm
>> missing something).
>
> Yeah, I think doing that would make more sense. It would also help
> show what fields a struct Reference actually has.
>
> I've also added @Stable to the MethodHandle field (as suggested in
> your other email) and re-ran the benchmark, but did not see an obvious
> performance increase. I looked at the profile for
> `panama_get_structonly`, but nothing really stands out to me:
>
> 30.37% c2, level 4
> org.sample.generated.GetStruct_panama_get_structonly_jmhTest::panama_get_structonly_avgt_jmhStub,
> version 746
> 25.90% Unknown, level 0 java.lang.invoke.MethodHandle::invokeBasic,
> version 102
> 16.95% c2, level 4
> java.lang.invoke.LambdaForm$MH.0x0000000800c0a840::invoke, version 713
> 13.50% c2, level 4 org.sample.GetStruct::panama_get_structonly,
> version 711
>
> It looks like most time is spent on JMH overhead.
>
> Updated webrev with your suggestions:
> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.04/
>
> (Only changes to References)
>
> Jorn
>
>> Cheers
>> Maurizio
>>
>> [1] - https://bugs.openjdk.java.net/browse/JDK-8136353
>>
>> On 22/05/2019 16:30, Maurizio Cimadamore wrote:
>>> Looks good - module pending questions on use of ClassValue.
>>>
>>> I think we should come up with some kind of test case that shows the
>>> ClassValue issue and then test with different approaches
>>>
>>> Maurizio
>>>
>>> On 22/05/2019 16:09, Jorn Vernee wrote:
>>>> Coming back to this once more,
>>>>
>>>> I finally got my profiler working (after setting up a separate
>>>> project) and saw a lot of time spent getting the field offset:
>>>>
>>>> 37.00% c2, level 4
>>>> jdk.internal.foreign.LayoutPaths$$Lambda$66.0x0000000800c05040::getAsLong,
>>>> version 691
>>>> 30.19% c2, level 4
>>>> jdk.internal.foreign.RuntimeSupport::casterImpl, version 724
>>>> 22.12% c2, level 4
>>>> org.sample.generated.GetStruct_panama_get_fieldonly_jmhTest::panama_get_fieldonly_avgt_jmhStub,
>>>> version 746
>>>> ...
>>>>
>>>> i.e. the call to LayoutPath.offset() in RuntimeSupport::casterImpl
>>>> can not be inlined, and we're re-computing the field offset over
>>>> and over again.
>>>>
>>>> The fix for this is pretty simple; instead of passing the
>>>> LayoutPath to the caster, we pre-compute the offset and then pass
>>>> that. (This should be constant, right?).
>>>>
>>>> This yields some more speedup:
>>>>
>>>> Benchmark Mode Cnt Score Error Units
>>>> GetStruct.jni_baseline avgt 50 13.337 ▒ 0.251 ns/op
>>>> GetStruct.panama_get_both avgt 50 17.026 ▒ 0.458 ns/op
>>>> GetStruct.panama_get_fieldonly avgt 50 7.796 ▒ 0.166 ns/op
>>>> GetStruct.panama_get_structonly avgt 50 11.863 ▒ 0.358 ns/op
>>>>
>>>> Putting us pretty much even with jni_baseline.
>>>>
>>>> Updated Webrev:
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.03/
>>>>
>>>> (Only changes are to RuntimeSupport)
>>>>
>>>> Cheers,
>>>> Jorn
>>>>
>>>> Jorn Vernee schreef op 2019-05-22 12:51:
>>>>> Ah, good point.
>>>>>
>>>>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference ->
>>>>>> ClassValue<MH>
>>>>>
>>>>> I don't think that last link is quite right though. The LayoutType
>>>>> references the anonymous Reference class, not References.OfStruct
>>>>> (which contains the ClassValue).
>>>>>
>>>>> I think it would be:
>>>>>
>>>>> User Code -> LayoutType -> anonymous Reference -> getter MH ->
>>>>> StructImpl -> LayoutType
>>>>>
>>>>> There could still be a cycle there, but the whole cycle can be GC'd
>>>>> once the reference from user code goes away.
>>>>>
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-05-22 12:37:
>>>>>> Looks good - yesterday I was looking at this discussion:
>>>>>>
>>>>>> http://mail.openjdk.java.net/pipermail/mlvm-dev/2016-January/006563.html
>>>>>> I hope we don't run in the condition described there - e.g. that
>>>>>> there's no strong reachability from the MH we're caching back to the
>>>>>> static ClassValue instance - because, if that would be the case I
>>>>>> think that would prevent class unloading.
>>>>>>
>>>>>> The problem is that the MethodHandle we cache refers to the stuct
>>>>>> impl
>>>>>> class, and I believe that class refers to some LayoutTypes on its
>>>>>> own,
>>>>>> which have a Reference inside, so it would be:
>>>>>>
>>>>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference ->
>>>>>> ClassValue<MH>
>>>>>>
>>>>>> Sundar can you double check?
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> On 22/05/2019 10:56, Jorn Vernee wrote:
>>>>>>> Good suggestion! This solves the problem, is nice and simple,
>>>>>>> and keeps the same times in the benchmark.
>>>>>>>
>>>>>>> Updated webrev:
>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.02/
>>>>>>>
>>>>>>> (only changes to References.java)
>>>>>>>
>>>>>>> I've added a test for the failure. I think that can be included
>>>>>>> as well? I re-ran the samples I have as well, and this time it's
>>>>>>> all green.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jorn
>>>>>>>
>>>>>>> Maurizio Cimadamore schreef op 2019-05-22 01:15:
>>>>>>>> On 21/05/2019 20:16, Jorn Vernee wrote:
>>>>>>>>> Although, now that you bring it up, I tried re-running some of
>>>>>>>>> the samples (hadn't done that yet), and I'm seeing some
>>>>>>>>> infinite recursion. This is seemingly caused by a circular
>>>>>>>>> type reference (e.g. linked list). i.e. to spin the impl of an
>>>>>>>>> accessor we need the LayoutType of the struct itself, which
>>>>>>>>> then tries to spin the impl again, and so on. I guess this
>>>>>>>>> isn't a test case in our suite yet...
>>>>>>>>>
>>>>>>>>> I'll look into this.
>>>>>>>>
>>>>>>>> Good detective work! I guess it would make sense to try and
>>>>>>>> reduce it
>>>>>>>> down to a simpler test, and push the test first.
>>>>>>>>
>>>>>>>> Where I was going with this is - your patch effectively made
>>>>>>>> the lazy
>>>>>>>> resolution inside StructImplGenerator useless. If we really
>>>>>>>> want to
>>>>>>>> explore that option, then we should, I think, remove all lazy
>>>>>>>> resolution sites and see what happens. It is possible that we
>>>>>>>> don't
>>>>>>>> rely so much on laziness as we did in the past (we did some
>>>>>>>> fixes few
>>>>>>>> months ago which stabilized resolution quite a bit) - in which
>>>>>>>> case we
>>>>>>>> can remove the resolution requests, although - I have to admit
>>>>>>>> - I'm a
>>>>>>>> bit skeptical. After all all you need it something like this
>>>>>>>> (as you
>>>>>>>> say):
>>>>>>>>
>>>>>>>> struct foo {
>>>>>>>> struct foo *next;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Which is kind of the killer app for unresolved layouts in the
>>>>>>>> first place.
>>>>>>>>
>>>>>>>> This is translated into a struct interface which has a getter of
>>>>>>>> Pointer<foo>. To generate the getter you need to compute its
>>>>>>>> LayoutType which is a pointer LayoutType, so you have to
>>>>>>>> compute the
>>>>>>>> pointee LayoutType which brings you back where you started (the
>>>>>>>> whole
>>>>>>>> 'foo' LayoutType). In other words, since now the creation of
>>>>>>>> LayoutType<foo> requires the generation of the struct impl for
>>>>>>>> 'foo'
>>>>>>>> and since that depends (indirectly, through the pointer getter) on
>>>>>>>> being able to produce a LayoutType<foo>, you get a circularity.
>>>>>>>>
>>>>>>>> One thing we could try is - instead of eagerly creating the struct
>>>>>>>> impl, why don't we let the Reference.OfStruct having some mutable
>>>>>>>> state in it? That is, we could start off with Reference getter
>>>>>>>> which
>>>>>>>> does the expensive refelective lookup - but then, once it has
>>>>>>>> discovered the constructor MH, it can stash it in some field
>>>>>>>> (which is
>>>>>>>> private to that reference object) and use it later if the
>>>>>>>> getter is
>>>>>>>> used again. Then, you probably still need a ClassValue to stash a
>>>>>>>> mapping between a Class and its Reference.OfStruct; but it
>>>>>>>> seems like
>>>>>>>> this could fit in more naturally?
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jorn
>>>>>>>>>
>>>>>>>>> Jorn Vernee schreef op 2019-05-21 21:06:
>>>>>>>>>> Since we have the resolution context for NativeHeader, AFAIK
>>>>>>>>>> there is
>>>>>>>>>> no more difference between the resolution call done by
>>>>>>>>>> StructImpleGenerator, and the one done by
>>>>>>>>>> LayoutTypeImpl.ofStruct. So
>>>>>>>>>> I don't think there are any more cases where we would have
>>>>>>>>>> succeeded
>>>>>>>>>> to resolve the Struct layout be delaying spinning the impl.
>>>>>>>>>> At least
>>>>>>>>>> the tests haven't caught such a case.
>>>>>>>>>>
>>>>>>>>>> The other thing is that the partial layout for the getter is
>>>>>>>>>> caught in
>>>>>>>>>> StructImplGenerator, but for the setter it's caught when calling
>>>>>>>>>> bitSize on Unresolved. Saying layouts should be able to be
>>>>>>>>>> resolved
>>>>>>>>>> when calling LayoutType.ofStruct means we can use
>>>>>>>>>> References.OfGrumpy,
>>>>>>>>>> which makes the two more uniform.
>>>>>>>>>>
>>>>>>>>>> I have some ideas for keeping the lazy init semantics, but
>>>>>>>>>> it's a bit
>>>>>>>>>> more complex (using a MutableCallSite to mimic indy), and I'm
>>>>>>>>>> not sure
>>>>>>>>>> it will work as well.
>>>>>>>>>>
>>>>>>>>>> And, well, there was some talk about eagerly spinning the
>>>>>>>>>> implementations any ways :)
>>>>>>>>>>
>>>>>>>>>> Jorn
>>>>>>>>>>
>>>>>>>>>> Maurizio Cimadamore schreef op 2019-05-21 20:09:
>>>>>>>>>>> Looks good, although I'm a bit worried about the change in
>>>>>>>>>>> semantics
>>>>>>>>>>> w.r.t. eager instantiation. The binder will create a lot of
>>>>>>>>>>> LayoutTypes when generating the implementation - I wonder
>>>>>>>>>>> there were
>>>>>>>>>>> cases before where we created a partial layout type, which
>>>>>>>>>>> then got
>>>>>>>>>>> resolved correctly by the time it was dereferenced (since we do
>>>>>>>>>>> another resolve lazily in StructImplGenerator [1]).
>>>>>>>>>>>
>>>>>>>>>>> [1] -
>>>>>>>>>>> http://hg.openjdk.java.net/panama/dev/file/5ea3089be5ac/src/java.base/share/classes/jdk/internal/foreign/StructImplGenerator.java#l52
>>>>>>>>>>> On 21/05/2019 14:41, Jorn Vernee wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> After the recent string of benchmarking [1], I've arrived
>>>>>>>>>>>> at 2 optimizations to improve the speed of the measured
>>>>>>>>>>>> code path.
>>>>>>>>>>>>
>>>>>>>>>>>> 1.) Specialization of Struct getter MethodHandles per
>>>>>>>>>>>> struct class.
>>>>>>>>>>>> 2.) Implementation of RuntimeSupport::casterImpl that does
>>>>>>>>>>>> a fused cast and offset operation, to avoid creating
>>>>>>>>>>>> multiple Pointer objects.
>>>>>>>>>>>>
>>>>>>>>>>>> The benchmark:
>>>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/bench/webrev.00/
>>>>>>>>>>>> The optimizations:
>>>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.00/
>>>>>>>>>>>>
>>>>>>>>>>>> I've split these into 2 so that it's easier to run the
>>>>>>>>>>>> benchmarks with and without the optimizations. (benchmark
>>>>>>>>>>>> uses the OpenJDK's builtin framework [2]).
>>>>>>>>>>>>
>>>>>>>>>>>> Since we're now more eagerly instantiating the struct impl
>>>>>>>>>>>> class I had to work around partial struct types, since
>>>>>>>>>>>> spinning the impl requires a non-partial type and now we're
>>>>>>>>>>>> spinning the impl when creating the LayouType for the
>>>>>>>>>>>> struct, as opposed to on the first dereference. To do this
>>>>>>>>>>>> I'm detecting whether the struct is partial in
>>>>>>>>>>>> LayoutType.ofStruct, and using a Reference.OfGrumpy in the
>>>>>>>>>>>> case where it can not be resolved. Tbh, I think this makes
>>>>>>>>>>>> things a little more clear as well as far as where/how the
>>>>>>>>>>>> exception for deref of a partial type is thrown.
>>>>>>>>>>>>
>>>>>>>>>>>> Results on my machine before the optimization are:
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark Mode Cnt Score Error Units
>>>>>>>>>>>> GetStruct.jni_baseline avgt 50 14.204 ▒ 0.566
>>>>>>>>>>>> ns/op
>>>>>>>>>>>> GetStruct.panama_get_both avgt 50 507.638 ▒ 19.462
>>>>>>>>>>>> ns/op
>>>>>>>>>>>> GetStruct.panama_get_fieldonly avgt 50 90.236 ▒ 11.027
>>>>>>>>>>>> ns/op
>>>>>>>>>>>> GetStruct.panama_get_structonly avgt 50 370.783 ▒ 13.744
>>>>>>>>>>>> ns/op
>>>>>>>>>>>>
>>>>>>>>>>>> And after:
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark Mode Cnt Score Error Units
>>>>>>>>>>>> GetStruct.jni_baseline avgt 50 13.941 ▒ 0.485
>>>>>>>>>>>> ns/op
>>>>>>>>>>>> GetStruct.panama_get_both avgt 50 41.199 ▒ 1.632
>>>>>>>>>>>> ns/op
>>>>>>>>>>>> GetStruct.panama_get_fieldonly avgt 50 33.432 ▒ 1.889
>>>>>>>>>>>> ns/op
>>>>>>>>>>>> GetStruct.panama_get_structonly avgt 50 13.469 ▒ 0.781
>>>>>>>>>>>> ns/op
>>>>>>>>>>>>
>>>>>>>>>>>> Where panama_get_structonly corresponds to 1., and
>>>>>>>>>>>> panama_get_fieldonly corresponds to 2. For a total of about
>>>>>>>>>>>> 12x speedup.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Jorn
>>>>>>>>>>>>
>>>>>>>>>>>> [1] :
>>>>>>>>>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-May/005469.html
>>>>>>>>>>>> [2] : https://openjdk.java.net/jeps/230
More information about the panama-dev
mailing list