[foreign] RFR 8224481: Optimize struct getter and field getter paths.

Thu May 23 14:13:23 UTC 2019

Thanks for the reviews! I went with 'specializedGetter' and 
'makeSpecializedGetter' (the latter to be more distinct from the 
former), and pushed.

Cheers,
Jorn

Maurizio Cimadamore schreef op 2019-05-23 15:20:
> I like the new webrev - only minor quibble on naming - that is, the
> 'getter' field and the 'makeGetter' method would probably be better
> named as 'specializedGetter' and
> 'specializeGetter'/'makeSpecializedGetter' respectively, to carry more
> meaning.
> 
> No need for another review if you decide to go for the name change.
> 
> Cheers
> Maurizio
> 
> On 23/05/2019 13:38, Jorn Vernee wrote:
>> Response inline....
>> 
>> Maurizio Cimadamore schreef op 2019-05-23 00:28:
>>> I did some more analysis on the ClassValue issue and I'm now 
>>> convinced
>>> that what we are doing is _not_ problematic.
>>> 
>>> What we really care about here is that, if we create a
>>> Reference.OfStruct for class Foo, we don't want the ClassValue we're
>>> using to cache that reference to prevent unloading of Foo. That is
>>> slightly different problem than the one described in [1]. There, the
>>> issue is that the storage associated with ClassValue (which lives
>>> inside Class objects) keeps growing indefinitively, in case where the
>>> computed values keep strong references to the ClassValue itself. This
>>> is due to the way in which ClassValue behaves.
>>> 
>>> A ClassValue is not an ordinary map - rather, when you call 'get' on 
>>> a
>>> ClassValue with a given class C, you really ask the Class object for 
>>> C
>>> for its ClassValue storage (a so called ClassValueMap abstraction).
>>> This map is essentially a WeakHashMap<Identity, Entry>, where 
>>> Identity
>>> is a field of the ClassValue uniquely identifying it, whereas Entry
>>> contains the computed value associated with that class and ClassValue
>>> instance (the Entry class has a lot of extra complexity to deal with
>>> versioning, which is irrelevant here).
>>> 
>>> So, if the ClassValue instance goes away, the fact that we're using a
>>> WeakHashMap here, allows the map to shrink in size. Of course, for
>>> this to happen, you don't want to have a strong reference from Entry
>>> (that is, from the computed value) back to the ClassValue instance -
>>> as in doing so you will prevent collection of the WeakHashMap 
>>> entries.
>>> 
>>> The bug in [1] shows that, when that happens, it is essentially
>>> possible to grow the WeakHashMap attached to a class object at will,
>>> until an OOME is produced.
>>> 
>>> But in our case we're not concerned with the fact that we keep adding
>>> multiple ClassValue to the _same_ class object; it's actually the
>>> opposite: we have a single ClassValue (in References.OfStruct), and
>>> many classes. In such a case, when the class goes away (because its
>>> classloader goes), it will just go away; there will be nothing
>>> preventing the collection of that class.
>>> 
>>> Attached is a test (with two files, Test.java and Dummy.java) - Test
>>> creates a new class loader, loads Dummy in it, and then stash a value
>>> for the Dummy class into a shared ClassValue. To make things as nasty
>>> as possible, the value we're storing has strong references to both 
>>> the
>>> Dummy class and the ClassValue itself. But, as soon as the loader is
>>> closed, the finalizer is run as expected and memory usage remains
>>> under control.
>> 
>> Thanks for the extensive research, and for explaining it! It's good to 
>> hear that using ClassValue won't be an issue for us.
>> 
>> I tried out the test, and I'm also seeing the finalizer being run.
>> 
>>> So, popping back to our enhancements, I think what the patch does is
>>> legit. In terms of the code, I don't like how the code made OfStruct
>>> _not_ a Reference, and is instead using OfStruct as a holder for some
>>> helper functionalities, plus the cache, whereas the real reference is
>>> an anonymous class generated inside the computeValue() method.
>>> 
>>> It seems to me that we could have Reference.OfStruct keep being a
>>> Reference, have a constructor that takes a Class object, and then 
>>> have
>>> a static ClassValue field in References which, upon computeValue
>>> creates a new instance of Reference.OfStruct for that class. I think
>>> the implementation would be a lot more linear that way (unless I'm
>>> missing something).
>> 
>> Yeah, I think doing that would make more sense. It would also help 
>> show what fields a struct Reference actually has.
>> 
>> I've also added @Stable to the MethodHandle field (as suggested in 
>> your other email) and re-ran the benchmark, but did not see an obvious 
>> performance increase. I looked at the profile for 
>> `panama_get_structonly`, but nothing really stands out to me:
>> 
>>  30.37%       c2, level 4 
>> org.sample.generated.GetStruct_panama_get_structonly_jmhTest::panama_get_structonly_avgt_jmhStub, 
>> version 746
>>  25.90%  Unknown, level 0 java.lang.invoke.MethodHandle::invokeBasic, 
>> version 102
>>  16.95%       c2, level 4 
>> java.lang.invoke.LambdaForm$MH.0x0000000800c0a840::invoke, version 713
>>  13.50%       c2, level 4 org.sample.GetStruct::panama_get_structonly, 
>> version 711
>> 
>> It looks like most time is spent on JMH overhead.
>> 
>> Updated webrev with your suggestions: 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.04/
>> 
>> (Only changes to References)
>> 
>> Jorn
>> 
>>> Cheers
>>> Maurizio
>>> 
>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8136353
>>> 
>>> On 22/05/2019 16:30, Maurizio Cimadamore wrote:
>>>> Looks good - module pending questions on use of ClassValue.
>>>> 
>>>> I think we should come up with some kind of test case that shows the 
>>>> ClassValue issue and then test with different approaches
>>>> 
>>>> Maurizio
>>>> 
>>>> On 22/05/2019 16:09, Jorn Vernee wrote:
>>>>> Coming back to this once more,
>>>>> 
>>>>> I finally got my profiler working (after setting up a separate 
>>>>> project) and saw a lot of time spent getting the field offset:
>>>>> 
>>>>>  37.00%       c2, level 4 
>>>>> jdk.internal.foreign.LayoutPaths$$Lambda$66.0x0000000800c05040::getAsLong, 
>>>>> version 691
>>>>>  30.19%       c2, level 4 
>>>>> jdk.internal.foreign.RuntimeSupport::casterImpl, version 724
>>>>>  22.12%       c2, level 4 
>>>>> org.sample.generated.GetStruct_panama_get_fieldonly_jmhTest::panama_get_fieldonly_avgt_jmhStub, 
>>>>> version 746
>>>>>  ...
>>>>> 
>>>>> i.e. the call to LayoutPath.offset() in RuntimeSupport::casterImpl 
>>>>> can not be inlined, and we're re-computing the field offset over 
>>>>> and over again.
>>>>> 
>>>>> The fix for this is pretty simple; instead of passing the 
>>>>> LayoutPath to the caster, we pre-compute the offset and then pass 
>>>>> that. (This should be constant, right?).
>>>>> 
>>>>> This yields some more speedup:
>>>>> 
>>>>> Benchmark                        Mode  Cnt   Score   Error Units
>>>>> GetStruct.jni_baseline           avgt   50  13.337 ▒ 0.251 ns/op
>>>>> GetStruct.panama_get_both        avgt   50  17.026 ▒ 0.458 ns/op
>>>>> GetStruct.panama_get_fieldonly   avgt   50   7.796 ▒ 0.166 ns/op
>>>>> GetStruct.panama_get_structonly  avgt   50  11.863 ▒ 0.358 ns/op
>>>>> 
>>>>> Putting us pretty much even with jni_baseline.
>>>>> 
>>>>> Updated Webrev: 
>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.03/
>>>>> 
>>>>> (Only changes are to RuntimeSupport)
>>>>> 
>>>>> Cheers,
>>>>> Jorn
>>>>> 
>>>>> Jorn Vernee schreef op 2019-05-22 12:51:
>>>>>> Ah, good point.
>>>>>> 
>>>>>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>>>>>>> ClassValue<MH>
>>>>>> 
>>>>>> I don't think that last link is quite right though. The LayoutType
>>>>>> references the anonymous Reference class, not References.OfStruct
>>>>>> (which contains the ClassValue).
>>>>>> 
>>>>>> I think it would be:
>>>>>> 
>>>>>> User Code -> LayoutType -> anonymous Reference -> getter MH ->
>>>>>> StructImpl -> LayoutType
>>>>>> 
>>>>>> There could still be a cycle there, but the whole cycle can be 
>>>>>> GC'd
>>>>>> once the reference from user code goes away.
>>>>>> 
>>>>>> Jorn
>>>>>> 
>>>>>> Maurizio Cimadamore schreef op 2019-05-22 12:37:
>>>>>>> Looks good - yesterday I was looking at this discussion:
>>>>>>> 
>>>>>>> http://mail.openjdk.java.net/pipermail/mlvm-dev/2016-January/006563.html 
>>>>>>> I hope we don't run in the condition described there - e.g. that
>>>>>>> there's no strong reachability from the MH we're caching back to 
>>>>>>> the
>>>>>>> static ClassValue instance - because, if that would be the case I
>>>>>>> think that would prevent class unloading.
>>>>>>> 
>>>>>>> The problem is that the MethodHandle we cache refers to the stuct 
>>>>>>> impl
>>>>>>> class, and I believe that class refers to some LayoutTypes on its 
>>>>>>> own,
>>>>>>> which have a Reference inside, so it would be:
>>>>>>> 
>>>>>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>>>>>>> ClassValue<MH>
>>>>>>> 
>>>>>>> Sundar can you double check?
>>>>>>> 
>>>>>>> Maurizio
>>>>>>> 
>>>>>>> On 22/05/2019 10:56, Jorn Vernee wrote:
>>>>>>>> Good suggestion! This solves the problem, is nice and simple, 
>>>>>>>> and keeps the same times in the benchmark.
>>>>>>>> 
>>>>>>>> Updated webrev: 
>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.02/
>>>>>>>> 
>>>>>>>> (only changes to References.java)
>>>>>>>> 
>>>>>>>> I've added a test for the failure. I think that can be included 
>>>>>>>> as well? I re-ran the samples I have as well, and this time it's 
>>>>>>>> all green.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Jorn
>>>>>>>> 
>>>>>>>> Maurizio Cimadamore schreef op 2019-05-22 01:15:
>>>>>>>>> On 21/05/2019 20:16, Jorn Vernee wrote:
>>>>>>>>>> Although, now that you bring it up, I tried re-running some of 
>>>>>>>>>> the samples (hadn't done that yet), and I'm seeing some 
>>>>>>>>>> infinite recursion. This is seemingly caused by a circular 
>>>>>>>>>> type reference (e.g. linked list). i.e. to spin the impl of an 
>>>>>>>>>> accessor we need the LayoutType of the struct itself, which 
>>>>>>>>>> then tries to spin the impl again, and so on. I guess this 
>>>>>>>>>> isn't a test case in our suite yet...
>>>>>>>>>> 
>>>>>>>>>> I'll look into this.
>>>>>>>>> 
>>>>>>>>> Good detective work! I guess it would make sense to try and 
>>>>>>>>> reduce it
>>>>>>>>> down to a simpler test, and push the test first.
>>>>>>>>> 
>>>>>>>>> Where I was going with this is - your patch effectively made 
>>>>>>>>> the lazy
>>>>>>>>> resolution inside StructImplGenerator useless. If we really 
>>>>>>>>> want to
>>>>>>>>> explore that option, then we should, I think, remove all lazy
>>>>>>>>> resolution sites and see what happens. It is possible that we 
>>>>>>>>> don't
>>>>>>>>> rely so much on laziness as we did in the past (we did some 
>>>>>>>>> fixes few
>>>>>>>>> months ago which stabilized resolution quite a bit) - in which 
>>>>>>>>> case we
>>>>>>>>> can remove the resolution requests, although - I have to admit 
>>>>>>>>> - I'm a
>>>>>>>>> bit skeptical. After all all you need it something like this 
>>>>>>>>> (as you
>>>>>>>>> say):
>>>>>>>>> 
>>>>>>>>> struct foo {
>>>>>>>>>     struct foo *next;
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> Which is kind of the killer app for unresolved layouts in the 
>>>>>>>>> first place.
>>>>>>>>> 
>>>>>>>>> This is translated into a struct interface which has a getter 
>>>>>>>>> of
>>>>>>>>> Pointer<foo>. To generate the getter you need to compute its
>>>>>>>>> LayoutType which is a pointer LayoutType, so you have to 
>>>>>>>>> compute the
>>>>>>>>> pointee LayoutType which brings you back where you started (the 
>>>>>>>>> whole
>>>>>>>>> 'foo' LayoutType). In other words, since now the creation of
>>>>>>>>> LayoutType<foo> requires the generation of the struct impl for 
>>>>>>>>> 'foo'
>>>>>>>>> and since that depends (indirectly, through the pointer getter) 
>>>>>>>>> on
>>>>>>>>> being able to produce a LayoutType<foo>, you get a circularity.
>>>>>>>>> 
>>>>>>>>> One thing we could try is - instead of eagerly creating the 
>>>>>>>>> struct
>>>>>>>>> impl, why don't we let the Reference.OfStruct having some 
>>>>>>>>> mutable
>>>>>>>>> state in it? That is, we could start off with Reference getter 
>>>>>>>>> which
>>>>>>>>> does the expensive refelective lookup - but then, once it has
>>>>>>>>> discovered the constructor MH, it can stash it in some field 
>>>>>>>>> (which is
>>>>>>>>> private to that reference object) and use it later if the 
>>>>>>>>> getter is
>>>>>>>>> used again. Then, you probably still need a ClassValue to stash 
>>>>>>>>> a
>>>>>>>>> mapping between a Class and its Reference.OfStruct; but it 
>>>>>>>>> seems like
>>>>>>>>> this could fit in more naturally?
>>>>>>>>> 
>>>>>>>>> Maurizio
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Jorn
>>>>>>>>>> 
>>>>>>>>>> Jorn Vernee schreef op 2019-05-21 21:06:
>>>>>>>>>>> Since we have the resolution context for NativeHeader, AFAIK 
>>>>>>>>>>> there is
>>>>>>>>>>> no more difference between the resolution call done by
>>>>>>>>>>> StructImpleGenerator, and the one done by 
>>>>>>>>>>> LayoutTypeImpl.ofStruct. So
>>>>>>>>>>> I don't think there are any more cases where we would have 
>>>>>>>>>>> succeeded
>>>>>>>>>>> to resolve the Struct layout be delaying spinning the impl. 
>>>>>>>>>>> At least
>>>>>>>>>>> the tests haven't caught such a case.
>>>>>>>>>>> 
>>>>>>>>>>> The other thing is that the partial layout for the getter is 
>>>>>>>>>>> caught in
>>>>>>>>>>> StructImplGenerator, but for the setter it's caught when 
>>>>>>>>>>> calling
>>>>>>>>>>> bitSize on Unresolved. Saying layouts should be able to be 
>>>>>>>>>>> resolved
>>>>>>>>>>> when calling LayoutType.ofStruct means we can use 
>>>>>>>>>>> References.OfGrumpy,
>>>>>>>>>>> which makes the two more uniform.
>>>>>>>>>>> 
>>>>>>>>>>> I have some ideas for keeping the lazy init semantics, but 
>>>>>>>>>>> it's a bit
>>>>>>>>>>> more complex (using a MutableCallSite to mimic indy), and I'm 
>>>>>>>>>>> not sure
>>>>>>>>>>> it will work as well.
>>>>>>>>>>> 
>>>>>>>>>>> And, well, there was some talk about eagerly spinning the
>>>>>>>>>>> implementations any ways :)
>>>>>>>>>>> 
>>>>>>>>>>> Jorn
>>>>>>>>>>> 
>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-05-21 20:09:
>>>>>>>>>>>> Looks good, although I'm a bit worried about the change in 
>>>>>>>>>>>> semantics
>>>>>>>>>>>> w.r.t. eager instantiation. The binder will create a lot of
>>>>>>>>>>>> LayoutTypes when generating the implementation - I wonder 
>>>>>>>>>>>> there were
>>>>>>>>>>>> cases before where we created a partial layout type, which 
>>>>>>>>>>>> then got
>>>>>>>>>>>> resolved correctly by the time it was dereferenced (since we 
>>>>>>>>>>>> do
>>>>>>>>>>>> another resolve lazily in StructImplGenerator [1]).
>>>>>>>>>>>> 
>>>>>>>>>>>> [1] -
>>>>>>>>>>>> http://hg.openjdk.java.net/panama/dev/file/5ea3089be5ac/src/java.base/share/classes/jdk/internal/foreign/StructImplGenerator.java#l52 
>>>>>>>>>>>> On 21/05/2019 14:41, Jorn Vernee wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> After the recent string of benchmarking [1], I've arrived 
>>>>>>>>>>>>> at 2 optimizations to improve the speed of the measured 
>>>>>>>>>>>>> code path.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1.) Specialization of Struct getter MethodHandles per 
>>>>>>>>>>>>> struct class.
>>>>>>>>>>>>> 2.) Implementation of RuntimeSupport::casterImpl that does 
>>>>>>>>>>>>> a fused cast and offset operation, to avoid creating 
>>>>>>>>>>>>> multiple Pointer objects.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The benchmark: 
>>>>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/bench/webrev.00/
>>>>>>>>>>>>> The optimizations: 
>>>>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.00/
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I've split these into 2 so that it's easier to run the 
>>>>>>>>>>>>> benchmarks with and without the optimizations. (benchmark 
>>>>>>>>>>>>> uses the OpenJDK's builtin framework [2]).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Since we're now more eagerly instantiating the struct impl 
>>>>>>>>>>>>> class I had to work around partial struct types, since 
>>>>>>>>>>>>> spinning the impl requires a non-partial type and now we're 
>>>>>>>>>>>>> spinning the impl when creating the LayouType for the 
>>>>>>>>>>>>> struct, as opposed to on the first dereference. To do this 
>>>>>>>>>>>>> I'm detecting whether the struct is partial in 
>>>>>>>>>>>>> LayoutType.ofStruct, and using a Reference.OfGrumpy in the 
>>>>>>>>>>>>> case where it can not be resolved. Tbh, I think this makes 
>>>>>>>>>>>>> things a little more clear as well as far as where/how the 
>>>>>>>>>>>>> exception for deref of a partial type is thrown.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Results on my machine before the optimization are:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Benchmark                        Mode  Cnt Score Error 
>>>>>>>>>>>>> Units
>>>>>>>>>>>>> GetStruct.jni_baseline           avgt   50 14.204 ▒ 0.566 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> GetStruct.panama_get_both        avgt   50 507.638 ▒ 19.462 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50 90.236 ▒ 11.027 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> GetStruct.panama_get_structonly  avgt   50 370.783 ▒ 13.744 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And after:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Benchmark                        Mode  Cnt Score Error 
>>>>>>>>>>>>> Units
>>>>>>>>>>>>> GetStruct.jni_baseline           avgt   50 13.941 ▒ 0.485 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> GetStruct.panama_get_both        avgt   50 41.199 ▒ 1.632 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50 33.432 ▒ 1.889 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> GetStruct.panama_get_structonly  avgt   50 13.469 ▒ 0.781 
>>>>>>>>>>>>> ns/op
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Where panama_get_structonly corresponds to 1., and 
>>>>>>>>>>>>> panama_get_fieldonly corresponds to 2. For a total of about 
>>>>>>>>>>>>> 12x speedup.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1] : 
>>>>>>>>>>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-May/005469.html 
>>>>>>>>>>>>> [2] : https://openjdk.java.net/jeps/230