[foreign] RFR 8224481: Optimize struct getter and field getter paths.

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Wed May 22 15:30:37 UTC 2019


Looks good - module pending questions on use of ClassValue.

I think we should come up with some kind of test case that shows the 
ClassValue issue and then test with different approaches

Maurizio

On 22/05/2019 16:09, Jorn Vernee wrote:
> Coming back to this once more,
>
> I finally got my profiler working (after setting up a separate 
> project) and saw a lot of time spent getting the field offset:
>
>  37.00%       c2, level 4 
> jdk.internal.foreign.LayoutPaths$$Lambda$66.0x0000000800c05040::getAsLong, 
> version 691
>  30.19%       c2, level 4 
> jdk.internal.foreign.RuntimeSupport::casterImpl, version 724
>  22.12%       c2, level 4 
> org.sample.generated.GetStruct_panama_get_fieldonly_jmhTest::panama_get_fieldonly_avgt_jmhStub, 
> version 746
>  ...
>
> i.e. the call to LayoutPath.offset() in RuntimeSupport::casterImpl can 
> not be inlined, and we're re-computing the field offset over and over 
> again.
>
> The fix for this is pretty simple; instead of passing the LayoutPath 
> to the caster, we pre-compute the offset and then pass that. (This 
> should be constant, right?).
>
> This yields some more speedup:
>
> Benchmark                        Mode  Cnt   Score   Error  Units
> GetStruct.jni_baseline           avgt   50  13.337 ▒ 0.251  ns/op
> GetStruct.panama_get_both        avgt   50  17.026 ▒ 0.458  ns/op
> GetStruct.panama_get_fieldonly   avgt   50   7.796 ▒ 0.166  ns/op
> GetStruct.panama_get_structonly  avgt   50  11.863 ▒ 0.358  ns/op
>
> Putting us pretty much even with jni_baseline.
>
> Updated Webrev: 
> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.03/
>
> (Only changes are to RuntimeSupport)
>
> Cheers,
> Jorn
>
> Jorn Vernee schreef op 2019-05-22 12:51:
>> Ah, good point.
>>
>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>>> ClassValue<MH>
>>
>> I don't think that last link is quite right though. The LayoutType
>> references the anonymous Reference class, not References.OfStruct
>> (which contains the ClassValue).
>>
>> I think it would be:
>>
>> User Code -> LayoutType -> anonymous Reference -> getter MH ->
>> StructImpl -> LayoutType
>>
>> There could still be a cycle there, but the whole cycle can be GC'd
>> once the reference from user code goes away.
>>
>> Jorn
>>
>> Maurizio Cimadamore schreef op 2019-05-22 12:37:
>>> Looks good - yesterday I was looking at this discussion:
>>>
>>> http://mail.openjdk.java.net/pipermail/mlvm-dev/2016-January/006563.html 
>>>
>>>
>>> I hope we don't run in the condition described there - e.g. that
>>> there's no strong reachability from the MH we're caching back to the
>>> static ClassValue instance - because, if that would be the case I
>>> think that would prevent class unloading.
>>>
>>> The problem is that the MethodHandle we cache refers to the stuct impl
>>> class, and I believe that class refers to some LayoutTypes on its own,
>>> which have a Reference inside, so it would be:
>>>
>>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>>> ClassValue<MH>
>>>
>>> Sundar can you double check?
>>>
>>> Maurizio
>>>
>>> On 22/05/2019 10:56, Jorn Vernee wrote:
>>>> Good suggestion! This solves the problem, is nice and simple, and 
>>>> keeps the same times in the benchmark.
>>>>
>>>> Updated webrev: 
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.02/
>>>>
>>>> (only changes to References.java)
>>>>
>>>> I've added a test for the failure. I think that can be included as 
>>>> well? I re-ran the samples I have as well, and this time it's all 
>>>> green.
>>>>
>>>> Thanks,
>>>> Jorn
>>>>
>>>> Maurizio Cimadamore schreef op 2019-05-22 01:15:
>>>>> On 21/05/2019 20:16, Jorn Vernee wrote:
>>>>>> Although, now that you bring it up, I tried re-running some of 
>>>>>> the samples (hadn't done that yet), and I'm seeing some infinite 
>>>>>> recursion. This is seemingly caused by a circular type reference 
>>>>>> (e.g. linked list). i.e. to spin the impl of an accessor we need 
>>>>>> the LayoutType of the struct itself, which then tries to spin the 
>>>>>> impl again, and so on. I guess this isn't a test case in our 
>>>>>> suite yet...
>>>>>>
>>>>>> I'll look into this.
>>>>>
>>>>> Good detective work! I guess it would make sense to try and reduce it
>>>>> down to a simpler test, and push the test first.
>>>>>
>>>>> Where I was going with this is - your patch effectively made the lazy
>>>>> resolution inside StructImplGenerator useless. If we really want to
>>>>> explore that option, then we should, I think, remove all lazy
>>>>> resolution sites and see what happens. It is possible that we don't
>>>>> rely so much on laziness as we did in the past (we did some fixes few
>>>>> months ago which stabilized resolution quite a bit) - in which 
>>>>> case we
>>>>> can remove the resolution requests, although - I have to admit - 
>>>>> I'm a
>>>>> bit skeptical. After all all you need it something like this (as you
>>>>> say):
>>>>>
>>>>> struct foo {
>>>>>     struct foo *next;
>>>>> }
>>>>>
>>>>> Which is kind of the killer app for unresolved layouts in the 
>>>>> first place.
>>>>>
>>>>> This is translated into a struct interface which has a getter of
>>>>> Pointer<foo>. To generate the getter you need to compute its
>>>>> LayoutType which is a pointer LayoutType, so you have to compute the
>>>>> pointee LayoutType which brings you back where you started (the whole
>>>>> 'foo' LayoutType). In other words, since now the creation of
>>>>> LayoutType<foo> requires the generation of the struct impl for 'foo'
>>>>> and since that depends (indirectly, through the pointer getter) on
>>>>> being able to produce a LayoutType<foo>, you get a circularity.
>>>>>
>>>>> One thing we could try is - instead of eagerly creating the struct
>>>>> impl, why don't we let the Reference.OfStruct having some mutable
>>>>> state in it? That is, we could start off with Reference getter which
>>>>> does the expensive refelective lookup - but then, once it has
>>>>> discovered the constructor MH, it can stash it in some field 
>>>>> (which is
>>>>> private to that reference object) and use it later if the getter is
>>>>> used again. Then, you probably still need a ClassValue to stash a
>>>>> mapping between a Class and its Reference.OfStruct; but it seems like
>>>>> this could fit in more naturally?
>>>>>
>>>>> Maurizio
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Jorn
>>>>>>
>>>>>> Jorn Vernee schreef op 2019-05-21 21:06:
>>>>>>> Since we have the resolution context for NativeHeader, AFAIK 
>>>>>>> there is
>>>>>>> no more difference between the resolution call done by
>>>>>>> StructImpleGenerator, and the one done by 
>>>>>>> LayoutTypeImpl.ofStruct. So
>>>>>>> I don't think there are any more cases where we would have 
>>>>>>> succeeded
>>>>>>> to resolve the Struct layout be delaying spinning the impl. At 
>>>>>>> least
>>>>>>> the tests haven't caught such a case.
>>>>>>>
>>>>>>> The other thing is that the partial layout for the getter is 
>>>>>>> caught in
>>>>>>> StructImplGenerator, but for the setter it's caught when calling
>>>>>>> bitSize on Unresolved. Saying layouts should be able to be resolved
>>>>>>> when calling LayoutType.ofStruct means we can use 
>>>>>>> References.OfGrumpy,
>>>>>>> which makes the two more uniform.
>>>>>>>
>>>>>>> I have some ideas for keeping the lazy init semantics, but it's 
>>>>>>> a bit
>>>>>>> more complex (using a MutableCallSite to mimic indy), and I'm 
>>>>>>> not sure
>>>>>>> it will work as well.
>>>>>>>
>>>>>>> And, well, there was some talk about eagerly spinning the
>>>>>>> implementations any ways :)
>>>>>>>
>>>>>>> Jorn
>>>>>>>
>>>>>>> Maurizio Cimadamore schreef op 2019-05-21 20:09:
>>>>>>>> Looks good, although I'm a bit worried about the change in 
>>>>>>>> semantics
>>>>>>>> w.r.t. eager instantiation. The binder will create a lot of
>>>>>>>> LayoutTypes when generating the implementation - I wonder there 
>>>>>>>> were
>>>>>>>> cases before where we created a partial layout type, which then 
>>>>>>>> got
>>>>>>>> resolved correctly by the time it was dereferenced (since we do
>>>>>>>> another resolve lazily in StructImplGenerator [1]).
>>>>>>>>
>>>>>>>> [1] -
>>>>>>>> http://hg.openjdk.java.net/panama/dev/file/5ea3089be5ac/src/java.base/share/classes/jdk/internal/foreign/StructImplGenerator.java#l52 
>>>>>>>> On 21/05/2019 14:41, Jorn Vernee wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> After the recent string of benchmarking [1], I've arrived at 2 
>>>>>>>>> optimizations to improve the speed of the measured code path.
>>>>>>>>>
>>>>>>>>> 1.) Specialization of Struct getter MethodHandles per struct 
>>>>>>>>> class.
>>>>>>>>> 2.) Implementation of RuntimeSupport::casterImpl that does a 
>>>>>>>>> fused cast and offset operation, to avoid creating multiple 
>>>>>>>>> Pointer objects.
>>>>>>>>>
>>>>>>>>> The benchmark: 
>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/bench/webrev.00/
>>>>>>>>> The optimizations: 
>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.00/
>>>>>>>>>
>>>>>>>>> I've split these into 2 so that it's easier to run the 
>>>>>>>>> benchmarks with and without the optimizations. (benchmark uses 
>>>>>>>>> the OpenJDK's builtin framework [2]).
>>>>>>>>>
>>>>>>>>> Since we're now more eagerly instantiating the struct impl 
>>>>>>>>> class I had to work around partial struct types, since 
>>>>>>>>> spinning the impl requires a non-partial type and now we're 
>>>>>>>>> spinning the impl when creating the LayouType for the struct, 
>>>>>>>>> as opposed to on the first dereference. To do this I'm 
>>>>>>>>> detecting whether the struct is partial in 
>>>>>>>>> LayoutType.ofStruct, and using a Reference.OfGrumpy in the 
>>>>>>>>> case where it can not be resolved. Tbh, I think this makes 
>>>>>>>>> things a little more clear as well as far as where/how the 
>>>>>>>>> exception for deref of a partial type is thrown.
>>>>>>>>>
>>>>>>>>> Results on my machine before the optimization are:
>>>>>>>>>
>>>>>>>>> Benchmark                        Mode  Cnt Score Error Units
>>>>>>>>> GetStruct.jni_baseline           avgt   50 14.204 ▒ 0.566 ns/op
>>>>>>>>> GetStruct.panama_get_both        avgt   50 507.638 ▒ 19.462 ns/op
>>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50 90.236 ▒ 11.027 ns/op
>>>>>>>>> GetStruct.panama_get_structonly  avgt   50 370.783 ▒ 13.744 ns/op
>>>>>>>>>
>>>>>>>>> And after:
>>>>>>>>>
>>>>>>>>> Benchmark                        Mode  Cnt   Score Error Units
>>>>>>>>> GetStruct.jni_baseline           avgt   50  13.941 ▒ 0.485 ns/op
>>>>>>>>> GetStruct.panama_get_both        avgt   50  41.199 ▒ 1.632 ns/op
>>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50  33.432 ▒ 1.889 ns/op
>>>>>>>>> GetStruct.panama_get_structonly  avgt   50  13.469 ▒ 0.781 ns/op
>>>>>>>>>
>>>>>>>>> Where panama_get_structonly corresponds to 1., and 
>>>>>>>>> panama_get_fieldonly corresponds to 2. For a total of about 
>>>>>>>>> 12x speedup.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jorn
>>>>>>>>>
>>>>>>>>> [1] : 
>>>>>>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-May/005469.html 
>>>>>>>>> [2] : https://openjdk.java.net/jeps/230


More information about the panama-dev mailing list