[foreign] RFR 8224481: Optimize struct getter and field getter paths.

Jorn Vernee jbvernee at xs4all.nl
Wed May 22 15:09:29 UTC 2019


Coming back to this once more,

I finally got my profiler working (after setting up a separate project) 
and saw a lot of time spent getting the field offset:

  37.00%       c2, level 4  
jdk.internal.foreign.LayoutPaths$$Lambda$66.0x0000000800c05040::getAsLong, 
version 691
  30.19%       c2, level 4  
jdk.internal.foreign.RuntimeSupport::casterImpl, version 724
  22.12%       c2, level 4  
org.sample.generated.GetStruct_panama_get_fieldonly_jmhTest::panama_get_fieldonly_avgt_jmhStub, 
version 746
  ...

i.e. the call to LayoutPath.offset() in RuntimeSupport::casterImpl can 
not be inlined, and we're re-computing the field offset over and over 
again.

The fix for this is pretty simple; instead of passing the LayoutPath to 
the caster, we pre-compute the offset and then pass that. (This should 
be constant, right?).

This yields some more speedup:

Benchmark                        Mode  Cnt   Score   Error  Units
GetStruct.jni_baseline           avgt   50  13.337 ▒ 0.251  ns/op
GetStruct.panama_get_both        avgt   50  17.026 ▒ 0.458  ns/op
GetStruct.panama_get_fieldonly   avgt   50   7.796 ▒ 0.166  ns/op
GetStruct.panama_get_structonly  avgt   50  11.863 ▒ 0.358  ns/op

Putting us pretty much even with jni_baseline.

Updated Webrev: 
http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.03/

(Only changes are to RuntimeSupport)

Cheers,
Jorn

Jorn Vernee schreef op 2019-05-22 12:51:
> Ah, good point.
> 
>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>> ClassValue<MH>
> 
> I don't think that last link is quite right though. The LayoutType
> references the anonymous Reference class, not References.OfStruct
> (which contains the ClassValue).
> 
> I think it would be:
> 
> User Code -> LayoutType -> anonymous Reference -> getter MH ->
> StructImpl -> LayoutType
> 
> There could still be a cycle there, but the whole cycle can be GC'd
> once the reference from user code goes away.
> 
> Jorn
> 
> Maurizio Cimadamore schreef op 2019-05-22 12:37:
>> Looks good - yesterday I was looking at this discussion:
>> 
>> http://mail.openjdk.java.net/pipermail/mlvm-dev/2016-January/006563.html
>> 
>> I hope we don't run in the condition described there - e.g. that
>> there's no strong reachability from the MH we're caching back to the
>> static ClassValue instance - because, if that would be the case I
>> think that would prevent class unloading.
>> 
>> The problem is that the MethodHandle we cache refers to the stuct impl
>> class, and I believe that class refers to some LayoutTypes on its own,
>> which have a Reference inside, so it would be:
>> 
>> ClassValue<MH> -> MH -> StructImpl -> LayoutType -> Reference -> 
>> ClassValue<MH>
>> 
>> Sundar can you double check?
>> 
>> Maurizio
>> 
>> On 22/05/2019 10:56, Jorn Vernee wrote:
>>> Good suggestion! This solves the problem, is nice and simple, and 
>>> keeps the same times in the benchmark.
>>> 
>>> Updated webrev: 
>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.02/
>>> 
>>> (only changes to References.java)
>>> 
>>> I've added a test for the failure. I think that can be included as 
>>> well? I re-ran the samples I have as well, and this time it's all 
>>> green.
>>> 
>>> Thanks,
>>> Jorn
>>> 
>>> Maurizio Cimadamore schreef op 2019-05-22 01:15:
>>>> On 21/05/2019 20:16, Jorn Vernee wrote:
>>>>> Although, now that you bring it up, I tried re-running some of the 
>>>>> samples (hadn't done that yet), and I'm seeing some infinite 
>>>>> recursion. This is seemingly caused by a circular type reference 
>>>>> (e.g. linked list). i.e. to spin the impl of an accessor we need 
>>>>> the LayoutType of the struct itself, which then tries to spin the 
>>>>> impl again, and so on. I guess this isn't a test case in our suite 
>>>>> yet...
>>>>> 
>>>>> I'll look into this.
>>>> 
>>>> Good detective work! I guess it would make sense to try and reduce 
>>>> it
>>>> down to a simpler test, and push the test first.
>>>> 
>>>> Where I was going with this is - your patch effectively made the 
>>>> lazy
>>>> resolution inside StructImplGenerator useless. If we really want to
>>>> explore that option, then we should, I think, remove all lazy
>>>> resolution sites and see what happens. It is possible that we don't
>>>> rely so much on laziness as we did in the past (we did some fixes 
>>>> few
>>>> months ago which stabilized resolution quite a bit) - in which case 
>>>> we
>>>> can remove the resolution requests, although - I have to admit - I'm 
>>>> a
>>>> bit skeptical. After all all you need it something like this (as you
>>>> say):
>>>> 
>>>> struct foo {
>>>>     struct foo *next;
>>>> }
>>>> 
>>>> Which is kind of the killer app for unresolved layouts in the first 
>>>> place.
>>>> 
>>>> This is translated into a struct interface which has a getter of
>>>> Pointer<foo>. To generate the getter you need to compute its
>>>> LayoutType which is a pointer LayoutType, so you have to compute the
>>>> pointee LayoutType which brings you back where you started (the 
>>>> whole
>>>> 'foo' LayoutType). In other words, since now the creation of
>>>> LayoutType<foo> requires the generation of the struct impl for 'foo'
>>>> and since that depends (indirectly, through the pointer getter) on
>>>> being able to produce a LayoutType<foo>, you get a circularity.
>>>> 
>>>> One thing we could try is - instead of eagerly creating the struct
>>>> impl, why don't we let the Reference.OfStruct having some mutable
>>>> state in it? That is, we could start off with Reference getter which
>>>> does the expensive refelective lookup - but then, once it has
>>>> discovered the constructor MH, it can stash it in some field (which 
>>>> is
>>>> private to that reference object) and use it later if the getter is
>>>> used again. Then, you probably still need a ClassValue to stash a
>>>> mapping between a Class and its Reference.OfStruct; but it seems 
>>>> like
>>>> this could fit in more naturally?
>>>> 
>>>> Maurizio
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Jorn
>>>>> 
>>>>> Jorn Vernee schreef op 2019-05-21 21:06:
>>>>>> Since we have the resolution context for NativeHeader, AFAIK there 
>>>>>> is
>>>>>> no more difference between the resolution call done by
>>>>>> StructImpleGenerator, and the one done by LayoutTypeImpl.ofStruct. 
>>>>>> So
>>>>>> I don't think there are any more cases where we would have 
>>>>>> succeeded
>>>>>> to resolve the Struct layout be delaying spinning the impl. At 
>>>>>> least
>>>>>> the tests haven't caught such a case.
>>>>>> 
>>>>>> The other thing is that the partial layout for the getter is 
>>>>>> caught in
>>>>>> StructImplGenerator, but for the setter it's caught when calling
>>>>>> bitSize on Unresolved. Saying layouts should be able to be 
>>>>>> resolved
>>>>>> when calling LayoutType.ofStruct means we can use 
>>>>>> References.OfGrumpy,
>>>>>> which makes the two more uniform.
>>>>>> 
>>>>>> I have some ideas for keeping the lazy init semantics, but it's a 
>>>>>> bit
>>>>>> more complex (using a MutableCallSite to mimic indy), and I'm not 
>>>>>> sure
>>>>>> it will work as well.
>>>>>> 
>>>>>> And, well, there was some talk about eagerly spinning the
>>>>>> implementations any ways :)
>>>>>> 
>>>>>> Jorn
>>>>>> 
>>>>>> Maurizio Cimadamore schreef op 2019-05-21 20:09:
>>>>>>> Looks good, although I'm a bit worried about the change in 
>>>>>>> semantics
>>>>>>> w.r.t. eager instantiation. The binder will create a lot of
>>>>>>> LayoutTypes when generating the implementation - I wonder there 
>>>>>>> were
>>>>>>> cases before where we created a partial layout type, which then 
>>>>>>> got
>>>>>>> resolved correctly by the time it was dereferenced (since we do
>>>>>>> another resolve lazily in StructImplGenerator [1]).
>>>>>>> 
>>>>>>> [1] -
>>>>>>> http://hg.openjdk.java.net/panama/dev/file/5ea3089be5ac/src/java.base/share/classes/jdk/internal/foreign/StructImplGenerator.java#l52 
>>>>>>> On 21/05/2019 14:41, Jorn Vernee wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> After the recent string of benchmarking [1], I've arrived at 2 
>>>>>>>> optimizations to improve the speed of the measured code path.
>>>>>>>> 
>>>>>>>> 1.) Specialization of Struct getter MethodHandles per struct 
>>>>>>>> class.
>>>>>>>> 2.) Implementation of RuntimeSupport::casterImpl that does a 
>>>>>>>> fused cast and offset operation, to avoid creating multiple 
>>>>>>>> Pointer objects.
>>>>>>>> 
>>>>>>>> The benchmark: 
>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/bench/webrev.00/
>>>>>>>> The optimizations: 
>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8224481/opto/webrev.00/
>>>>>>>> 
>>>>>>>> I've split these into 2 so that it's easier to run the 
>>>>>>>> benchmarks with and without the optimizations. (benchmark uses 
>>>>>>>> the OpenJDK's builtin framework [2]).
>>>>>>>> 
>>>>>>>> Since we're now more eagerly instantiating the struct impl class 
>>>>>>>> I had to work around partial struct types, since spinning the 
>>>>>>>> impl requires a non-partial type and now we're spinning the impl 
>>>>>>>> when creating the LayouType for the struct, as opposed to on the 
>>>>>>>> first dereference. To do this I'm detecting whether the struct 
>>>>>>>> is partial in LayoutType.ofStruct, and using a 
>>>>>>>> Reference.OfGrumpy in the case where it can not be resolved. 
>>>>>>>> Tbh, I think this makes things a little more clear as well as 
>>>>>>>> far as where/how the exception for deref of a partial type is 
>>>>>>>> thrown.
>>>>>>>> 
>>>>>>>> Results on my machine before the optimization are:
>>>>>>>> 
>>>>>>>> Benchmark                        Mode  Cnt    Score Error Units
>>>>>>>> GetStruct.jni_baseline           avgt   50   14.204 ▒ 0.566 
>>>>>>>> ns/op
>>>>>>>> GetStruct.panama_get_both        avgt   50  507.638 ▒ 19.462 
>>>>>>>> ns/op
>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50   90.236 ▒ 11.027 
>>>>>>>> ns/op
>>>>>>>> GetStruct.panama_get_structonly  avgt   50  370.783 ▒ 13.744 
>>>>>>>> ns/op
>>>>>>>> 
>>>>>>>> And after:
>>>>>>>> 
>>>>>>>> Benchmark                        Mode  Cnt   Score Error Units
>>>>>>>> GetStruct.jni_baseline           avgt   50  13.941 ▒ 0.485 ns/op
>>>>>>>> GetStruct.panama_get_both        avgt   50  41.199 ▒ 1.632 ns/op
>>>>>>>> GetStruct.panama_get_fieldonly   avgt   50  33.432 ▒ 1.889 ns/op
>>>>>>>> GetStruct.panama_get_structonly  avgt   50  13.469 ▒ 0.781 ns/op
>>>>>>>> 
>>>>>>>> Where panama_get_structonly corresponds to 1., and 
>>>>>>>> panama_get_fieldonly corresponds to 2. For a total of about 12x 
>>>>>>>> speedup.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Jorn
>>>>>>>> 
>>>>>>>> [1] : 
>>>>>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-May/005469.html 
>>>>>>>> [2] : https://openjdk.java.net/jeps/230


More information about the panama-dev mailing list