Vector API: How to write code template specialization ?
Remi Forax
forax at univ-mlv.fr
Thu Apr 16 13:00:19 UTC 2020
I ended up using Unsafe.defineAnonymousClass to inject the VectorOperators constants by constant pool patching [1]
(a rough sketch of the mechanism is below), not something I'm proud of, but at least it solves the issue of the JIT being slow to ramp up.
I will test with defineHiddenClass in the future, once it is integrated into jdk/jdk.
Rémi
[1] https://github.com/forax/panama-vector/blob/master/fr.umlv.jruntime/src/main/java/fr/umlv/jruntime/Cell.java#L851
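
Roughly, the trick looks like the sketch below. This is not the code from [1]: it is a minimal illustration that uses ASM to generate a toy class, the class and method names are made up, it needs ASM and --add-modules jdk.incubator.vector on the class/module path, and sun.misc.Unsafe.defineAnonymousClass has since been deprecated in favor of hidden classes.

import jdk.incubator.vector.VectorOperators;
import org.objectweb.asm.ClassWriter;
import static org.objectweb.asm.Opcodes.*;
import sun.misc.Unsafe;

class CpPatchSketch {
  public static void main(String[] args) throws Exception {
    // generate a tiny class with one static method op() doing: ldc "OP_PLACEHOLDER"; areturn
    var cw = new ClassWriter(0);
    cw.visit(V14, ACC_PUBLIC | ACC_SUPER, "OpTemplate", null, "java/lang/Object", null);
    var mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "op", "()Ljava/lang/Object;", null, null);
    mv.visitCode();
    mv.visitLdcInsn("OP_PLACEHOLDER");
    mv.visitInsn(ARETURN);
    mv.visitMaxs(1, 0);
    mv.visitEnd();
    cw.visitEnd();
    byte[] bytecode = cw.toByteArray();

    // constant pool index of the placeholder string (ASM dedups, so this is the ldc's entry)
    int placeholderIndex = cw.newConst("OP_PLACEHOLDER");

    // patch that entry with the live operator object: the ldc now loads VectorOperators.ADD
    var patches = new Object[placeholderIndex + 1];
    patches[placeholderIndex] = VectorOperators.ADD;

    var theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
    theUnsafe.setAccessible(true);
    var unsafe = (Unsafe) theUnsafe.get(null);

    var specialized = unsafe.defineAnonymousClass(CpPatchSketch.class, bytecode, patches);
    System.out.println(specialized.getMethod("op").invoke(null));  // the injected operator
  }
}

In the real code [1], the patched constant is then used inside the vectorized loops, so the JIT sees the operator as a compile-time constant without having to inline a cold caller.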
----- Original Message -----
> From: "Remi Forax" <forax at univ-mlv.fr>
> To: "Sandhya Viswanathan" <sandhya.viswanathan at intel.com>
> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Sent: Wednesday, April 15, 2020 20:48:08
> Subject: Re: Vector API: How to write code template specialization ?
> ----- Original Message -----
>> From: "Sandhya Viswanathan" <sandhya.viswanathan at intel.com>
>> To: "Remi Forax" <forax at univ-mlv.fr>, "panama-dev at openjdk.java.net'"
>> <panama-dev at openjdk.java.net>
>> Cc: "Paul Sandoz" <paul.sandoz at oracle.com>, "Vladimir Ivanov"
>> <vladimir.x.ivanov at oracle.com>
>> Sent: Wednesday, April 15, 2020 18:48:56
>> Subject: RE: Vector API: How to write code template specialization ?
>
>> Hi Remi,
>>
>> You might already know the following:
>> There is a compile command to force inlining; I wonder if you could use that for
>> your experiments in the meantime.
>> -XX:CompileCommand=inline,class_path.method
>> You could also put these commands in a file and specify that file on the command
>> line instead, using:
>> -XX:CompileCommandFile=<file>
>> Where the <file> contains lines like:
>> inline class1_path.method
>> inline class2_path.*
>> More info in src/hotspot/share/compiler/compilerOracle.cpp.
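>>
>> For example (the file name is arbitrary; the target here is the template method
>> from your inlining trace):
>>
>> $ cat fold.compilecommands
>> inline fr/umlv/jruntime/Cell$VectorizedBackend.foldValueTemplate
>> $ java -XX:CompileCommandFile=fold.compilecommands ...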
>>
>> Hope this helps.
>
> It's great if you can control the command line,
> but I've discovered that this is not enough:
> I have benchmarks that take a looooooong time before reaching the steady state,
> like this one
> (each iteration takes 5 seconds!)
>
> # Warmup Iteration 1: 2262.121 us/op
> # Warmup Iteration 2: 2199.655 us/op
> # Warmup Iteration 3: 121.482 us/op
> # Warmup Iteration 4: 72.948 us/op
> # Warmup Iteration 5: 67.649 us/op
> # Warmup Iteration 6: 67.721 us/op
> # Warmup Iteration 7: 68.968 us/op
> # Warmup Iteration 8: 67.586 us/op
> # Warmup Iteration 9: 67.697 us/op
> # Warmup Iteration 10: 68.652 us/op
>
> The problem is that the code looks like this:
>   ... foldValueAdd() {
>     foldValue(..., VectorOperators.ADD);
>   }
>   ... foldValue(..., VectorOperators.Binary binary) {
>     // a loop that uses the Vector API with `binary`
>   }
>
> so the VM marks the method foldValue as hot while looping inside it; for the next
> call, foldValue is hot but foldValueAdd is still cold because it has only been
> called a dozen times. So the JIT will not optimize foldValueAdd but only
> foldValue, which is unfortunate because the constant that allows foldValue to be
> fully optimized lives in foldValueAdd, and foldValueAdd only gets optimized several
> tens of seconds later if you let the benchmark run for a long time.
>
> I still think there is a need for a way to create a template that can be
> specialized by a VectorOperators constant.
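>
> For reference, here is a fuller, self-contained sketch of that fold template (a
> minimal sketch only, with made-up names, int elements and the scalar tail elided;
> compile and run with --add-modules jdk.incubator.vector):
>
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
>
> class FoldSketch {
>   private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
>
>   // per-operator entry point: here the operator *is* a compile-time constant ...
>   static int foldValueAdd(int[] values) {
>     return foldValue(values, VectorOperators.ADD, 0);
>   }
>
>   // ... but the hot loop is one frame down; the constant only reaches it if
>   // foldValueAdd is inlined into the same compilation as foldValue
>   static int foldValue(int[] values, VectorOperators.Associative op, int identity) {
>     var acc = IntVector.broadcast(SPECIES, identity);
>     int i = 0;
>     for (; i < SPECIES.loopBound(values.length); i += SPECIES.length()) {
>       acc = acc.lanewise(op, IntVector.fromArray(SPECIES, values, i));
>     }
>     // scalar tail elided: assume values.length is a multiple of SPECIES.length()
>     return acc.reduceLanes(op);
>   }
> }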
>
>>
>> Thanks a lot for your feedback.
>>
>> Best Regards,
>> Sandhya
>>
>
> regards,
> Rémi
>
>>
>> -----Original Message-----
>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Remi Forax
>> Sent: Monday, April 13, 2020 10:26 AM
>> To: panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
>> Subject: Vector API: How to write code template specialization ?
>>
>> Hi all,
>> as a kind of study of how to use the Vector API, I am implementing a simple
>> runtime [1] for J (that weird descendant of APL :)
>>
>> It works quite well until you try to share code. Let's say I have code that does
>> a reduction on an array: I can write one version for +, one version for *, etc.,
>> or I can write a method that takes a VectorOperators constant as a parameter, and
>> the JIT will be smart enough to figure out that when I call the method with the
>> constant VectorOperators.ADD, it should specialize the method for ADD. (A
>> hand-written ADD-only version is sketched after the inlining trace below.)
>>
>> So in my runtime, I have a method foldValueADD that calls foldValueTemplate(ADD).
>>
>> And it fails spectacularly, because the JIT thinks that the template method
>> foldValueTemplate is too big to be inlined.
>>
>> fr.umlv.vector.CellBenchMark::add_cell (14 bytes)
>> @ 7 fr.umlv.jruntime.Cell$Dyad::fold (12 bytes) inline (hot)
>> @ 8 fr.umlv.jruntime.Cell$Fold::<init> (56 bytes) inline (hot)
>> @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
>> @ 40 java.util.Objects::requireNonNull (14 bytes) inline (hot)
>> @ 10 fr.umlv.jruntime.Cell::apply (132 bytes) inline (hot)
>> @ 1 fr.umlv.jruntime.Cell$Fold::foldVerbs (20 bytes) inline (hot)
>> @ 58 fr.umlv.jruntime.Cell$Rank$Vector::fold (6 bytes) inline (hot)
>> @ 2 fr.umlv.jruntime.Cell$Rank$Vector::foldValue (31 bytes) inline (hot)
>> @ 8 fr.umlv.jruntime.Cell$Backend::foldValue (193 bytes) inline (hot)
>> @ 21 java.lang.Enum::ordinal (5 bytes) accessor
>> @ 86 fr.umlv.jruntime.Cell$VectorizedBackend::foldValueADD (14 bytes) inline (hot)
>> @ 2 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
>> @ 4 java.lang.invoke.LambdaForm$MH/0x0000000800067840::invoke (8 bytes) force inline by annotation
>> @ 10 fr.umlv.jruntime.Cell$VectorizedBackend::foldValueTemplate (110 bytes) already compiled into a big method
>> @ 17 fr.umlv.jruntime.Cell$Rank::vector (9 bytes) inline (hot)
>> @ 5 fr.umlv.jruntime.Cell$Rank$Vector::<init> (10 bytes) inline (hot)
>> @ 1 java.lang.Record::<init> (5 bytes) inline (hot)
>> @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
>> @ 27 fr.umlv.jruntime.Cell::<init> (15 bytes) inline (hot)
>> @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
>>
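>> For contrast, the hand-written "one version per operator" variant mentioned above
>> avoids the inlining problem entirely, since the operator never travels through a
>> parameter. A simplified, int-only sketch with a made-up name:
>>
>> import jdk.incubator.vector.IntVector;
>> import jdk.incubator.vector.VectorOperators;
>>
>> class HandSpecialized {
>>   static int foldValueADDByHand(int[] values) {
>>     var species = IntVector.SPECIES_PREFERRED;
>>     var acc = IntVector.zero(species);
>>     int i = 0;
>>     for (; i < species.loopBound(values.length); i += species.length()) {
>>       acc = acc.add(IntVector.fromArray(species, values, i));
>>     }
>>     int result = acc.reduceLanes(VectorOperators.ADD);
>>     for (; i < values.length; i++) {   // scalar tail
>>       result += values[i];
>>     }
>>     return result;
>>   }
>> }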
>>
>> Given that I'm not developing my code inside the JDK, I cannot have access to
>> @ForceInline.
>>
>> I think the JIT heuristics need to be tweaked so that a method which takes a
>> VectorOperators constant as a parameter is always inlined.
>> Otherwise, there is little point in exposing all the constants in VectorOperators,
>> given that even a simple reduction takes enough bytecodes to be considered a big
>> method by the JIT.
>>
>> Or maybe there is another solution ?
>>
>> regards,
>> Rémi
>>
>> [1] https://github.com/forax/panama-vector/blob/master/fr.umlv.jruntime/src/main/java/fr/umlv/jruntime/Cell.java#L787