Vector API: How to write code template specialization?

forax at univ-mlv.fr
Wed Apr 15 18:48:08 UTC 2020


----- Original Message -----
> From: "Sandhya Viswanathan" <sandhya.viswanathan at intel.com>
> To: "Remi Forax" <forax at univ-mlv.fr>, "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Cc: "Paul Sandoz" <paul.sandoz at oracle.com>, "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
> Sent: Wednesday, April 15, 2020 18:48:56
> Subject: RE: Vector API: How to write code template specialization?

> Hi Remi,
> 
> You might already know the following:
> There is a compile command to force inline, wonder if you could use that for
> your experiments in the meantime.
> -XX:CompileCommand=inline,class_path.method
> You could also give these in a file and specify that file on command line
> instead using:
> -XX:CompileCommandFile=<file>
> Where the <file> contains lines like:
> inline class1_path.method
> inline class2_path.*
> More info in src/hotspot/share/compiler/compilerOracle.cpp.
> 
> Hope this helps.
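
[For reference, Sandhya's suggestion can be sketched concretely; the class and method names below are placeholders, not the benchmark's actual classes:]

```shell
# Write a CompileCommand file; each line forces inlining of one method
# (or, with *, of every method of a class).
cat > compile_commands.txt <<'EOF'
inline fr.umlv.jruntime.Cell$VectorizedBackend::foldValueTemplate
inline fr.umlv.jruntime.Cell$VectorizedBackend::*
EOF

# Then point the JVM at it (sketch):
#   java -XX:CompileCommandFile=compile_commands.txt -jar benchmarks.jar
cat compile_commands.txt
```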

It's great if you can control the command line, but I've discovered that this is not
enough: I have benchmarks that take a loooooong time before reaching the steady
state, like this one (each iteration lasts 5 seconds!)

# Warmup Iteration   1: 2262.121 us/op
# Warmup Iteration   2: 2199.655 us/op
# Warmup Iteration   3: 121.482 us/op
# Warmup Iteration   4: 72.948 us/op
# Warmup Iteration   5: 67.649 us/op
# Warmup Iteration   6: 67.721 us/op
# Warmup Iteration   7: 68.968 us/op
# Warmup Iteration   8: 67.586 us/op
# Warmup Iteration   9: 67.697 us/op
# Warmup Iteration  10: 68.652 us/op

The problem is that the code looks like this:
  ... foldValueAdd() {
    foldValue(..., VectorOperators.ADD);
  }
  ... foldValue(..., Binary binary) {
    // a loop that uses the Vector API together with `binary`
  }

So the VM marks the method foldValue as hot while looping inside it; at the next call, foldValue is hot but foldValueAdd is still cold, because it has only been called a dozen times. The JIT therefore optimizes foldValue alone and not foldValueAdd, which is unfortunate because the constant that would allow foldValue to be fully optimized sits in foldValueAdd. foldValueAdd itself only gets optimized several tens of seconds later, if you let the benchmark run long enough.
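
[The shape of the problem can be reproduced in plain Java. This is a minimal sketch: a scalar IntBinaryOperator stands in for the Vector API operator, and the class name is made up.]

```java
import java.util.function.IntBinaryOperator;

public class FoldSketch {
  // Cold wrapper: this is where the operator constant lives, but the VM
  // profiles the loop below, not this call site.
  static int foldValueAdd(int[] values) {
    return foldValue(values, Integer::sum);  // stand-in for VectorOperators.ADD
  }

  // Hot template: the VM marks this method hot while looping inside it,
  // so it gets compiled without the constant from the wrapper.
  static int foldValue(int[] values, IntBinaryOperator binary) {
    int acc = 0;
    for (int v : values) {
      acc = binary.applyAsInt(acc, v);
    }
    return acc;
  }

  public static void main(String[] args) {
    System.out.println(foldValueAdd(new int[] { 1, 2, 3, 4 }));  // prints 10
  }
}
```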

I still think there is a need for a way to create a template that can be specialized by a VectorOperators constant.
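
[One pragmatic workaround, until such a mechanism exists, is to warm up the cold wrapper explicitly so it gets compiled with the operator constant in sight. A sketch, reusing the scalar stand-in from above rather than the actual Vector API code:]

```java
import java.util.function.IntBinaryOperator;

public class WarmUpSketch {
  static int foldValueAdd(int[] values) {
    return foldValue(values, Integer::sum);  // stand-in for VectorOperators.ADD
  }

  static int foldValue(int[] values, IntBinaryOperator binary) {
    int acc = 0;
    for (int v : values) acc = binary.applyAsInt(acc, v);
    return acc;
  }

  public static void main(String[] args) {
    int[] sample = { 1, 2, 3, 4 };
    // Call the *wrapper* itself many times, not just the inner loop, so the
    // JIT also compiles foldValueAdd and can see the operator constant at
    // the inlined call site. (20_000 comfortably exceeds the default C2
    // invocation threshold of 10_000.)
    int sink = 0;
    for (int i = 0; i < 20_000; i++) {
      sink += foldValueAdd(sample);
    }
    System.out.println(sink);  // prints 200000
  }
}
```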

> 
> Thanks a lot for your feedback.
> 
> Best Regards,
> Sandhya
> 

regards,
Rémi

> 
> -----Original Message-----
> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Remi Forax
> Sent: Monday, April 13, 2020 10:26 AM
> To: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> Subject: Vector API: How to write code template specialization?
> 
> Hi all,
> as a kind of study of how to use the Vector API, I'm implementing a simple
> runtime [1] for J (that weird descendant of APL :).
> 
> It works quite well until you try to share code. Say I have code that does a
> reduction over an array: I can write one version for +, one version for *, etc.,
> or I can write a single method that takes a VectorOperators constant as a
> parameter and rely on the JIT being smart enough to figure out that when I call
> that method with the constant VectorOperators.ADD, I want the method
> specialized for ADD.
> 
> So in my runtime, I have a method foldValueADD that calls foldValueTemplate(ADD).
> 
> And it fails spectacularly, because the JIT thinks that the template function
> foldValueTemplate is too big to be inlined.
> 
> fr.umlv.vector.CellBenchMark::add_cell (14 bytes)
>   @ 7   fr.umlv.jruntime.Cell$Dyad::fold (12 bytes)   inline (hot)
>     @ 8   fr.umlv.jruntime.Cell$Fold::<init> (56 bytes)   inline (hot)
>       @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
>       @ 40   java.util.Objects::requireNonNull (14 bytes)   inline (hot)
>   @ 10   fr.umlv.jruntime.Cell::apply (132 bytes)   inline (hot)
>     @ 1   fr.umlv.jruntime.Cell$Fold::foldVerbs (20 bytes)   inline (hot)
>     @ 58   fr.umlv.jruntime.Cell$Rank$Vector::fold (6 bytes)   inline (hot)
>       @ 2   fr.umlv.jruntime.Cell$Rank$Vector::foldValue (31 bytes)   inline (hot)
>         @ 8   fr.umlv.jruntime.Cell$Backend::foldValue (193 bytes)   inline (hot)
>           @ 21   java.lang.Enum::ordinal (5 bytes)   accessor
>           @ 86   fr.umlv.jruntime.Cell$VectorizedBackend::foldValueADD (14 bytes)   inline (hot)
>             @ 2   java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes)   force inline by annotation
>               @ 4   java.lang.invoke.LambdaForm$MH/0x0000000800067840::invoke (8 bytes)   force inline by annotation
>             @ 10   fr.umlv.jruntime.Cell$VectorizedBackend::foldValueTemplate (110 bytes)   already compiled into a big method
>         @ 17   fr.umlv.jruntime.Cell$Rank::vector (9 bytes)   inline (hot)
>           @ 5   fr.umlv.jruntime.Cell$Rank$Vector::<init> (10 bytes)   inline (hot)
>             @ 1   java.lang.Record::<init> (5 bytes)   inline (hot)
>               @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
>         @ 27   fr.umlv.jruntime.Cell::<init> (15 bytes)   inline (hot)
>           @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
> 
> 
> Given that I'm not developing my code inside the JDK, I cannot use
> @ForceInline.
> 
> I think the JIT heuristics need to be tweaked so that a method taking a
> VectorOperators constant as a parameter is always inlined.
> Otherwise there is no point in exposing all the constants in VectorOperators,
> given that even a simple reduction takes enough bytecode to be considered a
> big method by the JIT.
> 
> Or maybe there is another solution ?
> 
> regards,
> Rémi
> 
> [1]
> https://github.com/forax/panama-vector/blob/master/fr.umlv.jruntime/src/main/java/fr/umlv/jruntime/Cell.java#L787

