Vector API: How to write code template specialization ?
Remi Forax
forax at univ-mlv.fr
Thu Apr 16 13:00:19 UTC 2020
I ended up using Unsafe.defineAnonymousClass to inject the VectorOperators constants by constant pool patching [1]
(a rough sketch of the mechanism is below), not something I'm proud of, but at least it solves the issue of the JIT being slow to ramp up.
I will test with defineHiddenClass in the future, once it is integrated into jdk/jdk.
Rémi
[1] https://github.com/forax/panama-vector/blob/master/fr.umlv.jruntime/src/main/java/fr/umlv/jruntime/Cell.java#L851
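
Roughly, the trick looks like the sketch below. This is not the code from [1]: it is a minimal illustration that uses ASM to generate a toy class, the class and method names are made up, it needs ASM and --add-modules jdk.incubator.vector on the class/module path, and sun.misc.Unsafe.defineAnonymousClass has since been deprecated in favor of hidden classes.

import jdk.incubator.vector.VectorOperators;
import org.objectweb.asm.ClassWriter;
import static org.objectweb.asm.Opcodes.*;
import sun.misc.Unsafe;

class CpPatchSketch {
  public static void main(String[] args) throws Exception {
    // generate a tiny class with one static method op() doing: ldc "OP_PLACEHOLDER"; areturn
    var cw = new ClassWriter(0);
    cw.visit(V14, ACC_PUBLIC | ACC_SUPER, "OpTemplate", null, "java/lang/Object", null);
    var mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "op", "()Ljava/lang/Object;", null, null);
    mv.visitCode();
    mv.visitLdcInsn("OP_PLACEHOLDER");
    mv.visitInsn(ARETURN);
    mv.visitMaxs(1, 0);
    mv.visitEnd();
    cw.visitEnd();
    byte[] bytecode = cw.toByteArray();

    // constant pool index of the placeholder string (ASM dedups, so this is the ldc's entry)
    int placeholderIndex = cw.newConst("OP_PLACEHOLDER");

    // patch that entry with the live operator object: the ldc now loads VectorOperators.ADD
    var patches = new Object[placeholderIndex + 1];
    patches[placeholderIndex] = VectorOperators.ADD;

    var theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
    theUnsafe.setAccessible(true);
    var unsafe = (Unsafe) theUnsafe.get(null);

    var specialized = unsafe.defineAnonymousClass(CpPatchSketch.class, bytecode, patches);
    System.out.println(specialized.getMethod("op").invoke(null));  // the injected operator
  }
}

In the real code [1], the patched constant is then used inside the vectorized loops, so the JIT sees the operator as a compile-time constant without having to inline a cold caller.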
----- Original Message -----
> From: "Remi Forax" <forax at univ-mlv.fr>
> To: "Sandhya Viswanathan" <sandhya.viswanathan at intel.com>
> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Sent: Wednesday, April 15, 2020 20:48:08
> Subject: Re: Vector API: How to write code template specialization ?
> ----- Original Message -----
>> From: "Sandhya Viswanathan" <sandhya.viswanathan at intel.com>
>> To: "Remi Forax" <forax at univ-mlv.fr>, "panama-dev at openjdk.java.net'"
>> <panama-dev at openjdk.java.net>
>> Cc: "Paul Sandoz" <paul.sandoz at oracle.com>, "Vladimir Ivanov"
>> <vladimir.x.ivanov at oracle.com>
>> Sent: Wednesday, April 15, 2020 18:48:56
>> Subject: RE: Vector API: How to write code template specialization ?
>
>> Hi Remi,
>>
>> You might already know the following:
>> There is a compile command to force inlining; I wonder if you could use that for
>> your experiments in the meantime.
>> -XX:CompileCommand=inline,class_path.method
>> You could also put these commands in a file and specify that file on the command
>> line instead, using:
>> -XX:CompileCommandFile=<file>
>> Where the <file> contains lines like:
>> inline class1_path.method
>> inline class2_path.*
>> More info in src/hotspot/share/compiler/compilerOracle.cpp.
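>>
>> For example (the file name is arbitrary; the target here is the template method
>> from your inlining trace):
>>
>> $ cat fold.compilecommands
>> inline fr/umlv/jruntime/Cell$VectorizedBackend.foldValueTemplate
>> $ java -XX:CompileCommandFile=fold.compilecommands ...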
>>
>> Hope this helps.
>
> It's great if you can control the command line,
> but I've discovered that this is not enough:
> I have benchmarks that take a looooooong time before reaching the steady state,
> like this one
> (each iteration takes 5 seconds!)
>
> # Warmup Iteration 1: 2262.121 us/op
> # Warmup Iteration 2: 2199.655 us/op
> # Warmup Iteration 3: 121.482 us/op
> # Warmup Iteration 4: 72.948 us/op
> # Warmup Iteration 5: 67.649 us/op
> # Warmup Iteration 6: 67.721 us/op
> # Warmup Iteration 7: 68.968 us/op
> # Warmup Iteration 8: 67.586 us/op
> # Warmup Iteration 9: 67.697 us/op
> # Warmup Iteration 10: 68.652 us/op
>
> The problem is that the code looks like this:
>   ... foldValueAdd() {
>     foldValue(..., VectorOperators.ADD);
>   }
>   ... foldValue(..., VectorOperators.Binary binary) {
>     // a loop that uses the Vector API with `binary`
>   }
>
> so the VM marks the method foldValue as hot while looping inside it; for the next
> call, foldValue is hot but foldValueAdd is still cold because it has only been
> called a dozen times. So the JIT will not optimize foldValueAdd but only
> foldValue, which is unfortunate because the constant that allows foldValue to be
> fully optimized lives in foldValueAdd, and foldValueAdd only gets optimized several
> tens of seconds later if you let the benchmark run for a long time.
>
> I still think there is a need for a way to create a template that can be
> specialized by a VectorOperators constant.
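>
> For reference, here is a fuller, self-contained sketch of that fold template (a
> minimal sketch only, with made-up names, int elements and the scalar tail elided;
> compile and run with --add-modules jdk.incubator.vector):
>
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
>
> class FoldSketch {
>   private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
>
>   // per-operator entry point: here the operator *is* a compile-time constant ...
>   static int foldValueAdd(int[] values) {
>     return foldValue(values, VectorOperators.ADD, 0);
>   }
>
>   // ... but the hot loop is one frame down; the constant only reaches it if
>   // foldValueAdd is inlined into the same compilation as foldValue
>   static int foldValue(int[] values, VectorOperators.Associative op, int identity) {
>     var acc = IntVector.broadcast(SPECIES, identity);
>     int i = 0;
>     for (; i < SPECIES.loopBound(values.length); i += SPECIES.length()) {
>       acc = acc.lanewise(op, IntVector.fromArray(SPECIES, values, i));
>     }
>     // scalar tail elided: assume values.length is a multiple of SPECIES.length()
>     return acc.reduceLanes(op);
>   }
> }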
>
>>
>> Thanks a lot for your feedback.
>>
>> Best Regards,
>> Sandhya
>>
>
> regards,
> Rémi
>
>>
>> -----Original Message-----
>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Remi Forax
>> Sent: Monday, April 13, 2020 10:26 AM
>> To: panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
>> Subject: Vector API: How to write code template specialization ?
>>
>> Hi all,
>> as a kind of study of how to use the Vector API, I am implementing a simple
>> runtime [1] for J (that weird descendant of APL :)
>>
>> It works quite well until you try to share code. Let's say I have code that does
>> a reduction on an array: I can write one version for +, one version for *, etc.,
>> or I can write a method that takes a VectorOperators constant as a parameter, and
>> the JIT will be smart enough to figure out that when I call the method with the
>> constant VectorOperators.ADD, it should specialize the method for ADD. (A
>> hand-written ADD-only version is sketched after the inlining trace below.)
>>
>> So in my runtime, I have a method foldValueADD that calls foldValueTemplate(ADD).
>>
>> And it fails spectacularly, because the JIT thinks that the template method
>> foldValueTemplate is too big to be inlined.
>>
>> fr.umlv.vector.CellBenchMark::add_cell (14 bytes)
>> @ 7 fr.umlv.jruntime.Cell$Dyad::fold (12 bytes) inline (hot)
>> @ 8 fr.umlv.jruntime.Cell$Fold::<init> (56 bytes) inline (hot)
>> @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
>> @ 40 java.util.Objects::requireNonNull (14 bytes) inline (hot)
>> @ 10 fr.umlv.jruntime.Cell::apply (132 bytes) inline (hot)
>> @ 1 fr.umlv.jruntime.Cell$Fold::foldVerbs (20 bytes) inline (hot)
>> @ 58 fr.umlv.jruntime.Cell$Rank$Vector::fold (6 bytes) inline (hot)
>> @ 2 fr.umlv.jruntime.Cell$Rank$Vector::foldValue (31 bytes) inline (hot)
>> @ 8 fr.umlv.jruntime.Cell$Backend::foldValue (193 bytes) inline (hot)
>> @ 21 java.lang.Enum::ordinal (5 bytes) accessor
>> @ 86 fr.umlv.jruntime.Cell$VectorizedBackend::foldValueADD (14 bytes) inline (hot)
>> @ 2 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
>> @ 4 java.lang.invoke.LambdaForm$MH/0x0000000800067840::invoke (8 bytes) force inline by annotation
>> @ 10 fr.umlv.jruntime.Cell$VectorizedBackend::foldValueTemplate (110 bytes) already compiled into a big method
>> @ 17 fr.umlv.jruntime.Cell$Rank::vector (9 bytes) inline (hot)
>> @ 5 fr.umlv.jruntime.Cell$Rank$Vector::<init> (10 bytes) inline (hot)
>> @ 1 java.lang.Record::<init> (5 bytes) inline (hot)
>> @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
>> @ 27 fr.umlv.jruntime.Cell::<init> (15 bytes) inline (hot)
>> @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
>>
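>> For contrast, the hand-written "one version per operator" variant mentioned above
>> avoids the inlining problem entirely, since the operator never travels through a
>> parameter. A simplified, int-only sketch with a made-up name:
>>
>> import jdk.incubator.vector.IntVector;
>> import jdk.incubator.vector.VectorOperators;
>>
>> class HandSpecialized {
>>   static int foldValueADDByHand(int[] values) {
>>     var species = IntVector.SPECIES_PREFERRED;
>>     var acc = IntVector.zero(species);
>>     int i = 0;
>>     for (; i < species.loopBound(values.length); i += species.length()) {
>>       acc = acc.add(IntVector.fromArray(species, values, i));
>>     }
>>     int result = acc.reduceLanes(VectorOperators.ADD);
>>     for (; i < values.length; i++) {   // scalar tail
>>       result += values[i];
>>     }
>>     return result;
>>   }
>> }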
>>
>> Given that I'm not developing my code inside the JDK, I cannot have access to
>> @ForceInline.
>>
>> I think the JIT heuristics need to be tweaked so that a method which takes a
>> VectorOperators constant as a parameter is always inlined.
>> Otherwise, there is little point in exposing all the constants in VectorOperators,
>> given that even a simple reduction takes enough bytecodes to be considered a big
>> method by the JIT.
>>
>> Or maybe there is another solution ?
>>
>> regards,
>> Rémi
>>
>> [1] https://github.com/forax/panama-vector/blob/master/fr.umlv.jruntime/src/main/java/fr/umlv/jruntime/Cell.java#L787