slow performance of loom continuations
seth lytle
seth.lytle at gmail.com
Thu Oct 4 18:29:33 UTC 2018
> i'll try to download the latest loom tonight (i have limited bandwidth)
long overdue followup - finally remembered to download the latest
tip.tar.bz2 last night (ebaa9b6ab5fb)
for ii in {100..300..20}; do printf "N=%4d " $ii; $loom/bin/java -cp .:$cp Xorshift 200000 10 $ii | tail -n 1; done
N= 100 ==== kilim average : 933.72 nanos/op
N= 120 ==== kilim average : 1040.01 nanos/op
N= 140 ==== kilim average : 1193.13 nanos/op
N= 160 ==== kilim average : 1375.73 nanos/op
N= 180 ==== kilim average : 1508.22 nanos/op
N= 200 ==== kilim average : 1715.63 nanos/op
N= 220 ==== kilim average : 1955.03 nanos/op
N= 240 ==== kilim average : 2022.04 nanos/op
N= 260 ==== kilim average : 2145.09 nanos/op
N= 280 ==== kilim average : 2326.35 nanos/op
N= 300 ==== kilim average : 2551.17 nanos/op
for ii in {100..300..20}; do printf "N=%4d " $ii; $loom/bin/java $loomargs ../XorMulti.java 200000 10 $ii | tail -n 1; done
N= 100 ==== loom average : 1426.19 nanos/op
N= 120 ==== loom average : 1441.29 nanos/op
N= 140 ==== loom average : 1452.80 nanos/op
N= 160 ==== loom average : 1469.27 nanos/op
N= 180 ==== loom average : 1453.50 nanos/op
N= 200 ==== loom average : 1442.56 nanos/op
N= 220 ==== loom average : 1480.15 nanos/op
N= 240 ==== loom average : 1464.65 nanos/op
N= 260 ==== loom average : 1961.71 nanos/op
N= 280 ==== loom average : 2152.45 nanos/op
N= 300 ==== loom average : 1938.06 nanos/op
echo $loomargs
-XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode -XX:+UseNewCode2 -XX:+UseNewCode3
both variants are running on the loom jvm. the break-even point is
approximately a stack depth of 170
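
for what it's worth, the ~170 figure falls out of a straight-line fit of the
kilim numbers against the roughly flat loom baseline. a minimal sketch of that
arithmetic (the constants below are just read off the two tables above, so
treat it as illustrative, not a measurement):

public class BreakEven {
    public static void main(String[] args) {
        // kilim grows roughly linearly with depth:
        // ~933.72 ns/op at N=100, ~2551.17 ns/op at N=300 (first table)
        double slope = (2551.17 - 933.72) / (300 - 100);   // ~8.1 ns per extra frame
        // loom stays roughly flat around ~1450 ns/op up to N~240 (second table)
        double loomFlat = 1450.0;
        // solve 933.72 + slope * (N - 100) = loomFlat for N
        double breakEven = 100 + (loomFlat - 933.72) / slope;
        System.out.printf("estimated break-even depth: ~%.0f frames%n", breakEven);
    }
}
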
On Tue, Sep 18, 2018 at 2:32 PM, seth lytle <seth.lytle at gmail.com> wrote:
> Hi Hamlin,
>
> thanks for taking a look at this
>
> - kilim.Task is a higher level construct than kilim.Continuation that
> handles scheduling
> - kilim doesn't make any attempt to be clever, so the kilim.Continuation
> overhead should grow linearly with stack depth
> - the number of continuations shouldn't affect kilim (this is its primary
> use-case)
> - sorry for the build issues - the kilim/kilim github is quite out of date
> - kilim 2.0 is available in maven central as org.db4j:kilim:2.0.0-23 or at
> https://github.com/db4j/kilim
> - i was a bit scared to run the N=10000 depth case!
>
>
> as expected, kilim scaled with stack depth. in ron's loom presentation, he
> discussed a stack barrier, which allows loom to copy only the bottom of the
> stack (i'm assuming this explains the constant scale). the concept could be
> applied to kilim as well and has been discussed, but it's not a use-case that
> kilim is especially interested in. kilim is limited in what code it can weave
> - only methods that declare throws kilim.Pausable - which rules out any code
> that wasn't written specifically for kilim. eg, to integrate with spring,
> you'd use the spring async api instead of attempting to weave spring.
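>
> to make the weaving constraint concrete, here's a minimal sketch of what the
> kilim side of the Xorshift benchmark looks like, inferred from the api mapping
> described in the original message below (run() called on the continuation
> itself, kilim.Fiber.yield() for the yield, and execute() declaring
> kilim.Pausable so the weaver will touch it) - treat the exact signatures as
> approximate, and note that the class still has to go through the kilim weaver
> before it will run:
>
> import kilim.Continuation;
> import kilim.Fiber;
> import kilim.Pausable;
>
> public class XorshiftKilim extends Continuation {
>     long result;
>
>     // only methods that declare "throws Pausable" get woven by kilim
>     public void execute() throws Pausable {
>         long x, y, s0 = 103, s1 = 17;
>         while (true) {
>             x = s0;
>             y = s1;
>             s0 = y;
>             x ^= (x << 23);
>             s1 = x ^ y ^ (x >> 17) ^ (y >> 26);
>             result = s1 + y;
>             Fiber.yield();   // kilim's analogue of Continuation.yield(SCOPE)
>         }
>     }
>
>     public static void main(String[] args) {
>         XorshiftKilim xor = new XorshiftKilim();
>         long val = 0;
>         for (int ii = 0; ii < 200000; ii++) {
>             xor.run();       // resumes execute() until the next Fiber.yield()
>             val ^= xor.result;
>         }
>         System.out.println(val);
>     }
> }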
>
>
> stack depth:
> N= 0 ==== kilim average : 71.80 nanos/op
> N= 1 ==== kilim average : 85.03 nanos/op
> N= 10 ==== kilim average : 167.63 nanos/op
> N= 100 ==== kilim average : 1193.66 nanos/op
> N= 500 ==== kilim average : 6359.27 nanos/op
> N=1000 ==== kilim average : 10047.60 nanos/op
> N=3000 ==== kilim average : 33714.19 nanos/op
> N=5000 ==== kilim average : 57494.85 nanos/op
> N=10000 ==== java.lang.StackOverflowError
>
> number of continuations:
> N= 1 ==== kilim average : 71.80 nanos/op
> N= 10 ==== kilim average : 75.97 nanos/op
> N= 20 ==== kilim average : 84.31 nanos/op
> N= 30 ==== kilim average : 71.78 nanos/op
> N= 50 ==== kilim average : 74.97 nanos/op
> N= 100 ==== kilim average : 72.62 nanos/op
> N= 110 ==== kilim average : 74.05 nanos/op
> N= 120 ==== kilim average : 76.38 nanos/op
> N= 125 ==== kilim average : 75.01 nanos/op
> N= 128 ==== kilim average : 79.61 nanos/op
> N= 130 ==== kilim average : 75.09 nanos/op
> N= 500 ==== kilim average : 77.71 nanos/op
> N=1000 ==== kilim average : 75.08 nanos/op
> N=10000 ==== kilim average : 77.39 nanos/op
> N=100000 ==== kilim average : 80.73 nanos/op
>
> i'll try to download the latest loom tonight (i have limited bandwidth)
>
> i'm not sure there's much benefit in comparing kilim and loom in general -
> kilim is very limited in what code it can weave, and loom aspires to run
> almost everything, so the use-cases are quite different. but if any of this
> helps identify bottlenecks, then it's time well spent
>
>
> source code is attached (the changes are minor)
>
> On Tue, Sep 18, 2018 at 1:17 AM, Hamlin Li <huaming.li at oracle.com> wrote:
>
>> Hi Seth,
>>
>> Thank you for comparing continuation performance between kilim and loom.
>> After spending some time studying the continuation (called Task?)
>> implementation in kilim, I'm interested in two more performance measurements
>> for kilim/loom continuations:
>> 1. when continuation stack depth grows, what's the performance?
>> 2. when continuation number grows, what's the performance?
>>
>> I refactored your code a little to get the above performance data for loom.
>> (code is attached at the bottom)
>>
>> The following is data with continuation stack depth N changing from 1 to
>> 10,000 and only 1 continuation in total:
>> (command to run: java -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode
>> -XX:+UseNewCode2 -XX:+UseNewCode3 Xorshift 200000 10 N)
>> continuation stack depth: 1, perf: 1610.26 nanos/op
>> continuation stack depth: 10, perf: 1607.82 nanos/op
>> continuation stack depth: 100, perf: 1566.41 nanos/op
>> continuation stack depth: *500*, perf: *2179.94* nanos/op
>> continuation stack depth: *1000*, perf: *2145.76* nanos/op
>> continuation stack depth: *3000*, perf: *2105.57* nanos/op
>> continuation stack depth: 5000, perf: 1517.82 nanos/op
>> continuation stack depth: 10000, perf: 1673.76 nanos/op
>>
>> The following is data with continuation number N changing from 1 to 100,000
>> and a continuation stack depth of only 1 for every continuation:
>> (command to run: java -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode
>> -XX:+UseNewCode2 -XX:+UseNewCode3 Xorshift 200000 10 1 N)
>> continuation number: 1, perf: 1514.10 nanos/op
>> continuation number: 10, perf: 1936.75 nanos/op
>> continuation number: 30, perf: 2455.95 nanos/op
>> continuation number: *50*, perf: *3214.05* nanos/op
>> continuation number: *100*, perf: *4241.50* nanos/op
>> continuation number: *110*, perf: *3969.00* nanos/op
>> continuation number: *120*, perf: *4822.41* nanos/op
>> continuation number: *125*, perf: *5080.64* nanos/op
>> continuation number: *128*, perf: *7963.00* nanos/op
>> continuation number: 130, perf: 1907.96 nanos/op
>> continuation number: 500, perf: 1854.22 nanos/op
>> continuation number: 1000, perf: 1844.98 nanos/op
>> continuation number: 10000, perf: 2064.03 nanos/op
>> continuation number: 100000, perf: 2156.77 nanos/op
>>
>> The above data for loom shows that although the baseline is quite high (more
>> than 1000 nanos/op with 1 continuation and stack depth 1), when the stack
>> depth or the continuation number grows, time/op grows very slowly, and it
>> even drops when the continuation number and stack depth keep growing (I'm
>> not sure what the cause is).
>>
>> I'm interested in the corresponding data for kilim. I tried to get it myself
>> but failed to build kilim; would you mind helping to collect the same data
>> for kilim?
>> With this data we can compare kilim with loom more comprehensively, and it
>> might also help to find the performance bottleneck and improve loom's
>> continuation performance in the future.
>>
>> Thank you
>> -Hamlin
>>
>>
>> ==================================================================
>> public class Xorshift {
>> static final ContinuationScope SCOPE = new ContinuationScope() {};
>> Continuation ctus[];
>> long result;
>> int depth;
>> boolean debug;
>> int ctuNum;
>> double sum;
>> long times;
>>
>> void average() {
>> System.out.format("==== loom average : %-10.2f nanos/op%n", sum/times);
>> }
>>
>> Xorshift(int depth, int ctuNum, boolean debug) {
>> this.depth = depth;
>> this.ctuNum = ctuNum;
>> this.debug = debug;
>> ctus = new Continuation[ctuNum];
>> for (int i = 0; i < ctuNum; i++) {
>> ctus[i] = new Continuation(SCOPE, this::recursiveCall);
>> }
>> }
>>
>> void warmup(long num) {
>> long warmup = 5000000000L;
>> final long start = System.nanoTime();
>> long dummy = 0;
>> while (System.nanoTime() - start < warmup)
>> dummy = dummy ^ loop(num);
>> System.out.println("warmup: " + dummy);
>> }
>> void cycle(long num) {
>> final long start = System.nanoTime();
>> long val = loop(num);
>> long duration = System.nanoTime() - start;
>> double tmp = 1.0*duration/num;
>> sum += tmp;
>> times++;
>> System.out.format("loom : %-10.2f nanos/op, %30d\n", tmp, val);
>> }
>> public long loop(long num) {
>> long val = 0;
>> for (int ii=0; ii < num; ii++) {
>> ctus[ii%ctuNum].run();
>> val = val ^ result;
>> }
>> return val;
>> }
>>
>> void recursiveCall() {
>> //System.out.println("initial d: " + depth);
>> recursiveCall(depth);
>> }
>> void recursiveCall(int d) {
>> //System.out.println("d: " + d);
>> while(d > 0) {
>> recursiveCall(--d);
>> }
>> execute();
>> }
>> public void execute() {
>> if (debug) {
>> new Throwable().printStackTrace();
>> }
>> long x, y, s0=103, s1=17;
>> while (true) {
>> x = s0;
>> y = s1;
>> s0 = y;
>> x ^= (x << 23);
>> s1 = x ^ y ^ (x >> 17) ^ (y >> 26);
>> result = (s1 + y);
>> Continuation.yield(SCOPE);
>> }
>> }
>>
>> public static void main(String[] args) {
>>
>> long cycles = 200000;
>> int reps = 10;
>> int depth = 0;
>> boolean debug = false;
>> int ctuNum = 1;
>>
>> try { cycles = Long.parseLong(args[0]); } catch (Exception ex) {}
>> try { reps = Integer.parseInt(args[1]); } catch (Exception ex) {}
>> try { depth = Integer.parseInt(args[2]); } catch (Exception ex) {}
>> try { ctuNum = Integer.parseInt(args[3]); } catch (Exception ex) {}
>> try { debug = Boolean.parseBoolean(args[4]); } catch (Exception ex) {}
>>
>> if (args.length == 0) {
>> System.out.println("args: number of cycles, number of repeats, depth of recursive calls, continuation number, debug");
>> System.out.format("\t no args provided using defaults: %d %d %d %d %b\n", cycles, reps, depth, ctuNum, debug);
>> } else {
>> System.out.format("\t cycle: %d, reps: %d, depth: %d, ctuNum: %d, debug: %b\n", cycles, reps, depth, ctuNum, debug);
>> }
>>
>> new Xorshift(depth, ctuNum, debug).warmup(cycles);
>> Xorshift xor = new Xorshift(depth, ctuNum, debug);
>>
>> for (int jj=0; jj < reps; jj++) {
>> //Xorshift xor = new Xorshift(depth, ctuNum, debug);
>> xor.cycle(cycles);
>> }
>> xor.average();
>> }
>> }
>>
>>
>> On 2018/9/7 12:47 PM, seth lytle wrote:
>>
>> i ported a Xorshift implementation from kilim to project loom using the
>> prototype that was announced in august. the Continuation apis are similar -
>> the only changes are ctu.run() instead of run()
>> and Continuation.yield(SCOPE) instead of kilim.Fiber.yield(), and the code
>> runs and produces the same answers. however, performance with loom is on
>> the order of 500x slower. i tried both f46dc5c01b7d (2% faster) and
>> 544bfe4ccd84
>>
>> kilim: 32.30 nanos/op, -4971871656801550503
>> kilim: 36.99 nanos/op, 2357119851256129241
>> kilim: 32.60 nanos/op, 8340372355868382387
>>
>> loom : 18303.18 nanos/op, -4971871656801550503
>> loom : 18107.61 nanos/op, 2357119851256129241
>> loom : 18184.15 nanos/op, 8340372355868382387
>>
>>
>> with -XX:+UnlockExperimentalVMOptions -XX:+UseNewCode
>>
>> loom : 2072.82 nanos/op, -4971871656801550503
>> loom : 1879.37 nanos/op, 2357119851256129241
>> loom : 1841.58 nanos/op, 8340372355868382387
>>
>>
>> am i using the api correctly? i do use exceptions in other continuations,
>> so UseNewCode isn't really an option for me
>>
>> i realize that this is an early prototype. do you have any idea what
>> performance you're ultimately shooting for with this sort of loop?
>>
>> is there a tradeoff between performance in these simple cases and being
>> able to weave "most" production code?
>>
>>
>>
>> here's the full example:
>>
>>
>> public class Xorshift {
>> static final ContinuationScope SCOPE = new ContinuationScope() {};
>> Continuation ctu = new Continuation(SCOPE,this::execute);
>> long result;
>>
>> void warmup(long num) {
>> long warmup = 5000000000L;
>> final long start = System.nanoTime();
>> long dummy = 0;
>> while (System.nanoTime() - start < warmup)
>> dummy = dummy ^ loop(num);
>> System.out.println("warmup: " + dummy);
>> }
>> void cycle(long num) {
>> final long start = System.nanoTime();
>> long val = loop(num);
>> long duration = System.nanoTime() - start;
>> System.out.format("loom : %-10.2f nanos/op, %30d\n", 1.0*duration/num, val);
>> }
>> public long loop(long num) {
>> long val = 0;
>> for (int ii=0; ii < num; ii++) {
>> ctu.run();
>> val = val ^ result;
>> }
>> return val;
>> }
>> public void execute() {
>> long x, y, s0=103, s1=17;
>> while (true) {
>> x = s0;
>> y = s1;
>> s0 = y;
>> x ^= (x << 23);
>> s1 = x ^ y ^ (x >> 17) ^ (y >> 26);
>> result = (s1 + y);
>> Continuation.yield(SCOPE);
>> }
>> }
>>
>> public static void main(String[] args) {
>>
>> long cycles = 200000;
>> int reps = 10;
>> if (args.length == 0) {
>> System.out.println("args: number of cycles, number of repeats");
>> System.out.format("\t no args provided using defaults: %d %d\n", cycles, reps);
>> }
>> try { cycles = Long.parseLong(args[0]); } catch (Exception ex) {}
>> try { reps = Integer.parseInt(args[1]); } catch (Exception ex) {}
>>
>> new Xorshift().warmup(cycles);
>> Xorshift xor = new Xorshift();
>>
>> for (int jj=0; jj < reps; jj++)
>> xor.cycle(cycles);
>> }
>> }
>>
>>
>>
>