From brent.christian at oracle.com  Tue Aug 15 23:53:14 2023
From: brent.christian at oracle.com (Brent Christian)
Date: Tue, 15 Aug 2023 16:53:14 -0700
Subject: Isolating setup of StackWalker test call stack in JMH?
Message-ID: <8c066010-4418-b505-b32f-9fe408a8e3c7@oracle.com>

Greetings, JMH users and enthusiasts.

The JDK microbenchmarks include benchmarks to measure the performance of
java.lang.StackWalker operations. A 'depth' @Param is used to test various
call stack depths. See:

https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/StackWalkBench.java

The TestStack nested class (L69) builds up 'n' method calls, and then
run()s the StackWalker method under test once n=depth calls are on the
call stack.

The issue is that building up the call stack is "setup" code, but it is
executed during the benchmark and influences the scoring. In particular,
it skews the StackWalker.getCallerClass() results, which should be the
same regardless of stack depth. E.g.:

Benchmark                      (depth)  Mode  Cnt     Score   Error  Units
StackWalkBench.getCallerClass        4  avgt   15   445.778 ± 8.920  ns/op
StackWalkBench.getCallerClass      100  avgt   15   496.285 ± 6.631  ns/op
StackWalkBench.getCallerClass     1000  avgt   15  1563.743 ± 7.601  ns/op

Because of the unique needs of setting up a StackWalker call at an
artificial stack depth, the setup doesn't seem like something that could
be moved into a @Setup method.

I looked through the JMH samples, and the only promising thing I found was
@AuxCounters. Maybe I could use @AuxCounters to separately report *just*
the duration spent in the StackWalker call? (Not ideal, but perhaps it's
all there is.)

So I gave that a try. The changes are here:

https://github.com/bchristi-git/jdk/compare/master...bchristi-git:jdk:swBench

However, the results are not what I expected:

Benchmark                                   (depth)  Mode  Cnt        Score        Error  Units
StackWalkBench.getCallerClass                     4  avgt   10      983.959 ±     10.895  ns/op
StackWalkBench.getCallerClass:benchNanoDur        4  avgt   10  1073151.585 ± 123551.526  ns/op
StackWalkBench.getCallerClass                   100  avgt   10     1063.878 ±     16.790  ns/op
StackWalkBench.getCallerClass:benchNanoDur      100  avgt   10  1085369.870 ±  87001.613  ns/op
StackWalkBench.getCallerClass                  1000  avgt   10     2143.571 ±     17.697  ns/op
StackWalkBench.getCallerClass:benchNanoDur     1000  avgt   10   868953.256 ± 237607.736  ns/op

The reported benchNanoDur score is much greater than I anticipated. I
don't think I have a full grasp of how AuxCounters are calculated or meant
to be used. I'd appreciate any hints on what I might be doing wrong. :)

(Or, perhaps, other ideas on how JMH could separate out the
call-stack-building setup for this benchmark, if such a thing is
possible.)

Much appreciated,
-Brent

From sergey.kuksenko at oracle.com  Wed Aug 16 04:52:00 2023
From: sergey.kuksenko at oracle.com (Sergey Kuksenko)
Date: Wed, 16 Aug 2023 04:52:00 +0000
Subject: Isolating setup of StackWalker test call stack in JMH?
In-Reply-To: <8c066010-4418-b505-b32f-9fe408a8e3c7@oracle.com>
References: <8c066010-4418-b505-b32f-9fe408a8e3c7@oracle.com>
Message-ID: 

Invoking nanoTime twice per operation, you will incur a large nanoTime
overhead:

https://shipilev.net/blog/2014/nanotrusting-nanotime/

I'm not sure that this overhead is smaller than the cost of building up
the call stack (at least to some degree).

By the way, have you looked at how Mode.SampleTime works? In that mode,
JMH measures the time of only SOME operations (not all of them); it is
too expensive to measure the time of every operation. (A rough sketch
follows below.)

May I ask you - why do you need to move stack construction into "setup"?
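
To illustrate the Mode.SampleTime idea, a sampling-mode variant of the
benchmark would only need the mode changed on the benchmark. This is a
minimal sketch, not the actual StackWalkBench code: it assumes the
existing 'depth' @Param and the TestStack helper from the benchmark, and
the class and method names are illustrative.

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @BenchmarkMode(Mode.SampleTime)          // time only a sample of the operations
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Thread)
    public class StackWalkSampleBench {

        @Param({"4", "100", "1000"})
        public int depth;

        private static final StackWalker WALKER =
            StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE);

        @Benchmark
        public void getCallerClassSampled(Blackhole bh) {
            // Same shape as the existing benchmark: build a call stack of
            // 'depth' frames, then invoke the StackWalker method under test.
            final boolean[] done = {false};
            new TestStack(depth, () -> {
                bh.consume(WALKER.getCallerClass());
                done[0] = true;
            }).start();
            if (!done[0]) {
                throw new RuntimeException("TestStack did not run");
            }
        }
    }

The per-operation cost of reading the clock stays out of most operations,
but the stack-building work is still inside the timed region.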
From brent.christian at oracle.com  Fri Aug 25 23:43:52 2023
From: brent.christian at oracle.com (Brent Christian)
Date: Fri, 25 Aug 2023 16:43:52 -0700
Subject: Isolating setup of StackWalker test call stack in JMH?
In-Reply-To: 
References: <8c066010-4418-b505-b32f-9fe408a8e3c7@oracle.com>
Message-ID: <02fbfdec-8fcb-3c1e-7103-477a3334dd1f@oracle.com>

Thanks for the response, Sergey.

On 8/15/23 9:52 PM, Sergey Kuksenko wrote:
> Invoking nanoTime twice per operation, you will incur a large nanoTime
> overhead:
> https://shipilev.net/blog/2014/nanotrusting-nanotime/

That's very interesting.

> May I ask you - why do you need to move stack construction into "setup"?

I originally wanted the StackWalkBench.getCallerClass() benchmark to
confirm that StackWalker.getCallerClass() performs the same regardless of
call stack depth.
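
In other words, the benchmark does something like the following before the
measured call. This is a simplified sketch of the idea only, not the
actual TestStack code; the class name is illustrative.

    // Recurse until the requested number of frames is on the stack,
    // then run the test action at that depth.
    final class CallStackBuilder {
        private final int targetDepth;
        private final Runnable action;

        CallStackBuilder(int targetDepth, Runnable action) {
            this.targetDepth = targetDepth;
            this.action = action;
        }

        void start() {
            descend(0);
        }

        private void descend(int currentDepth) {
            if (currentDepth >= targetDepth) {
                action.run();   // the StackWalker call happens here, 'targetDepth' frames deep
            } else {
                descend(currentDepth + 1);
            }
        }
    }

    // usage: new CallStackBuilder(depth, () -> { /* measured call */ }).start();

The key point is that the measured call runs only after the extra frames
are in place, so the frame-building cost lands inside the timed region.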
The benchmark first makes method calls to create a call stack of depth
`depth` (which I don't want to be part of the measurement) before calling
getCallerClass().

At this point, StackWalker is mature enough that we're confident that
getCallerClass() behaves and performs as expected regardless of call stack
depth. So in PR 15370 [1], the benchmark is simplified to just call
getCallerClass(), without adding to the call stack.

Thanks,
-Brent

1. https://github.com/openjdk/jdk/pull/15370

From sergey.kuksenko at oracle.com  Sat Aug 26 03:58:23 2023
From: sergey.kuksenko at oracle.com (Sergey Kuksenko)
Date: Sat, 26 Aug 2023 03:58:23 +0000
Subject: Isolating setup of StackWalker test call stack in JMH?
In-Reply-To: <02fbfdec-8fcb-3c1e-7103-477a3334dd1f@oracle.com>
References: <8c066010-4418-b505-b32f-9fe408a8e3c7@oracle.com>
 <02fbfdec-8fcb-3c1e-7103-477a3334dd1f@oracle.com>
Message-ID: 

You may write a simple baseline benchmark that just constructs the call
stack. Something like this:

    @Benchmark
    public void forEach_makeCallStack(Blackhole bh) {
        // Build 'depth' frames, then run an empty action - no StackWalker call.
        final boolean[] done = {false};
        new TestStack(depth, new Runnable() {
            public void run() {
                done[0] = true;
            }
        }).start();
        if (!done[0]) {
            throw new RuntimeException();
        }
    }

Then check the difference between this baseline and the corresponding
"StackWalker.getCallerClass()" benchmark.
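
For example, the measured side could have the same shape, with the
StackWalker call inside the Runnable - again only a sketch: 'walker'
stands for a StackWalker created with RETAIN_CLASS_REFERENCE, and the
method name is illustrative, not the existing benchmark code.

    @Benchmark
    public void getCallerClass_withCallStack(Blackhole bh) {
        // Same stack-building as the baseline, but the action now does the
        // StackWalker call under test. Score(this) - Score(baseline) should
        // approximate the cost of getCallerClass() alone at this depth.
        final boolean[] done = {false};
        new TestStack(depth, new Runnable() {
            public void run() {
                bh.consume(walker.getCallerClass());
                done[0] = true;
            }
        }).start();
        if (!done[0]) {
            throw new RuntimeException();
        }
    }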