How to write JMH microbenchmarks for asynchronous functions (tips and tricks).
Sergey Kuksenko
sergey.kuksenko at oracle.com
Tue May 9 01:14:01 UTC 2023
Now that Project Loom is reaching maturity, the question arises: how do we write reliable and simple microbenchmarks for virtual threads (or asynchronous frameworks) using JMH?
1)
Recently JMH got a new feature - the ability to execute benchmarks in virtual threads. It can be turned on by specifying the property "-Djmh.executor=VIRTUAL_TPE". All other threading options (-t <number of threads>, -tg <num threads in group>, etc.) remain the same.
The feature will be widely available when JMH 1.37 reaches Maven Central. Until then, you may download and build a local version of JMH.
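For illustration, here is a minimal sketch of enabling it through the JMH Java API; the launcher class and benchmark name are placeholders, and appending the property to the forked JVM's arguments is my assumption about how to make it visible to the fork:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class VirtualThreadLauncher {                         // hypothetical launcher class
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include("MyBenchmark")                      // placeholder benchmark name
                .jvmArgsAppend("-Djmh.executor=VIRTUAL_TPE") // run benchmark threads as virtual threads
                .build();
        new Runner(opt).run();
    }
}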
The feature covers only a subset of the desired scenarios - long-running virtual threads. In the JMH threading architecture, the benchmark-running thread (whether platform or virtual) exists for the whole iteration (essentially the whole benchmark run). The feature can't help us benchmark short-lived virtual threads or asynchronous frameworks, which are the natural competitors of Project Loom.
2)
An asynchronous function is a function that returns a Future/CompletableFuture or accepts an "onComplete" callback.
A practical evaluation of different benchmarking schemes for asynchronous functions was done. All results and conclusions are empirical and could change as hardware and software evolve.
The following schemes were evaluated:
2.1 Naive approach
@Benchmark
public void bench() throws Exception {
    Future<?> f = async_function(...);
    f.get();
}
2.2 Hammock or batch approach (this is how some async operations are benchmarked in the existing microbenchmark corpus).
@Benchmark
@OperationsPerInvocation(SIZE)
public void hammock() throws Exception {
    Future<?>[] futures = new Future[SIZE];
    for (int i = 0; i < SIZE; i++) {
        futures[i] = async_function(...);
    }
    for (Future<?> f : futures) {
        f.get();
    }
}
This approach has a theoretical flaw - unbalanced work. There are explicit phases of submitting, doing the actual work, and waiting for results.
An attempt was made to create a more stable scheme:
2.3 "OnFly" approach.
The key idea here is to track submitted and finished operations and to limit the submission rate so that a fixed number of operations is in flight at any time.
It requires a JMH modification, and different implementations were tested. A sketch of the idea is shown below.
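For illustration only, here is a minimal sketch of the rate-limiting idea outside of JMH; the Semaphore-based limiter, the MAX_IN_FLIGHT value, and the async_function stub are my assumptions, not the modified-JMH implementations that were actually tested:

import java.util.concurrent.Semaphore;

public class OnFlySketch {
    static final int MAX_IN_FLIGHT = 2048;                // assumed limit on concurrently processing operations
    static final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

    // Submit one operation; blocks when MAX_IN_FLIGHT operations are already in flight.
    static void submitOne() throws InterruptedException {
        inFlight.acquire();
        async_function(inFlight::release);                 // the permit is returned when the operation completes
    }

    // Stand-in for the benchmarked asynchronous operation that accepts an "onComplete" callback.
    static void async_function(Runnable onComplete) {
        onComplete.run();
    }
}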
Note:
There is another approach widely used in large benchmarks, where the submitting thread works with a fixed "injection rate." In this case, the result of the benchmark is a latency distribution.
This works perfectly for large benchmarks, but it has some "ideological" discrepancies with current JMH microbenchmarks: it is desirable to have a simple result metric that can be represented as a single number.
That is why this approach was not considered.
If someone has another idea of how to write microbenchmarks for asynchronous functions - please share it. It is worth trying.
Which metric should be used?
JMH has two main metrics: throughput and average time. Average time should be avoided. JMH computes the average time from throughput using simple arithmetic that depends on the number of working (benchmark) threads. Since we don't know the concurrency of the asynchronous operation, that may produce inconsistent, meaningless results.
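As a sketch (class and method names are placeholders), the throughput mode can be requested explicitly on the benchmark:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;

public class AsyncThroughputExample {                      // hypothetical class name
    @Benchmark
    @BenchmarkMode(Mode.Throughput)                        // report throughput; avoid Mode.AverageTime for async benchmarks
    public void bench() {
        // benchmark body elided
    }
}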
Two criteria were used for choosing the better approach:
1. Variance minimization. The key criterion. We want consistent and repeatable results.
2. Higher throughput. This leads to higher system utilization and gives the benchmarked operation a larger share of the total time, which simplifies analysis and regression tracking.
e.g., if a benchmark has 3% variance and the benchmarked operation accounts for 30% of the total time, then any regression/change smaller than 10% of the operation itself moves the total by less than 3% and is therefore invisible/undetectable in the benchmark.
Here is a rough benchmark "classification":
- nanobenchmarks - the time of the operation is expressed in nanoseconds
- microbenchmarks - ... in microseconds
- millibenchmarks - ... in milliseconds
- benchmarks - ... in seconds
Going forward:
- It doesn't matter how benchmarks for "benchmarks" are written. All approaches (including the naive one) are consistent and give the same results.
- There is no way to make nanobenchmarks (of asynchronous operations) precise. The infrastructure overhead is too high, and system utilization is too low. That doesn't mean such benchmarks are pointless; the fact simply has to be kept in mind, and nanobenchmark results require careful analysis.
Let's talk about micro and millibenchmarks. The best approach is the hammock.
1. Waiting for operations to complete.
The hammock benchmark has a "waiting" loop at the end:
for (Future<?> f : futures) {
    f.get();
}
Some futures are already completed when checked; others are not and cause the benchmark thread to block/park. The number of such parkings has a very high impact on the variance of micro- and millibenchmarks: two shorter parks give worse results than one long park. Ideally, we need only a single park until all operations are complete.
- The best way (minimum variance and overhead) is to use a CountDownLatch, which works when the asynchronous operation accepts an "onComplete" callback or returns a CompletableFuture. The benchmark looks like:
@Benchmark
public void hammock() throws InterruptedException {
    CountDownLatch latch = new CountDownLatch(SIZE);
    for (int i = 0; i < SIZE; i++) {
        async_function(..., () -> latch.countDown());
    }
    latch.await();
}
- If async_function returns a CompletableFuture/CompletionStage, it is possible to block only once, the following way:
CompletableFuture.allOf(futures).join();
The behavior is the same as with CountDownLatch; a fuller sketch is shown below.
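For illustration, a self-contained sketch of such a hammock body, assuming async_function returns a CompletableFuture; the class name and the async_function stub here are placeholders for the real benchmarked operation:

import java.util.concurrent.CompletableFuture;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.OperationsPerInvocation;

public class AllOfHammock {                                // hypothetical class name
    static final int SIZE = 64;

    @Benchmark
    @OperationsPerInvocation(SIZE)
    public void hammock() {
        CompletableFuture<?>[] futures = new CompletableFuture<?>[SIZE];
        for (int i = 0; i < SIZE; i++) {
            futures[i] = async_function();
        }
        CompletableFuture.allOf(futures).join();           // a single blocking call for the whole batch
    }

    // Stand-in for the benchmarked asynchronous operation.
    static CompletableFuture<Void> async_function() {
        return CompletableFuture.completedFuture(null);
    }
}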
- If the old Java Future is used, the "Future.get" operations should be invoked in reverse order:
@Benchmark
public void hammock() throws Exception {
    Future<?>[] futures = new Future[SIZE];
    for (int i = 0; i < SIZE; i++) {
        futures[i] = async_function(...);
    }
    for (int i = futures.length - 1; i >= 0; i--) {
        futures[i].get();
    }
}
There is a high probability that async operations complete in the same order as they were submitted. By blocking on the last-submitted future first, the benchmark thread parks at most once; by the time that future completes, the earlier ones are most likely already done, so the remaining get() calls return immediately. Blocking in reverse order therefore decreases the number of park operations and minimizes variance (verified in practice).
2. Number of benchmark threads.
When benchmarking async functions, the benchmark threads only submit operations and wait for their completion; the real work is done somewhere else.
As mentioned before, the hammock approach has a theoretical flaw - unbalanced work. To mitigate it, the number of benchmark threads should be 2 or more. Please never execute asynchronous benchmarks with a single benchmark thread.
This can be set with the JMH option "-t <num of threads>" or the "@Threads" annotation.
Empirical results show that the best number of submitting threads (especially for microbenchmarks) depends on the hardware: "<number of hardware threads>/2".
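As an illustrative sketch (the benchmark name and launcher class are placeholders), that recommendation can be applied programmatically with the JMH runner API:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class AsyncBenchmarkLauncher {                      // hypothetical launcher class
    public static void main(String[] args) throws Exception {
        int threads = Math.max(2, Runtime.getRuntime().availableProcessors() / 2); // half of the hardware threads, never below 2
        Options opt = new OptionsBuilder()
                .include("MyAsyncBenchmark")               // placeholder benchmark name
                .threads(threads)
                .build();
        new Runner(opt).run();
    }
}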
3. Concurrency.
Let's consider "<number of benchmark/submitting threads>*<batch SIZE>" as the benchmark concurrency.
Experiments have shown that benchmark performance strictly depends on this concurrency: different numbers of benchmark threads and batch sizes with the same concurrency give the same result.
e.g., 32 submitting threads with SIZE=64 gives a similar result to 16 submitting threads with SIZE=128.
Empirical results show that the best concurrency value is "32*<number of hardware threads>". This value works well for operations of up to 10 microseconds; for smaller operations, it makes sense to increase the concurrency.
When "<number of hardware threads>/2" benchmark threads are used, the optimal SIZE value is 64.
For microbenchmarks and millibenchmarks, a larger SIZE may also be used. However, when a single operation takes longer than 100 milliseconds, a SIZE larger than 100 causes high variance.
Thank you. Comments are welcome.
Sergey Kuksenko