loop customization: a key challenge

John Rose john.r.rose at oracle.com
Mon Sep 10 10:28:00 PDT 2012


Aleksey's note about chain11 vs. chain12 jogged my memory about something that has been worrying me since 2005.

There is an important optimization challenge coming up when lambdas are combined (as they are intended to be combined) with bulk data processing libraries.  I call this optimization challenge "loop customization".

(Can anyone propose a previously coined, commonly used term instead of loop customization?  Please read on…)

Most loop optimizations take a source code loop as a compiler work unit and optimize it for general use.  Internally, the loop may be replicated and specialized, and the various specialized versions composed to implement the full contract of the original loop code.  For example, iteration range splitting is used by the HotSpot server compiler to divide a loop into a pre-loop, a main loop, and a post-loop, each with its own job to do.  The main loop runs at full speed and is unrolled and maybe vectorized.  The other loop versions perform the array index checking so that the main loop can omit it.
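
(To make that concrete, here is a hand-written Java sketch of the shape that range splitting produces; the names are mine, and of course the real compiler performs this transformation on its IR, not on source code.)

  static void scale(int[] src, int[] dst, int k) {
    int n = Math.min(src.length, dst.length);
    int i = 0;
    // pre-loop: step singly until i reaches an alignment boundary
    for (; i < n && (i & 3) != 0; i++)
      dst[i] = src[i] * k;
    // main loop: unrolled 4x; i+3 < n is provable here, so the
    // per-element range checks can be dropped and the body vectorized
    for (; i + 4 <= n; i += 4) {
      dst[i]     = src[i]     * k;
      dst[i + 1] = src[i + 1] * k;
      dst[i + 2] = src[i + 2] * k;
      dst[i + 3] = src[i + 3] * k;
    }
    // post-loop: the remaining 0-3 elements, checked singly
    for (; i < n; i++)
      dst[i] = src[i] * k;
  }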

But with closures and iteration frameworks, loops need to be sensitive to their callers; a compiler may need to create loop versions adapted to individual callers.  A "customizable loop" is one that it may be profitable to customize differently for distinct callers.  More specifically, the inner actions of a customizable loop may vary greatly from caller to caller.  Here is pseudocode for a customizable loop that performs a "for-each" traversal of some data set:

  L(input : Array(T), kernel : Function(T -> U), output : Array(U)) {
    for (element : T in input)  output.collect(kernel(element));
  }
  A() { L(data1, X, result1); }
  B() { L(data2, Y, result2); }
  C() { L(data3, Z, result3); }
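
(The same thing in concrete Java, as a minimal sketch; I am assuming java.util.function.Function as the kernel type, and the kernels X,Y,Z are written as inline lambdas.)

  import java.util.ArrayList;
  import java.util.List;
  import java.util.function.Function;

  class CustomizableLoop {
    // L: the reusable loop; only the kernel varies from caller to caller
    static <T, U> void L(List<T> input, Function<T, U> kernel,
                         List<U> output) {
      for (T element : input)
        output.add(kernel.apply(element));
    }
    // three callers, each supplying its own kernel (X, Y, Z)
    static List<Integer> A(List<Integer> data1) {
      List<Integer> result1 = new ArrayList<>();
      L(data1, x -> x + 1, result1);                 // kernel X
      return result1;
    }
    static List<String> B(List<Integer> data2) {
      List<String> result2 = new ArrayList<>();
      L(data2, y -> Integer.toString(y), result2);   // kernel Y
      return result2;
    }
    static List<Integer> C(List<Integer> data3) {
      List<Integer> result3 = new ArrayList<>();
      L(data3, z -> z * z, result3);                 // kernel Z
      return result3;
    }
  }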

In terms of call graphs, a customizable loop looks like a pinch point L:

A -> L -> X
B -> L -> Y
C -> L -> Z
…

L will have some code shape that makes us want to compile and optimize L distinctly for each callee (X,Y,Z).  Usually, L will call X (or Y or Z) many times.  We will want to apply loop optimizations such as invariant hoisting, iteration range splitting, unrolling, and vectorization.

In the typical case, L is a reusable loop superstructure, such as ArrayList::filter.  (Or some lazy-buffered-blocked-array-stream::filter.)  Each caller (A,B,C) is using a collection utility (or more generally a whole pipeline of them).  Each caller (A,B,C) supplies a distinct loop kernel (X,Y,Z) potentially unique to that caller.

After node merging (in the method/bytecode representation of the program), our attempts to profile L are spoiled:

{A,B,C} -> L -> {X,Y,Z}

It was wonderful that L was reusable software, but now we wish L had been copied once per call site.  As it is, the call from L (to the loop kernel) is badly megamorphic.  So, unless L is completely inlined, the calls to X,Y,Z are not inlined properly.  (Another hazard is that the type information identifying X must get propagated through the customized copy of L in A, etc.)

(The parallel with Aleksey's chain11 is that A->L->X inlines like D1->D1->E.  Here, the middle instance of D1 is contextually monomorphic but appears polymorphic in the profile.)

To avoid polymorphic (or bimorphic) calls, each separate call to L must get inlined, with a fully customized copy of the loop L integrated with each kernel (X,Y,Z).  We try to make this happen, and will probably succeed as long as L is simple enough in structure.  Brian Goetz has proposed a closure-savvy generalization of AtomicIntegerFieldUpdater.getAndIncrement as a canonical test case.  There's a very tiny CAS loop inside the combinator.
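
(For concreteness, a sketch of the shape such a combinator might take, assuming a java.util.function.IntUnaryOperator-shaped interface for the kernel; the names here are mine, not Brian's actual proposal.)

  import java.util.concurrent.atomic.AtomicInteger;
  import java.util.function.IntUnaryOperator;

  class CasCombinator {
    // the tiny CAS loop is L; the update closure is the kernel
    static int getAndUpdate(AtomicInteger cell, IntUnaryOperator update) {
      int prev, next;
      do {
        prev = cell.get();
        next = update.applyAsInt(prev);   // kernel call inside the loop
      } while (!cell.compareAndSet(prev, next));
      return prev;
    }
    // getAndIncrement is this combinator applied to one trivial kernel
    static int getAndIncrement(AtomicInteger cell) {
      return getAndUpdate(cell, x -> x + 1);
    }
  }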

Because pipelines can contain repeated calls to the same collection transform, there may also be apparent recursions from X to one of {A,B,C}, etc.  This is a further parallel with chain11 (D1/D2/E), since the recursion heuristics may kick in and further obscure the static pipeline structure.

The biggest challenge with loop customization happens when L is complex.  We will soon enough confront the case where L is not just a for-loop but in fact is a large slice of a fork-join algorithm, concerned with the problem of distributing the loop across multiple CPUs with shared memory.  The parts of L that require customization will be the loops L1 over sections of the partitioned input data, and they will be called on multiple threads, far away (in the compiler's view) from the entry point L0 of the fork-join library.

{A,B,C} -> L0 -> …(thread stuff)… -> L1 -> {X,Y,Z}

Inlining will not even get close to customizing those inner loops L1.  In essence, the logically simple loop L has been split into a nested loop L0/L1, where the outer loop L0 might be very complex (a task scheduler) and difficult to analyze statically.

(We may assume that L1 can be kept simple, a visit to a slice of an input array or two.  If that is the case, it is here that we want to concentrate our code optimization work.)

What is needed here (in my opinion) is a way for L0 to participate, explicitly, under expert programmer control, in the instantiation or customization of L1 for the various X,Y,Z.  We want the benefits of C++ templates (copy the outer loop and inline like you mean it) without the deficits (premature optimization due to exhaustive inlining, followed by I-cache heat death).

Luckily, closures (and/or method handles) will give us a natural notation for such selective inlining.

In the "old days" (before closures), the only workable option for L0 to combine L1 with X,Y,Z was to call L1 with X,Y,Z as a parameter, triggering inlining and constant folding.  In other words, code customization is triggered only at call sites (with the help of inlining and profiling).

With closures, there is another option for triggering customization, and that is partial application. [1]  L0 should create a customized inner loop L1+X, not by eventually calling X from L1 inside each worker thread, but by partially applying L1 to X at top level, before splitting the workload.  Then the call chain looks like this:

{A,B,C} -> L0 -> {L1_X = L1+X, L1_Y = L1+Y, …} -> …(thread stuff)… -> L2 -> {L1_X, L1_Y, L1_Z}

In pseudocode:
  L0(input : Array(T), kernel : Function(T -> U), output : Array(U)) {
    L1_k = L1.bindTo(kernel);
    distributeAcrossWorkersCallingL2(input, L1_k, output);
  }
  L1(kernel : Function(T -> U),
        input : SubArray(T), output : SubArray(U)) {
    for (element : T in input)  output.collect(kernel(element));
  }
  L2(input : SubArray(T),
        loop_plus_kernel : Function((SubArray(T), SubArray(U)) -> ()),
        output : SubArray(U)) {
    loop_plus_kernel(input, output);
  }
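
Here is a hedged sketch of the key bindTo step, written with JSR 292 method handles (the class and method names below are placeholders of mine, and SubArray is modeled as a plain Object[]):

  import java.lang.invoke.MethodHandle;
  import java.lang.invoke.MethodHandles;
  import java.lang.invoke.MethodType;
  import java.util.function.Function;

  class Loops {
    // L1: the inner loop, with the kernel first so that bindTo can fix it
    static void L1(Function<Object, Object> kernel,
                   Object[] input, Object[] output) {
      for (int i = 0; i < input.length; i++)
        output[i] = kernel.apply(input[i]);
    }

    static final MethodHandle L1_MH;
    static {
      try {
        L1_MH = MethodHandles.lookup().findStatic(Loops.class, "L1",
            MethodType.methodType(void.class, Function.class,
                                  Object[].class, Object[].class));
      } catch (ReflectiveOperationException e) {
        throw new ExceptionInInitializerError(e);
      }
    }

    // L0 calls this once per kernel, before forking workers; the result
    // has type (Object[],Object[])->void, with the kernel folded in as
    // a bound constant that the JIT can see through
    static MethodHandle customize(Function<Object, Object> kernel) {
      return L1_MH.bindTo(kernel);
    }
  }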

L0, L1, L2 are private components of a library which presents itself as a simple bulk data operator L.

Given this program structure (which requires some conspiracy with library implementors), the JVM can inline the first two levels of calls, and note the creation of the customized loop.  The customized loop can then be distributed to worker threads and executed at full (customized) performance.  Instances of a customized loop, although not explicit in the source code, would look something like this:

  L1_X(input : SubArray(T), output : SubArray(U)) {
    for (element : T in input)  output.collect(X(element));
  } //  generated by L1.bindTo(X)
  L1_Y(...) { for (…)  ...(Y(...)); } //  generated by L1.bindTo(Y)
  L1_Z(...) { for (…)  …(Z(...)); } //  generated by L1.bindTo(Z)
  ...

This will require just-in-time generation of L1_X, L1_Y, etc., as they are successively created by L0.  (Global static analysis might work, if the program is well behaved enough.  But it is better to aim at the on-line solution first.)

There are two major variations to point out here.  In the simple variation, A,B,C are in 1-1 correspondence with X,Y,Z.  Customizing the various L1_X could be driven by inlining decisions at A,B,C.  In the second variation, a very simple loop kernel X might be used many times, by many A1, A2, A3, ….  (This might happen with List::sum; the loop kernel is a closure that adds two numbers together.)  Customizing L1_X may profit from a dynamic registry that somehow recognizes and reuses L1_X from many places (A1,A2,A3,…).
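
(A minimal sketch of such a registry, reusing the customize() helper from the method-handle sketch above; keying on kernel identity handles the easy case, and recognizing distinct-but-equivalent kernel instances arriving from A1,A2,A3 is exactly the hard part this glosses over.)

  import java.lang.invoke.MethodHandle;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.function.Function;

  class LoopRegistry {
    private static final
        ConcurrentHashMap<Function<Object, Object>, MethodHandle>
        CACHE = new ConcurrentHashMap<>();

    // one customized loop per distinct kernel, shared by all callers
    static MethodHandle loopFor(Function<Object, Object> kernel) {
      return CACHE.computeIfAbsent(kernel, Loops::customize);
    }
  }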

— John

P.S.  The method-handle (JSR 292) way of expressing partial application is MH::bindTo.  With closures, a little more pattern matching is required, but in essence you look for the programmer wrapping a call to a non-constant but hoistable SAM object X inside a closure L1.  Profiling is required in all cases to concentrate optimization effort where it counts.

[1] http://en.wikipedia.org/wiki/Partial_application

