Profiling interpreter vs. compiled time + proof of concept

Christian Thalinger christian.thalinger at oracle.com
Tue Apr 19 18:39:57 UTC 2016


This is a nice project, but right now I don’t see the value in it.  If a method is hot it will get JIT compiled anyway, so there is no real value in knowing how much time was spent interpreting it.  But maybe I am missing something.

One data point I’m curious about is whether -Xprof, which uses sampling to figure out how much time is spent in interpreted and compiled code, gives about the same results as your precise approach.

> On Apr 10, 2016, at 11:02 AM, Adrian <withoutpointk at gmail.com> wrote:
> 
> Hello,
> 
> I am a student at the University of Toronto, working on a research
> project studying JVM performance with a focus on distributed systems.
> As part of the study, we want to know the time the JVM spends
> interpreting code, and the time it spends in compiled (JNI/native or
> jitted) code.
> For example, for a program repeating a task, you can see the
> interpreter time decrease with each iteration.
> Separate timers for interpreted and compiled code give a
> fine-grained view of the jit's performance (warmup time +
> improvement over the interpreter).
> 
> Interpreted -> native and native -> interpreted calls are relatively
> easy to track and time, as there are clear boundaries.
> However, it’s tricky to track transitions between interpreted and
> jitted code, especially with complications such as OSR and
> deoptimization.
> 
> We have an implementation for the amd64 linux build, which is described below.
> Before we solved all the issues, we often encountered crashes.
> Of course, we are not JVM experts and may be missing other edge cases.
> Regardless, it has been working on our workloads, such as reading
> from HDFS or running queries on Spark.
> As these are fairly complex workloads, we are confident it is
> successful at least as a proof of concept.
> 
> We have found this instrumentation to have negligible overhead (less
> than the inherent variance between multiple runs).
> Furthermore, because most of the instrumentation lives in code blobs
> generated at runtime by the
> Assembler/MacroAssembler/InterpreterMacroAssembler, it is essentially
> non-existent when instrumentation is disabled.
> 
> Our modifications will be open sourced, but we wanted to know if
> OpenJDK developers would be interested in adding something like this
> in the future.
> I've attached a patch which should be up to date with the latest revision.
> I've also uploaded the patches and a prebuilt binary (linux amd64):
> http://research.dfr.moe/jvm_profiling/
> 
> You enable profiling with the flag -XX:+ProfileIntComp, and get the
> time in nanoseconds inside a Java program from a java.lang.Thread
> object:
> Thread th = Thread.currentThread();
> th.resetIntCompTimes();
> long a = th.getIntTime();
> long b = th.getCompTime();
> 
> The times include a bit of JVM initialization (we have not found a
> perfect solution for this), so for short-running programs you should
> call `resetIntCompTimes` at the start of `main`.
> 
> Regarding the code, it is currently quite hacky because this is my and
> my team's first time really modifying the JVM.
> I was not familiar with the code structure, and tried to keep most of
> my additions in 1 file (sharedRuntime.cpp) instead of worrying about
> modularizing stuff.
> For example, thread local data is added to the JavaThread class, and
> the data is exposed in the java.lang.Thread class.
> If OpenJDK developers are interested in this, we are happy to continue
> working on this to make it consistent with the proper JVM structure
> (would need feedback/pointers).
> 
> We are using it for our research and fixed many problems we
> encountered, but if anyone encounters anything or knows of any
> potential problems, any feedback would be greatly appreciated.
> 
> Below is a high level description of the implementation (specific to
> linux 64 bit).
> ---
> The goal is to track transitions between interpreted, native, and
> jitted (standard + OSR compilation) code so we can precisely measure
> the time the JVM spends in each type (currently, native and jitted
> code are lumped together as "compiled" code, but it would be trivial
> to separate them). The main challenge is jitted code.
> 
> Interpreted calls pass all arguments on the stack, whereas the
> compiled calling convention uses 6 registers for integer/pointer
> arguments.
> Adapters, generated in `sharedRuntime_x86_64.cpp`, bridge the two
> conventions; methods with the same signature share adapters, which
> move arguments from the stack into registers (i2c) or from registers
> onto the stack (c2i).
> It is easy to add instrumentation to these adapters to track a "call
> transition".
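> 
> As a rough standalone sketch of the bookkeeping a call transition
> does (the names and types below are only illustrative, not the actual
> code in the patch), each thread keeps two accumulators and a current
> mode, and a transition closes out the time spent in the old mode:
> // Standalone C++ sketch of per-thread int/comp time accounting.
> #include <cstdint>
> #include <ctime>
> 
> enum class ExecMode { Interpreted, Compiled };
> 
> static inline int64_t now_ns() {
>   timespec ts;
>   clock_gettime(CLOCK_MONOTONIC, &ts);
>   return int64_t(ts.tv_sec) * 1000000000 + ts.tv_nsec;
> }
> 
> struct ThreadProfile {
>   int64_t  int_time_ns  = 0;            // exposed as getIntTime()
>   int64_t  comp_time_ns = 0;            // exposed as getCompTime()
>   ExecMode mode         = ExecMode::Interpreted;
>   int64_t  mode_start   = now_ns();
> 
>   // Called on a call/return transition: close out the old mode,
>   // then start timing the new one.
>   void transition(ExecMode new_mode) {
>     int64_t t = now_ns();
>     (mode == ExecMode::Interpreted ? int_time_ns : comp_time_ns)
>         += t - mode_start;
>     mode = new_mode;
>     mode_start = t;
>   }
> };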
> 
> When the callee returns, we need a "return transition" back to the
> caller's state.
> However, there is no existing place we can do this, as callees merely
> pop their stack frame and jump to the return address.
> Our solution is to save the "real return address" in the adapter, and
> replace the return address for the new frame with a "return handler"
> address (one for i2c, one for c2i).
> We also save the location of the return address (where on the
> stack/address in memory), for reasons explained later.
> This data is saved in the JavaThread object.
> We require a stack of return addresses as we could go i2c -> c2i ->
> i2c and so on.
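> 
> A minimal standalone sketch of this per-thread bookkeeping (the names
> are made up for illustration; the real data lives in the JavaThread
> object as described above):
> // Standalone sketch: per-thread stack of patched return addresses.
> #include <cstdint>
> #include <vector>
> 
> struct SavedReturn {
>   uintptr_t  real_return_pc;  // the caller's original return address
>   uintptr_t* location;        // stack slot now holding a handler address
>   bool       is_i2c;          // which return handler was installed
> };
> 
> struct ReturnStack {
>   std::vector<SavedReturn> entries;  // nests for i2c -> c2i -> i2c ...
> 
>   // In the adapter: remember the real return address and its slot,
>   // then overwrite the slot so the callee returns into our handler.
>   void push_patch(uintptr_t* slot, uintptr_t handler_pc, bool is_i2c) {
>     entries.push_back(SavedReturn{*slot, slot, is_i2c});
>     *slot = handler_pc;
>   }
> };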
> 
> Because the last thing a compiled function does is pop the return
> address and jump to it, for i2c we end up in the "i2c return handler"
> and the stack pointer is 1 word above where the return address was.
> As a sanity check, we can verify it matches the "expected location"
> (what we saved earlier).
> We track the transition then jump to the "real/original return
> address" which we have also saved.
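> 
> Roughly, the i2c return handler does the following (shown as
> standalone C++ for illustration only; in the patch this is generated
> assembly):
> // Standalone sketch of the i2c return handler's logic.
> #include <cassert>
> #include <cstdint>
> #include <vector>
> 
> struct SavedReturn { uintptr_t real_return_pc; uintptr_t* location; };
> 
> // 'sp' is the stack pointer on entry to the handler: the callee has
> // already popped the return address, so sp is one word above the slot.
> uintptr_t i2c_return_handler(std::vector<SavedReturn>& saved,
>                              uintptr_t* sp) {
>   SavedReturn top = saved.back();
>   saved.pop_back();
>   assert(sp == top.location + 1);  // matches the expected location
>   // ...record the transition back to the caller's state here...
>   return top.real_return_pc;       // the generated code jumps here
> }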
> 
> When we first did this, we encountered many crashes.
> Many JVM operations rely on the return address to identify callers, such as:
> - figuring out the caller when a function is deoptimized
> (SharedRuntime::handle_wrong_method)
> - getting a call trace (java_lang_Throwable::fill_in_stack_trace)
> - finding “activations” of compiled functions on the stack
> (NMethodSweeper::mark_active_nmethods)
> - checking permissions for AccessController (JVM_GetStackAccessControlContext)
> 
> After identifying these operations which require examining the
> stack, we can "undo" all our changes to return addresses before any
> of them run.
> We could use the `frame` code to walk the stack, but we also saved the
> return-address-locations earlier; we can replace each return address
> (which should now be an i2c/c2i return handler address) with the
> real/original return address.
> When we do this, there is no evidence our instrumentation took place.
> We redo all our changes after these operations are done so that when
> the thread continues executing, we'll continue tracking transitions.
> A lot of this happens at safepoints, e.g. marking active nmethods;
> unpatching/repatching at the start and end of a safepoint covers many
> of these scenarios at once.
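> 
> A simplified standalone sketch of the unpatch/repatch step (again,
> illustrative names only, not the actual patch):
> // Standalone sketch of unpatching/repatching the saved locations.
> #include <cstdint>
> #include <vector>
> 
> struct SavedReturn {
>   uintptr_t  real_return_pc;
>   uintptr_t* location;
>   uintptr_t  handler_pc;   // i2c or c2i return handler address
> };
> 
> // Before a stack-walking operation (or at the start of a safepoint):
> // put every real return address back so the stack looks untouched.
> void unpatch_all(std::vector<SavedReturn>& saved) {
>   for (SavedReturn& e : saved) *e.location = e.real_return_pc;
> }
> 
> // Afterwards (or at the end of the safepoint): reinstall the handlers
> // so transitions keep being tracked when the thread resumes.
> void repatch_all(std::vector<SavedReturn>& saved) {
>   for (SavedReturn& e : saved) *e.location = e.handler_pc;
> }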
> 
> Java exceptions also need to be handled.
> If a function does not catch an exception, it has "finished
> executing"; we will pop the stack frame and see if the caller has a
> handler.
> It's necessary to check for a return handler address to track a
> transition back to the caller's state (interpreted/native/compiled),
> and also to identify the caller using the real/original return
> address.
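> 
> Conceptually, the unwinding code does something like this (standalone
> sketch with illustrative names, not the actual patch):
> // Standalone sketch of the check while unwinding for an exception.
> #include <cstdint>
> #include <vector>
> 
> struct SavedReturn { uintptr_t real_return_pc; uintptr_t* location; };
> 
> // When popping a frame that did not catch the exception, the return
> // address we find may be one of our handler addresses; map it back
> // to the real caller before searching for an exception handler.
> uintptr_t caller_pc_for_unwind(std::vector<SavedReturn>& saved,
>                                uintptr_t found_pc,
>                                uintptr_t i2c_handler_pc,
>                                uintptr_t c2i_handler_pc) {
>   if (found_pc == i2c_handler_pc || found_pc == c2i_handler_pc) {
>     SavedReturn top = saved.back();
>     saved.pop_back();
>     // ...record the transition back to the caller's state here...
>     return top.real_return_pc;
>   }
>   return found_pc;
> }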
> 
> Lastly, deoptimization also requires careful treatment.
> The jit compiler does not necessarily compile all paths for a function.
> If it hits an unexpected path, it can end up in a “deopt blob” (for C1
> compiled functions) or “uncommon trap blob” (for C2 compiled
> functions) - these are generated in `sharedRuntime_x86_64.cpp`.
> The two routines are the same to us - they replace the compiled frame
> with an interpreted one, and continue in the interpreter.
> 
> If the deoptee’s caller was interpreted, we must have had an i2c
> transition (we should see the i2c return handler’s address on the
> stack).
> Since we’re executing the rest of the function as interpreted, it’s
> the same as if the callee returned to the interpreted caller.
> We need to replace the return address on the stack with the
> original/real one, and track a transition back to interpreted code -
> we simulate the "i2c return transition".
> 
> If the deoptee’s caller was compiled, there was no transition earlier
> (it was c2c).
> However, we’re continuing execution as interpreted, and will
> eventually return to the compiled caller.
> We simulate a c2i call; we transition to interpreted code (as with an
> interpreted caller), but also save the current/real return address and
> replace the one on the stack with the c2i return handler's address.
> When the callee returns, we will track the transition back to compiled code.
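> 
> The two deoptimization cases can be sketched roughly like this
> (standalone C++ with illustrative names; the real handling is in the
> deopt/uncommon trap paths):
> // Standalone sketch of the two deoptimization cases.
> #include <cstdint>
> #include <vector>
> 
> struct SavedReturn { uintptr_t real_return_pc; uintptr_t* location; };
> 
> // 'ret_slot' is the deoptee's return-address slot on the stack.
> void on_deopt(std::vector<SavedReturn>& saved, uintptr_t* ret_slot,
>               uintptr_t i2c_handler_pc, uintptr_t c2i_handler_pc) {
>   if (*ret_slot == i2c_handler_pc) {
>     // Caller was interpreted: simulate the i2c return transition now
>     // and restore the original return address on the stack.
>     SavedReturn top = saved.back();
>     saved.pop_back();
>     *ret_slot = top.real_return_pc;
>     // ...record a transition to interpreted code...
>   } else {
>     // Caller was compiled (the call was c2c): simulate a c2i call so
>     // the eventual return to the compiled caller is tracked.
>     saved.push_back(SavedReturn{*ret_slot, ret_slot});
>     *ret_slot = c2i_handler_pc;
>     // ...record a transition to interpreted code...
>   }
> }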
> 
> C2i return is actually more complicated if we want to do sanity checks.
> As explained above, we can check that the stack pointer is a sane
> value matching our recorded location for the return address.
> However, interpreter frames have both a base pointer (rbp) pointing to
> the "rsp on callee entry", and a "sender sp" (usually r13) for the
> "rsp on caller exit".
> A simple example of why this is required is the c2i adapter itself.
> Because it moves register arguments onto the stack, it needs to
> allocate space on the stack.
> Sender sp/r13 points to the stack pointer at the start of the adapter
> before doing anything.
> Rbp points to the stack pointer on interpreted method entry, after the
> adapter has done the shuffling.
> The return address is 1 word above rbp.
> 
> When an interpreted method returns, it does something like:
> # restore callee saved registers
> mov rsp <- rbp (reset frame)
> pop rbp (restore base pointer)
> pop r11 (return address into some temporary register - r11 is caller saved)
> mov rsp <- r13 (restore stack pointer to sender sp - r13 is callee
> saved and would have been restored above)
> jmp r11 (jump to return address)
> 
> Therefore, by the time you end up in the c2i return handler, you no
> longer know where the return address was.
> To verify the return address location, we had to find all the places
> where the interpreter implements a method return (there are various
> cases) and manually check the return address right after it is popped
> from the stack, before rsp is set to the sender sp.
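> 
> Conceptually, the check at each interpreter return site looks roughly
> like this (standalone sketch for illustration only; in the patch it
> is done in the generated return code):
> // Standalone sketch of the check at an interpreter return site.
> #include <cassert>
> #include <cstdint>
> #include <vector>
> 
> struct SavedReturn { uintptr_t real_return_pc; uintptr_t* location; };
> 
> // 'ret_slot' is the slot the return address was just popped from
> // (one word above rbp), checked before rsp is reset to the sender sp.
> void check_c2i_return_location(const std::vector<SavedReturn>& saved,
>                                uintptr_t popped_pc, uintptr_t* ret_slot,
>                                uintptr_t c2i_handler_pc) {
>   if (popped_pc == c2i_handler_pc) {
>     // About to enter the c2i return handler: the slot must be the
>     // one recorded when the handler address was installed.
>     assert(!saved.empty() && saved.back().location == ret_slot);
>   }
> }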
> 
> ---
> I found the JVMTI code that posts events for every interpreted and
> native method entry.
> As a sanity check, enabled by the flag -XX:+ProfileIntCompStrict, we
> check on every interpreted method entry that the recorded state is
> actually "interpreted".
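> 
> Conceptually, the strict check is something like this (standalone
> sketch, illustrative names only):
> // Standalone sketch of the -XX:+ProfileIntCompStrict check.
> #include <cstdio>
> #include <cstdlib>
> 
> enum class ExecMode { Interpreted, Compiled };
> 
> // On every interpreted method entry (when strict checking is on):
> // if the recorded mode is not interpreted, we missed a transition.
> void check_interpreted_entry(ExecMode recorded, const char* method) {
>   if (recorded != ExecMode::Interpreted) {
>     std::fprintf(stderr, "missed transition before %s\n", method);
>     std::abort();
>   }
> }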
> 
> There are some Java functions the JVM calls into directly (a lot of
> them related to classloading, e.g.
> `ClassLoader#checkPackageAccess`).
> These n2i transitions are manually tracked.
> We found these locations with the sanity check above.
> 
> We do not see this check failing in our workloads currently.
> 
> ---
> Hopefully this high level overview makes sense.
> Comments in the code give more details regarding specific scenarios,
> such as handling OSR (which is conceptually similar).
> I’d be happy to answer any questions or explain anything.
> 
> Any feedback is appreciated.
> Thank you for your time!
> 
> Best regards,
> Adrian


