JNI faster than Unsafe?

Sun Feb 8 10:12:17 UTC 2015

Hi Robert.

On 02/07/2015 12:46 PM, Robert Stupp wrote:
> I think I found the reason why Unsafe.allocateMemory is slower in
> os::malloc : inc_stat_counter

I think inc_stat_counter has nothing to do with it: it is guarded by
NOT_PRODUCT(...), which means it is stripped from the product VM. C/C++
is funny like that: you can hide an elephant behind a macro, see below.

>> On 07.02.2015 at 10:02, Robert Stupp <snazy at snazy.de> wrote:
>> I've setup a single microbenchmark just to know which off-heap
>> memory allocation is faster on Linux + OSX. Candidates were
>> Unsafe.allocateMemory and JNA's Native.malloc(). Both either
>> against "raw" OS and against a preloaded jemalloc (LD_PRELOAD).
>> 
>> My initial suspicion was, that both (Unsafe vs. Native) should
>> perform quite equally since both methods are basically just
>> wrappers for C malloc(). But they don't.

There is a difference, though: Unsafe goes through JNI into the VM's
AllocateMemory, while JNA goes to malloc itself. Although JNA should
also use a simple JNI stub to transition from Java to native, the minute
details might differ enough.

>> Native.malloc() is much faster compared to
>> Unsafe.allocateMemory(). Depending on the individual
>> microbenchmark, Native.malloc() is up to 3 times faster than
>> Unsafe.allocateMemory().

Let me outline how you can disentangle the reasons for the performance
difference. It seems a better time investment to dive into the internals
than to run multiple experiments on multiple platforms: we get straight
to the point, instead of indulging in the modern version of
tasseography.

The flow I followed is brain-dumped here:
  http://cr.openjdk.java.net/~shade/scratch/unsafe-allocate.txt

I'll copy the conclusion here:

1. Unsafe.allocateMemory wastes 20 seconds before calling into
Unsafe_AllocateMemory. If you look into the disassembly for the
generated code, you will see the preparations for the native call.

2. Unsafe_AllocateMemory wastes 44 seconds before calling into
os::malloc, HandleMarkCleaner and others. If you look into the
disassembly for this stub, you will see a significant amount of time
spent on the actual JNI transition. In the source code, this is hidden
behind the UNSAFE_ENTRY macro in unsafe.cpp:
  UNSAFE_ENTRY(jlong, Unsafe_AllocateMemory(JNIEnv *env, jobject unsafe,
                                            jlong size))

3. os::malloc(unsigned long,MemoryType) wastes 17 seconds before calling
the overloaded version of itself. The disassembly seems to show the
inlined body of the CALLER_PC macro, which does a few "MemTracker"
checks.

4. os::malloc(unsigned long,MemoryType,const NativeCallStack&) wastes
another 20 seconds before finally reaching malloc. The disassembly seems
to show the inlined body of MallocTracker::record_malloc before the call
into the actual glibc malloc.

5. HandleMarkCleaner and thread_from_jni_environment waste another 23
seconds for themselves. Their disassemblies are trivial, and it is not
obvious whether we can optimize them.

BOTTOM LINE: The overheads of Unsafe.allocateMemory seem to lie in
handling the actual JNI transition, doing the VM housekeeping, and
paying the dues for NMT support. If there is a version that can avoid
these costs, it would experience a performance boost.
Back-of-the-envelope calculation: saving (20+44+17+20+23)=124 seconds
out of 221 seconds for allocateMemory itself brings a speedup of
221/(221-124) = 2.28x.
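
For anyone who wants to redo this arithmetic on their own profile, here
is a minimal sketch in Java. The per-layer seconds and the 221-second
total are the figures from items 1-5 and the calculation above. The
class name, method names and layer labels are hypothetical and exist
only for illustration; nothing here is part of the JDK or HotSpot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any JDK or HotSpot API.
class OverheadEstimate {
    // Profile seconds attributed to each layer, copied from items 1-5.
    // The labels are informal descriptions, chosen for readability.
    static Map<String, Integer> overheads() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("Unsafe.allocateMemory (native call preparation)", 20);
        m.put("Unsafe_AllocateMemory (JNI transition)", 44);
        m.put("os::malloc(size, type)", 17);
        m.put("os::malloc(size, type, stack)", 20);
        m.put("HandleMarkCleaner, thread_from_jni_environment", 23);
        return m;
    }

    // Sum of the seconds that an overhead-free path would save.
    static int totalOverhead() {
        int sum = 0;
        for (int seconds : overheads().values()) {
            sum += seconds;
        }
        return sum;
    }

    // Speedup if `saved` seconds are removed from a run of `total` seconds.
    static double speedup(double total, double saved) {
        return total / (total - saved);
    }

    public static void main(String[] args) {
        int total = 221; // seconds attributed to allocateMemory itself
        // Print each layer with its share of the allocateMemory time.
        overheads().forEach((layer, seconds) ->
            System.out.printf("%-50s %3d s (%4.1f%%)%n",
                layer, seconds, 100.0 * seconds / total));
        int saved = totalOverhead();
        System.out.printf("saved %d s of %d s, speedup %.2fx%n",
            saved, total, speedup(total, saved));
    }
}
```

To apply it to another configuration, replace the map entries and the
total with the numbers from your own profile.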

(Note that this does not explain the 3x difference against JNA, since a
JNI transition should also be involved there. But, using this flow, you
can take the configuration where you observed the difference, and
dissect it).

(Note #2: I'll let the runtime/NMT folks figure out whether NMT should
impose less overhead here).

Thanks,
-Aleksey.