JNI-performance - Is it really that fast?

Wed Mar 26 21:02:35 UTC 2008

>>
> Well, ok, a stop-the-world thing "just" to revoke the bias would be
> really expensive ... not sure if it would be a win even under optimal
> conditions for this use.

It speaks to how (relatively) expensive those atomics are that biased  
locking is profitable even though we might have to occasionally  
perform revocation via stop-the-world safepoints.

<...>

>>
> Well, that's an argument I hadn't really thought of, and you're right,
> of course.
> Most likely the situations where this could be a win are rare, and a
> lot of work would be needed to implement and maintain it.
> But don't you think the situation will only get worse the more cores
> the current "design" is stretched to?
> Let's just hope the situation will improve :)

The atomics proper shouldn't be any more expensive on a 256-way than  
on a 2-way.  For the most part they're accomplished locally in the  
cache.   (There was a time when atomics "locked the bus" and impeded  
scalability, but that hasn't been true for many years).  Conceptually  
a compare-and-swap (CAS) or other atomic shouldn't have any more  
impact on the system than a store.   And in fact there was a rough  
relationship between pipeline depth and CAS latency, as most CPUs  
implemented CAS by draining the pipeline, allowing the store buffer to  
drain, and otherwise letting the processor quiesce, as well as killing  
out-of-order execution.  Those are CPU-local effects.  Furthermore  
they're largely an implementation artifact as until recently processor  
designers didn't pay too much attention to atomic latency.  Thankfully  
that appears to be changing and we're seeing more efficient atomics.
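
For reference, a minimal sketch of the CAS retry idiom under discussion, written against java.util.concurrent.atomic.AtomicLong. The class and method names (CasSketch, casIncrement, run) are made up for illustration and the thread counts are arbitrary. It only shows the shape of the operation and says nothing about its latency on any particular CPU.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasSketch {
    // Classic CAS retry loop: read the current value, compute the new one,
    // and try to publish it; retry if another thread got there first.
    static long casIncrement(AtomicLong counter) {
        while (true) {
            long seen = counter.get();
            long next = seen + 1;
            if (counter.compareAndSet(seen, next)) {
                return next;
            }
        }
    }

    // Runs `threads` threads that each perform `perThread` CAS increments
    // on a fresh counter, and returns the final counter value.
    static long run(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    casIncrement(counter);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // No increment is lost, regardless of how the threads interleave.
        System.out.println(run(4, 100_000));
    }
}
```

A failed compareAndSet simply means another thread updated the value between the read and the CAS, so the loop re-reads and tries again.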

>
>
> Thanks a lot for listening and explaining everything in such detail,
> lg Clemens
>
> PS:
> I did some micro-benchmarking on my machine again for a JNI call:
> 180ms - JNI per call, no locking
> 240ms - command buffer (32k), locked JNI call every 1600 calls,
>         native-side buffer interpreter (a switch statement)
> 629ms - JNI per call, locked
> Locking was done with a ReentrantLock.
>
> So the command buffering and interpreting seems to pay off, although
> on my machine it's still slower than an unlocked JNI call.

If I'm interpreting the data correctly, that doesn't seem too  
surprising.   There was a sufficiently long warmup period, and the  
warmup exercised the code in precisely the same way it would execute  
during the benchmark interval?
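
To make the warmup question concrete, here is a rough sketch of a harness that runs exactly the same code path during warmup and during the timed interval. It is not the benchmark from the quoted mail. op() is a hypothetical pure-Java stand-in for the native call, so no JNI is involved, and the names and round counts are invented. Only the ReentrantLock and the batch size of 1600 come from the numbers above.

```java
import java.util.concurrent.locks.ReentrantLock;

public class WarmupSketch {
    static final ReentrantLock lock = new ReentrantLock();
    static long sink; // accumulates results so the JIT cannot discard the work

    // Hypothetical stand-in for the per-call native operation (not a JNI call).
    static void op(int i) {
        sink += i ^ (i >>> 3);
    }

    // One lock acquisition per operation.
    static void lockedPerCall(int calls) {
        for (int i = 0; i < calls; i++) {
            lock.lock();
            try {
                op(i);
            } finally {
                lock.unlock();
            }
        }
    }

    // One lock acquisition per batch of `batch` operations.
    static void lockedPerBatch(int calls, int batch) {
        for (int start = 0; start < calls; start += batch) {
            int end = Math.min(start + batch, calls);
            lock.lock();
            try {
                for (int i = start; i < end; i++) {
                    op(i);
                }
            } finally {
                lock.unlock();
            }
        }
    }

    // Runs `body` untimed for warmupRounds, then returns the average
    // nanoseconds per round over measuredRounds of the very same body.
    static long timeNanos(Runnable body, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            body.run();
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < measuredRounds; i++) {
            body.run();
        }
        return (System.nanoTime() - t0) / measuredRounds;
    }

    public static void main(String[] args) {
        int calls = 1_000_000;
        long perCall = timeNanos(() -> lockedPerCall(calls), 20, 20);
        long perBatch = timeNanos(() -> lockedPerBatch(calls, 1600), 20, 20);
        System.out.println("locked per call:  " + perCall + " ns/round");
        System.out.println("locked per batch: " + perBatch + " ns/round");
    }
}
```

The point of passing the same Runnable to both phases is that the compiler sees the same call sites, lock state and branch profile in warmup as in the measured run. Absolute numbers from a toy like this will not match the JNI figures quoted above.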

Regards
Dave