RFR: 8007806: Need a Throwables performance counter

Sun Feb 24 22:18:14 UTC 2013

We've not-so-slightly hijacked Nils' thread here - apologies for that.

On 25/02/2013 8:05 AM, Peter Levart wrote:
>
> Just looked at one way jstat accesses the counters. It runs in a
> separate VM and maps-in a file that is already mapped in the observing
> VM in the direct buffer. It then accesses it via a LongBuffer view (for
> long counters). So there's no synchronization between counter updater
> and counter reader. On ARM v6 jstat could see a "torn" long counter
> then, right?

Right. The current implementation of PerfLongCounter uses simple
stores (not atomic ops).
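
To make the tearing concrete, here is a minimal Java sketch. It is not code
from PerfCounter or jstat, and the class and method names are made up for
illustration. It simulates, deterministically, a reader that sees the new low
word of a 64-bit counter together with the stale high word, which is what can
happen when the store is carried out as two 32-bit writes. Which half gets
written first is an assumption of the sketch.

```java
final class TornReadDemo {
    /** Combines two 32-bit halves into one 64-bit value. */
    static long combine(int hi, int lo) {
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    /**
     * Simulates a reader that looks at the counter after the writer has
     * stored the new low word but before it has stored the new high word
     * (assumed write order: low half first).
     */
    static long tornRead(long oldValue, long newValue) {
        int staleHi = (int) (oldValue >>> 32); // high word not yet updated
        int freshLo = (int) newValue;          // low word already updated
        return combine(staleHi, freshLo);
    }

    public static void main(String[] args) {
        long before = 0x00000000FFFFFFFFL; // 4294967295
        long after = before + 1;           // 0x0000000100000000
        // The reader sees neither 'before' nor 'after'.
        System.out.println(tornRead(before, after)); // prints 0
    }
}
```

The interesting case is an increment that carries into the high word: the
observed value is then off by 2^32, not by one.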

> The double-32bit-CAS updater that I presented would not make it worse
> then on such platforms, I suppose.

No change in tearing ability.
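
For readers without access to the webrev, the general idea of a double 32-bit
CAS add can be sketched roughly as below. This is an illustration under
assumptions, not Peter's actual patch: it uses two AtomicInteger fields in
place of the two halves of the native-memory counter, and the names
SplitCounter and casAdd are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SplitCounter {
    private final AtomicInteger lo = new AtomicInteger(); // low 32 bits
    private final AtomicInteger hi = new AtomicInteger(); // high 32 bits

    /** Adds delta with a 32-bit CAS loop and returns the previous value. */
    private static int casAdd(AtomicInteger a, int delta) {
        int old;
        do {
            old = a.get();
        } while (!a.compareAndSet(old, old + delta));
        return old;
    }

    /** Adds delta (modulo 2^64) using only 32-bit atomic operations. */
    void add(long delta) {
        int dLo = (int) delta;
        int dHi = (int) (delta >>> 32);
        int oldLo = casAdd(lo, dLo);
        int newLo = oldLo + dLo;
        // Unsigned overflow of the low word means a carry into the high word.
        int carry = Integer.compareUnsigned(newLo, oldLo) < 0 ? 1 : 0;
        int hiDelta = dHi + carry;
        if (hiDelta != 0) {
            casAdd(hi, hiDelta); // the second, usually rare, 32-bit CAS
        }
    }

    /** Unsynchronized read: exact only when no add() is in flight. */
    long get() {
        return ((long) hi.get() << 32) | (lo.get() & 0xFFFFFFFFL);
    }

    public static void main(String[] args) throws InterruptedException {
        SplitCounter c = new SplitCounter();
        c.add(0xFFFFFFF0L); // start near the 32-bit boundary to force a carry
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) c.add(1);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // Read after all writers have finished, so the sum is exact.
        System.out.println(c.get() == 0xFFFFFFF0L + 400_000L);
    }
}
```

A reader calling get() between the two CAS steps can still see a value that
is off by 2^32, which matches the "no change in tearing" point above; only
the final sum of add() calls is guaranteed.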

> On the platforms that support 64bit atomic stores, there are no such
> problems. And I assume those same platforms also support 64bit CAS, or
> are there platforms with 64bit atomic stores and no 64bit CAS?

Most of them actually :) All Java platforms must support atomic 
load/store of 64-bit values to support volatile long and double 
variables. On 32-bit platforms this is done via a range of techniques - 
for example on x86 it is done via the FPU. But these atomic accesses are 
currently restricted to Java volatile field accesses via bytecode - 
they are not exposed via the Unsafe methods, nor are they made 
available via the Atomic:: class in the VM.

Some of these 32-bit platforms also support the 64-bit CAS, which is 
what supports_cx8() is intended to indicate.

If the PerfCounters were supposed to be thread-safe then they might use 
these alternate atomic access operations.
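
As a Java-level illustration of the guarantee mentioned above, the following
sketch contrasts a volatile long field (atomic 64-bit load and store, per the
Java Language Specification, section 17.7) with an AtomicLong (atomic
read-modify-write). It says nothing about how the VM implements either on a
given platform, and the class name VolatileLongDemo is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class VolatileLongDemo {
    // Reads and writes of a volatile long are atomic, even on 32-bit VMs.
    private volatile long plain;

    // AtomicLong additionally offers atomic read-modify-write operations.
    private final AtomicLong atomic = new AtomicLong();

    void store(long v) {
        plain = v;      // one untorn 64-bit store
    }

    long load() {
        return plain;   // one untorn 64-bit load
    }

    long addAtomically(long delta) {
        // Unlike "plain += delta", this cannot lose concurrent updates.
        return atomic.addAndGet(delta);
    }

    public static void main(String[] args) {
        VolatileLongDemo d = new VolatileLongDemo();
        d.store(0x100000001L);
        System.out.println(Long.toHexString(d.load()));
        System.out.println(d.addAtomically(5));
    }
}
```

Note that volatile alone only rules out torn reads and writes; an increment
written as "plain += 1" is still a separate load and store.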

David

> Regards, Peter
>
>>
>> David
>>
>>> Regards, Peter
>>>
>>>>
>>>> David
>>>> -----
>>>>
>>>>> If this is true and it is not that important, then instead of a
>>>>> synchronized update of 64bit counter, a 32bit CAS could be used,
>>>>> optionally (rarely) followed by a second 32bit CAS, like for example:
>>>>>
>>>>> http://dl.dropbox.com/u/101777488/jdk8-tl/PerfCounter/webrev.01/index.html
>>>>>
>>>>>
>>>>>
>>>>> I tried this on ARM v6 and it works much better than synchronized
>>>>> access, but I don't know if it's acceptable. It guarantees eventual
>>>>> correctness of summed value if the only operation performed is
>>>>> add() (no
>>>>> set() intermingled) and has the same possibility of incorrect
>>>>> half-half
>>>>> reads by observers as current PerfCounter has for unsynchronized
>>>>> observers.
>>>>>
>>>>> Here's the comparison of unpatched/patched PerfCounter.increment()
>>>>> micro-benchmark on single-core ARM v6 (Raspberry Pi):
>>>>>
>>>>> *** Original PerfCounter, ARM v6
>>>>>
>>>>> #
>>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 1
>>>>> #
>>>>>             1 threads, Tavg =    269.34 ns/op (σ =   0.00 ns/op) [
>>>>> 269.34]
>>>>>             2 threads, Tavg =  7,170.48 ns/op (σ = 410.77 ns/op) [
>>>>> 6,783.73,  7,603.95]
>>>>>             3 threads, Tavg = 12,034.82 ns/op (σ = 418.99 ns/op)
>>>>> [11,792.33, 11,714.67, 12,639.26]
>>>>>             4 threads, Tavg = 16,029.76 ns/op (σ = 1,411.44 ns/op)
>>>>> [15,592.04, 18,511.52, 15,642.52, 14,818.16]
>>>>>
>>>>>
>>>>> *** Patched PerfCounter, ARM v6
>>>>>
>>>>> #
>>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 1
>>>>> #
>>>>>             1 threads, Tavg =    166.21 ns/op (σ =   0.00 ns/op) [
>>>>> 166.21]
>>>>>             2 threads, Tavg =    332.58 ns/op (σ =   0.12 ns/op) [
>>>>> 332.45,    332.70]
>>>>>             3 threads, Tavg =    500.30 ns/op (σ =   0.22 ns/op) [
>>>>> 500.04,    500.29,    500.58]
>>>>>             4 threads, Tavg =    667.95 ns/op (σ =   2.11 ns/op) [
>>>>> 665.22,    667.18,    668.40,    671.04]
>>>>>
>>>>>
>>>>> Regards, Peter
>>>>>
>>>>>
>>>>> On 02/24/2013 11:31 AM, David Holmes wrote:
>>>>>> On 24/02/2013 6:50 PM, Peter Levart wrote:
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I thought it was ok to pass null, but I don't know the "portability"
>>>>>>> issues in-depth. The javadoc for Unsafe says:
>>>>>>>
>>>>>>> /"This method refers to a variable by means of two parameters, and
>>>>>>> so it
>>>>>>> provides (in effect) a double-register addressing mode for Java
>>>>>>> variables. When the object reference is null, this method uses its
>>>>>>> offset as an absolute address. This is similar in operation to
>>>>>>> methods
>>>>>>> such as getInt(long), which provide (in effect) a single-register
>>>>>>> addressing mode for non-Java variables. However, because Java
>>>>>>> variables
>>>>>>> may have a different layout in memory from non-Java variables,
>>>>>>> programmers should not assume that these two addressing modes are
>>>>>>> ever
>>>>>>> equivalent. Also, programmers should remember that offsets from the
>>>>>>> double-register addressing mode cannot be portably confused with
>>>>>>> longs
>>>>>>> used in the single-register addressing mode."/
>>>>>>
>>>>>> That is the doc for getXXX but not for getAndAddXXX or
>>>>>> compareAndSwapXXX. You can't have null here:
>>>>>>
>>>>>> UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapLong(JNIEnv *env, jobject
>>>>>> unsafe, jobject obj, jlong offset, jlong e, jlong x))
>>>>>>   UnsafeWrapper("Unsafe_CompareAndSwapLong");
>>>>>>   Handle p (THREAD, JNIHandles::resolve(obj));
>>>>>>   jlong* addr = (jlong*)(index_oop_from_field_offset_long(p(),
>>>>>> offset));
>>>>>>   if (VM_Version::supports_cx8())
>>>>>>     return (jlong)(Atomic::cmpxchg(x, addr, e)) == e;
>>>>>>   else {
>>>>>>     jboolean success = false;
>>>>>>     ObjectLocker ol(p, THREAD);
>>>>>>     if (*addr == e) { *addr = x; success = true; }
>>>>>>     return success;
>>>>>>   }
>>>>>> UNSAFE_END
>>>>>>
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>>
>>>>>>> Does anybody know the in-depth interpretation of the above? Is it
>>>>>>> only
>>>>>>> the particular Java/native type differences (for example,
>>>>>>> endianness of
>>>>>>> variables) that these two addressing modes might interpret
>>>>>>> differently
>>>>>>> or something else too?
>>>>>>>
>>>>>>> Regards, Peter
>>>>>>>
>>>>>>>
>>>>>>> On 02/24/2013 12:39 AM, David Holmes wrote:
>>>>>>>> Peter,
>>>>>>>>
>>>>>>>> In your use of Unsafe you pass "null" as the object. I'm pretty
>>>>>>>> certain you can't pass null here. Unsafe operates on fields or
>>>>>>>> array
>>>>>>>> elements.
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On 24/02/2013 5:39 AM, Peter Levart wrote:
>>>>>>>>> Hi Nils,
>>>>>>>>>
>>>>>>>>> If the counters are updated frequently from multiple threads,
>>>>>>>>> there
>>>>>>>>> might be contention/scalability issues. Instead of
>>>>>>>>> synchronization on
>>>>>>>>> updates, you might consider using atomic updates provided by
>>>>>>>>> sun.misc.Unsafe, like for example:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Index: jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>>>> ===================================================================
>>>>>>>>>
>>>>>>>>> --- jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>>>> +++ jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>>>> @@ -25,6 +25,8 @@
>>>>>>>>>
>>>>>>>>>   package sun.misc;
>>>>>>>>>
>>>>>>>>> +import sun.nio.ch.DirectBuffer;
>>>>>>>>> +
>>>>>>>>>   import java.nio.ByteBuffer;
>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>   import java.nio.LongBuffer;
>>>>>>>>> @@ -50,6 +52,8 @@
>>>>>>>>>   public class PerfCounter {
>>>>>>>>>       private static final Perf perf =
>>>>>>>>>           AccessController.doPrivileged(new Perf.GetPerfAction());
>>>>>>>>> +    private static final Unsafe unsafe =
>>>>>>>>> +        Unsafe.getUnsafe();
>>>>>>>>>
>>>>>>>>>       // Must match values defined in
>>>>>>>>> hotspot/src/share/vm/runtime/perfdata.hpp
>>>>>>>>>       private final static int V_Constant  = 1;
>>>>>>>>> @@ -59,12 +63,14 @@
>>>>>>>>>
>>>>>>>>>       private final String name;
>>>>>>>>>       private final LongBuffer lb;
>>>>>>>>> +    private final DirectBuffer db;
>>>>>>>>>
>>>>>>>>>       private PerfCounter(String name, int type) {
>>>>>>>>>           this.name = name;
>>>>>>>>>           ByteBuffer bb = perf.createLong(name, U_None, type, 0L);
>>>>>>>>>           bb.order(ByteOrder.nativeOrder());
>>>>>>>>>           this.lb = bb.asLongBuffer();
>>>>>>>>> +        this.db = bb instanceof DirectBuffer ? (DirectBuffer)
>>>>>>>>> bb :
>>>>>>>>> null;
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>       static PerfCounter newPerfCounter(String name) {
>>>>>>>>> @@ -79,23 +85,44 @@
>>>>>>>>>       /**
>>>>>>>>>        * Returns the current value of the perf counter.
>>>>>>>>>        */
>>>>>>>>> -    public synchronized long get() {
>>>>>>>>> +    public long get() {
>>>>>>>>> +        if (db != null) {
>>>>>>>>> +            return unsafe.getLongVolatile(null, db.address());
>>>>>>>>> +        }
>>>>>>>>> +        else {
>>>>>>>>> +            synchronized (this) {
>>>>>>>>> -        return lb.get(0);
>>>>>>>>> -    }
>>>>>>>>> +                return lb.get(0);
>>>>>>>>> +            }
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>>
>>>>>>>>>       /**
>>>>>>>>>        * Sets the value of the perf counter to the given newValue.
>>>>>>>>>        */
>>>>>>>>> -    public synchronized void set(long newValue) {
>>>>>>>>> +    public void set(long newValue) {
>>>>>>>>> +        if (db != null) {
>>>>>>>>> +            unsafe.putOrderedLong(null, db.address(), newValue);
>>>>>>>>> +        }
>>>>>>>>> +        else {
>>>>>>>>> +            synchronized (this) {
>>>>>>>>> -        lb.put(0, newValue);
>>>>>>>>> -    }
>>>>>>>>> +                lb.put(0, newValue);
>>>>>>>>> +            }
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>>
>>>>>>>>>       /**
>>>>>>>>>        * Adds the given value to the perf counter.
>>>>>>>>>        */
>>>>>>>>> -    public synchronized void add(long value) {
>>>>>>>>> -        long res = get() + value;
>>>>>>>>> +    public void add(long value) {
>>>>>>>>> +        if (db != null) {
>>>>>>>>> +            unsafe.getAndAddLong(null, db.address(), value);
>>>>>>>>> +        }
>>>>>>>>> +        else {
>>>>>>>>> +            synchronized (this) {
>>>>>>>>> +                long res = lb.get(0) + value;
>>>>>>>>> -        lb.put(0, res);
>>>>>>>>> +                lb.put(0, res);
>>>>>>>>> +            }
>>>>>>>>> +        }
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>       /**
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Testing the PerfCounter.increment() method in a loop on multiple
>>>>>>>>> threads
>>>>>>>>> sharing the same PerfCounter instance, for example, on a 4-core
>>>>>>>>> Intel i7
>>>>>>>>> machine produces the following results:
>>>>>>>>>
>>>>>>>>> #
>>>>>>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical
>>>>>>>>> CPUS: 8
>>>>>>>>> #
>>>>>>>>>             1 threads, Tavg =     19.02 ns/op (σ = 0.00 ns/op)
>>>>>>>>>             2 threads, Tavg =    109.93 ns/op (σ = 6.17 ns/op)
>>>>>>>>>             3 threads, Tavg =    136.64 ns/op (σ = 2.99 ns/op)
>>>>>>>>>             4 threads, Tavg =    293.26 ns/op (σ = 5.30 ns/op)
>>>>>>>>>             5 threads, Tavg =    316.94 ns/op (σ = 6.28 ns/op)
>>>>>>>>>             6 threads, Tavg =    686.96 ns/op (σ = 7.09 ns/op)
>>>>>>>>>             7 threads, Tavg =    793.28 ns/op (σ = 10.57 ns/op)
>>>>>>>>>             8 threads, Tavg =    898.15 ns/op (σ = 14.63 ns/op)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With the presented patch, the results are a little better:
>>>>>>>>>
>>>>>>>>> #
>>>>>>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical
>>>>>>>>> CPUS: 8
>>>>>>>>> #
>>>>>>>>> # Measure:
>>>>>>>>>             1 threads, Tavg =      5.22 ns/op (σ = 0.00 ns/op)
>>>>>>>>>             2 threads, Tavg =     34.51 ns/op (σ = 0.60 ns/op)
>>>>>>>>>             3 threads, Tavg =     54.85 ns/op (σ = 1.42 ns/op)
>>>>>>>>>             4 threads, Tavg =     74.67 ns/op (σ = 1.71 ns/op)
>>>>>>>>>             5 threads, Tavg =     94.71 ns/op (σ = 41.68 ns/op)
>>>>>>>>>             6 threads, Tavg =    114.80 ns/op (σ = 32.10 ns/op)
>>>>>>>>>             7 threads, Tavg =    136.70 ns/op (σ = 26.80 ns/op)
>>>>>>>>>             8 threads, Tavg =    158.48 ns/op (σ = 9.93 ns/op)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The scalability is not much better, but the raw speed is, so it
>>>>>>>>> might
>>>>>>>>> present less contention when used in real-world code. If you
>>>>>>>>> wanted
>>>>>>>>> even
>>>>>>>>> better scalability, there is a new class in JDK8, the
>>>>>>>>> java.util.concurrent.atomic.LongAdder. But that doesn't buy atomic
>>>>>>>>> "set()" -
>>>>>>>>> only "add()". And it can't update native-memory variables, so it
>>>>>>>>> could
>>>>>>>>> only be used for add-only counters and in conjunction with a
>>>>>>>>> background
>>>>>>>>> thread that would periodically flush the sum to the native
>>>>>>>>> memory....
>>>>>>>>>
>>>>>>>>> Regards, Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02/08/2013 06:10 PM, Nils Loodin wrote:
>>>>>>>>>> It would be interesting to know the number of thrown
>>>>>>>>>> throwables in
>>>>>>>>>> the
>>>>>>>>>> JVM, to be able to do some high level application diagnostics /
>>>>>>>>>> statistics. A good way to put this number would be a performance
>>>>>>>>>> counter, since it is accessible both from Java and from the VM.
>>>>>>>>>>
>>>>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007806
>>>>>>>>>> http://cr.openjdk.java.net/~nloodin/8007806/webrev.00/
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Nils Loodin
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>