RFR: 8007806: Need a Throwables performance counter

Sun Feb 24 22:05:09 UTC 2013

On 02/24/2013 10:48 PM, David Holmes wrote:
> Peter,
>
> On 25/02/2013 7:25 AM, Peter Levart wrote:
>>
>> On 02/24/2013 09:57 PM, David Holmes wrote:
>>> On 25/02/2013 6:18 AM, Peter Levart wrote:
>>>> Hi Alan, David, Nils,
>>>>
>>>> I just want to clear up something regarding the PerfCounter implementation.
>>>>
>>>> Access to the 64bit value in native memory is through a direct buffer
>>>> which uses normal read/write (non-volatile, Unsafe.[get|set]Long). So on
>>>> processors that don't support atomic 64bit stores/loads, each access
>>>> results in two separate 32bit load/store accesses, right?
>>>
>>> Unsafe.get|setLong uses locking on those platforms.
>>
>> Even if it does, it is important whether "all" accesses to this 64bit
>> value are using locking and whether they are using the same lock. Aren't
>> performance counters JVM native variables where just some of them happen
>> to be updated from Java?
>
> AFAICS PerfCounters have no thread-safety properties or guarantees. So 
> it is up to the user of each counter to use it in an appropriate way. 
> I think the serviceability folk would have to chime in here as to how 
> PerfCounters are supposed to be used.
>
>>>> The PerfCounter methods that access the 64bit value are synchronized,
>>>> using PerfCounter instance as a lock. But how is this 64bit value
>>>> accessed, for example, in the jstat utility? Is it possible that jstat can
>>>> see one half of the 64bit value before the update and the other half
>>>> after the update?
>>>
>>> Does jstat access these values directly or only via the synchronized
>>> methods? If the latter then the value can't be "torn" that way. The
>>> sync method will store the value in 2 32-bit registers, and the
>>> variable load in jstat will take two instructions, but nothing can
>>> touch those registers.
>>
>> I'm not saying that the value could be corrupted in any way, just that
>> the unsynchronized observer (like jstat) could see it "torn" sometimes.
>
> If the value is initially read in a sync block and all updates are 
> also synchronized, then I don't think it can. But you need to look at 
> actual code to determine this.

I just looked at one way jstat accesses the counters. It runs in a
separate VM and maps in a file that is already mapped, in the observed
VM, into the direct buffer. It then accesses it via a LongBuffer view (for
long counters). So there's no synchronization between the counter updater
and the counter reader. On ARM v6, jstat could then see a "torn" long
counter, right?
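
To make the question concrete, here is a minimal single-threaded model of
such a torn read. It is only a sketch under stated assumptions: the class
TornReadDemo and its methods are hypothetical names, the two int fields
stand in for the two 32bit halves of the counter, and the low half is
assumed to be stored first. It is not jstat or HotSpot code.

```java
// Hypothetical model of a 64-bit counter kept as two 32-bit words, as on a
// CPU without atomic 64-bit loads/stores. Not jstat or HotSpot code.
class TornReadDemo {
    static int lo; // low 32 bits of the counter
    static int hi; // high 32 bits of the counter

    // Reassembles the 64-bit value from the two words.
    static long load() {
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    // Stores 'before', then updates to 'after' one word at a time (low word
    // first, an assumption) and returns what an observer reading between the
    // two stores would see.
    static long observeMidUpdate(long before, long after) {
        lo = (int) before;
        hi = (int) (before >>> 32);
        lo = (int) after;            // first 32-bit store of the update
        long seen = load();          // unsynchronized observer reads here
        hi = (int) (after >>> 32);   // second 32-bit store of the update
        return seen;
    }

    public static void main(String[] args) {
        long before = 0xFFFFFFFFL;   // low word is about to overflow
        long after = before + 1;
        System.out.println("observed mid-update: " + observeMidUpdate(before, after));
        System.out.println("value after update:  " + load());
    }
}
```

In this model the observer sees 0, which is neither the old value
(4294967295) nor the new one (4294967296).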

The double-32bit-CAS updater that I presented would not make it any
worse on such platforms, I suppose.
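
For readers without the webrev at hand, the general idea can be sketched
roughly as follows. This is an illustration only, not the code from the
webrev: the class SplitCounter is a hypothetical name, and two
AtomicInteger fields stand in for the two 32bit halves of the counter in
native memory. The low half is always updated with a CAS loop; the high
half is touched only when there is a carry (or a non-zero high part to
add).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of an add-only 64-bit counter built from two 32-bit
// CAS-updated words. Not the code from the webrev.
class SplitCounter {
    private final AtomicInteger lo = new AtomicInteger(); // low 32 bits
    private final AtomicInteger hi = new AtomicInteger(); // high 32 bits

    void add(long value) {
        int addLo = (int) value;
        int addHi = (int) (value >>> 32);
        int oldLo, newLo;
        do {                                   // first 32-bit CAS: low word
            oldLo = lo.get();
            newLo = oldLo + addLo;
        } while (!lo.compareAndSet(oldLo, newLo));
        // Unsigned wrap-around of the low word means a carry into the high word.
        int carry = Integer.compareUnsigned(newLo, oldLo) < 0 ? 1 : 0;
        int hiDelta = addHi + carry;
        if (hiDelta != 0) {                    // second 32-bit CAS, usually skipped
            int oldHi;
            do {
                oldHi = hi.get();
            } while (!hi.compareAndSet(oldHi, oldHi + hiDelta));
        }
    }

    // Not atomic across the two words: a concurrent reader may see a torn value.
    long get() {
        return ((long) hi.get() << 32) | (lo.get() & 0xFFFFFFFFL);
    }

    public static void main(String[] args) throws InterruptedException {
        SplitCounter c = new SplitCounter();
        c.add(0xFFFFFF00L);                    // start close to a low-word overflow
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) c.add(1);
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // Once all adders have finished, the sum is exact.
        System.out.println(c.get() == 0xFFFFFF00L + 400_000L);
    }
}
```

The sum is only guaranteed to be exact once all concurrent add() calls
have completed; a reader racing with an update that carries can still
observe a half-updated value, as with the current PerfCounter.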

On the platforms that support 64bit atomic stores, there are no such
problems. And I assume those same platforms also support 64bit CAS, or
are there platforms with 64bit atomic stores and no 64bit CAS?

Regards, Peter

>
> David
>
>> Regards, Peter
>>
>>>
>>> David
>>> -----
>>>
>>>> If this is true and it is not that important, then instead of a
>>>> synchronized update of 64bit counter, a 32bit CAS could be used,
>>>> optionally (rarely) followed by a second 32bit CAS, like for example:
>>>>
>>>> http://dl.dropbox.com/u/101777488/jdk8-tl/PerfCounter/webrev.01/index.html 
>>>>
>>>>
>>>>
>>>> I tried this on ARM v6 and it works much better than synchronized
>>>> access, but I don't know if it's acceptable. It guarantees eventual
>>>> correctness of the summed value if the only operation performed is
>>>> add() (no set() intermingled) and has the same possibility of incorrect
>>>> half-half reads by observers as the current PerfCounter has for
>>>> unsynchronized observers.
>>>>
>>>> Here's the comparison of unpatched/patched PerfCounter.increment()
>>>> micro-benchmark on single-core ARM v6 (Raspberry Pi):
>>>>
>>>> *** Original PerfCounter, ARM v6
>>>>
>>>> #
>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 1
>>>> #
>>>>             1 threads, Tavg =    269.34 ns/op (σ =   0.00 ns/op) [269.34]
>>>>             2 threads, Tavg =  7,170.48 ns/op (σ = 410.77 ns/op) [6,783.73,  7,603.95]
>>>>             3 threads, Tavg = 12,034.82 ns/op (σ = 418.99 ns/op) [11,792.33, 11,714.67, 12,639.26]
>>>>             4 threads, Tavg = 16,029.76 ns/op (σ = 1,411.44 ns/op) [15,592.04, 18,511.52, 15,642.52, 14,818.16]
>>>>
>>>>
>>>> *** Patched PerfCounter, ARM v6
>>>>
>>>> #
>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 1
>>>> #
>>>>             1 threads, Tavg =    166.21 ns/op (σ =   0.00 ns/op) [166.21]
>>>>             2 threads, Tavg =    332.58 ns/op (σ =   0.12 ns/op) [332.45,    332.70]
>>>>             3 threads, Tavg =    500.30 ns/op (σ =   0.22 ns/op) [500.04,    500.29,    500.58]
>>>>             4 threads, Tavg =    667.95 ns/op (σ =   2.11 ns/op) [665.22,    667.18,    668.40,    671.04]
>>>>
>>>>
>>>> Regards, Peter
>>>>
>>>>
>>>> On 02/24/2013 11:31 AM, David Holmes wrote:
>>>>> On 24/02/2013 6:50 PM, Peter Levart wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> I thought it was ok to pass null, but I don't know the "portability"
>>>>>> issues in-depth. The javadoc for Unsafe says:
>>>>>>
>>>>>> "This method refers to a variable by means of two parameters, and so it
>>>>>> provides (in effect) a double-register addressing mode for Java
>>>>>> variables. When the object reference is null, this method uses its
>>>>>> offset as an absolute address. This is similar in operation to methods
>>>>>> such as getInt(long), which provide (in effect) a single-register
>>>>>> addressing mode for non-Java variables. However, because Java variables
>>>>>> may have a different layout in memory from non-Java variables,
>>>>>> programmers should not assume that these two addressing modes are ever
>>>>>> equivalent. Also, programmers should remember that offsets from the
>>>>>> double-register addressing mode cannot be portably confused with longs
>>>>>> used in the single-register addressing mode."
>>>>>
>>>>> That is the doc for getXXX but not for getAndAddXXX or
>>>>> compareAndSwapXXX. You can't have null here:
>>>>>
>>>>> UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapLong(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jlong e, jlong x))
>>>>>   UnsafeWrapper("Unsafe_CompareAndSwapLong");
>>>>>   Handle p (THREAD, JNIHandles::resolve(obj));
>>>>>   jlong* addr = (jlong*)(index_oop_from_field_offset_long(p(), offset));
>>>>>   if (VM_Version::supports_cx8())
>>>>>     return (jlong)(Atomic::cmpxchg(x, addr, e)) == e;
>>>>>   else {
>>>>>     jboolean success = false;
>>>>>     ObjectLocker ol(p, THREAD);
>>>>>     if (*addr == e) { *addr = x; success = true; }
>>>>>     return success;
>>>>>   }
>>>>> UNSAFE_END
>>>>>
>>>>> David
>>>>> -----
>>>>>
>>>>>
>>>>>> Does anybody know the in-depth interpretation of the above? Is it only
>>>>>> the particular Java/native type differences (for example, endianness of
>>>>>> variables) that these two addressing modes might interpret differently,
>>>>>> or something else too?
>>>>>>
>>>>>> Regards, Peter
>>>>>>
>>>>>>
>>>>>> On 02/24/2013 12:39 AM, David Holmes wrote:
>>>>>>> Peter,
>>>>>>>
>>>>>>> In your use of Unsafe you pass "null" as the object. I'm pretty
>>>>>>> certain you can't pass null here. Unsafe operates on fields or array
>>>>>>> elements.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On 24/02/2013 5:39 AM, Peter Levart wrote:
>>>>>>>> Hi Nils,
>>>>>>>>
>>>>>>>> If the counters are updated frequently from multiple threads, there
>>>>>>>> might be contention/scalability issues. Instead of synchronization on
>>>>>>>> updates, you might consider using atomic updates provided by
>>>>>>>> sun.misc.Unsafe, like for example:
>>>>>>>>
>>>>>>>>
>>>>>>>> Index: jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>>> =================================================================== 
>>>>>>>>
>>>>>>>> --- jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>>> +++ jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>>> @@ -25,6 +25,8 @@
>>>>>>>>
>>>>>>>>   package sun.misc;
>>>>>>>>
>>>>>>>> +import sun.nio.ch.DirectBuffer;
>>>>>>>> +
>>>>>>>>   import java.nio.ByteBuffer;
>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>   import java.nio.LongBuffer;
>>>>>>>> @@ -50,6 +52,8 @@
>>>>>>>>   public class PerfCounter {
>>>>>>>>       private static final Perf perf =
>>>>>>>>           AccessController.doPrivileged(new Perf.GetPerfAction());
>>>>>>>> +    private static final Unsafe unsafe =
>>>>>>>> +        Unsafe.getUnsafe();
>>>>>>>>
>>>>>>>>       // Must match values defined in hotspot/src/share/vm/runtime/perfdata.hpp
>>>>>>>>       private final static int V_Constant  = 1;
>>>>>>>> @@ -59,12 +63,14 @@
>>>>>>>>
>>>>>>>>       private final String name;
>>>>>>>>       private final LongBuffer lb;
>>>>>>>> +    private final DirectBuffer db;
>>>>>>>>
>>>>>>>>       private PerfCounter(String name, int type) {
>>>>>>>>           this.name = name;
>>>>>>>>           ByteBuffer bb = perf.createLong(name, U_None, type, 0L);
>>>>>>>>           bb.order(ByteOrder.nativeOrder());
>>>>>>>>           this.lb = bb.asLongBuffer();
>>>>>>>> +        this.db = bb instanceof DirectBuffer ? (DirectBuffer) bb : null;
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       static PerfCounter newPerfCounter(String name) {
>>>>>>>> @@ -79,23 +85,44 @@
>>>>>>>>       /**
>>>>>>>>        * Returns the current value of the perf counter.
>>>>>>>>        */
>>>>>>>> -    public synchronized long get() {
>>>>>>>> +    public long get() {
>>>>>>>> +        if (db != null) {
>>>>>>>> +            return unsafe.getLongVolatile(null, db.address());
>>>>>>>> +        }
>>>>>>>> +        else {
>>>>>>>> +            synchronized (this) {
>>>>>>>> -        return lb.get(0);
>>>>>>>> -    }
>>>>>>>> +                return lb.get(0);
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>>
>>>>>>>>       /**
>>>>>>>>        * Sets the value of the perf counter to the given newValue.
>>>>>>>>        */
>>>>>>>> -    public synchronized void set(long newValue) {
>>>>>>>> +    public void set(long newValue) {
>>>>>>>> +        if (db != null) {
>>>>>>>> +            unsafe.putOrderedLong(null, db.address(), newValue);
>>>>>>>> +        }
>>>>>>>> +        else {
>>>>>>>> +            synchronized (this) {
>>>>>>>> -        lb.put(0, newValue);
>>>>>>>> -    }
>>>>>>>> +                lb.put(0, newValue);
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>>
>>>>>>>>       /**
>>>>>>>>        * Adds the given value to the perf counter.
>>>>>>>>        */
>>>>>>>> -    public synchronized void add(long value) {
>>>>>>>> -        long res = get() + value;
>>>>>>>> +    public void add(long value) {
>>>>>>>> +        if (db != null) {
>>>>>>>> +            unsafe.getAndAddLong(null, db.address(), value);
>>>>>>>> +        }
>>>>>>>> +        else {
>>>>>>>> +            synchronized (this) {
>>>>>>>> +                long res = lb.get(0) + value;
>>>>>>>> -        lb.put(0, res);
>>>>>>>> +                lb.put(0, res);
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       /**
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Testing the PerfCounter.increment() method in a loop on multiple threads
>>>>>>>> sharing the same PerfCounter instance, for example, on a 4-core Intel i7
>>>>>>>> machine produces the following results:
>>>>>>>>
>>>>>>>> #
>>>>>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 8
>>>>>>>> #
>>>>>>>>             1 threads, Tavg =     19.02 ns/op (σ = 0.00 ns/op)
>>>>>>>>             2 threads, Tavg =    109.93 ns/op (σ = 6.17 ns/op)
>>>>>>>>             3 threads, Tavg =    136.64 ns/op (σ = 2.99 ns/op)
>>>>>>>>             4 threads, Tavg =    293.26 ns/op (σ = 5.30 ns/op)
>>>>>>>>             5 threads, Tavg =    316.94 ns/op (σ = 6.28 ns/op)
>>>>>>>>             6 threads, Tavg =    686.96 ns/op (σ = 7.09 ns/op)
>>>>>>>>             7 threads, Tavg =    793.28 ns/op (σ = 10.57 ns/op)
>>>>>>>>             8 threads, Tavg =    898.15 ns/op (σ = 14.63 ns/op)
>>>>>>>>
>>>>>>>>
>>>>>>>> With the presented patch, the results are a little better:
>>>>>>>>
>>>>>>>> #
>>>>>>>> # PerfCounter_increment: run duration:  5,000 ms, #of logical CPUS: 8
>>>>>>>> #
>>>>>>>> # Measure:
>>>>>>>>             1 threads, Tavg =      5.22 ns/op (σ = 0.00 ns/op)
>>>>>>>>             2 threads, Tavg =     34.51 ns/op (σ = 0.60 ns/op)
>>>>>>>>             3 threads, Tavg =     54.85 ns/op (σ = 1.42 ns/op)
>>>>>>>>             4 threads, Tavg =     74.67 ns/op (σ = 1.71 ns/op)
>>>>>>>>             5 threads, Tavg =     94.71 ns/op (σ = 41.68 ns/op)
>>>>>>>>             6 threads, Tavg =    114.80 ns/op (σ = 32.10 ns/op)
>>>>>>>>             7 threads, Tavg =    136.70 ns/op (σ = 26.80 ns/op)
>>>>>>>>             8 threads, Tavg =    158.48 ns/op (σ = 9.93 ns/op)
>>>>>>>>
>>>>>>>>
>>>>>>>> The scalability is not much better, but the raw speed is, so it might
>>>>>>>> present less contention when used in real-world code. If you wanted even
>>>>>>>> better scalability, there is a new class in JDK8,
>>>>>>>> java.util.concurrent.atomic.LongAdder. But that doesn't buy an atomic
>>>>>>>> "set()", only "add()". And it can't update native-memory variables, so it
>>>>>>>> could only be used for add-only counters and in conjunction with a
>>>>>>>> background thread that would periodically flush the sum to the native
>>>>>>>> memory....
>>>>>>>>
>>>>>>>> Regards, Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> On 02/08/2013 06:10 PM, Nils Loodin wrote:
>>>>>>>>> It would be interesting to know the number of thrown throwables in the
>>>>>>>>> JVM, to be able to do some high level application diagnostics /
>>>>>>>>> statistics. A good way to expose this number would be a performance
>>>>>>>>> counter, since it is accessible both from Java and from the VM.
>>>>>>>>>
>>>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007806
>>>>>>>>> http://cr.openjdk.java.net/~nloodin/8007806/webrev.00/
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Nils Loodin
>>>>>>>>
>>>>>>
>>>>
>>