RFR: 8007806: Need a Throwables performance counter
David Holmes
david.holmes at oracle.com
Sun Feb 24 21:48:50 UTC 2013
Peter,
On 25/02/2013 7:25 AM, Peter Levart wrote:
>
> On 02/24/2013 09:57 PM, David Holmes wrote:
>> On 25/02/2013 6:18 AM, Peter Levart wrote:
>>> Hi Alan, David, Nils,
>>>
>>> I just want to clear up something regarding the PerfCounter implementation.
>>>
>>> Access to 64bit value in native memory is through a direct buffer which
>>> uses normal read/write (non-volatile, Unsafe.[get|set]Long). So on
>>> processors that don't support atomic 64bit stores/loads, each access
>>> results in two separate 32bit load/store accesses, right?
>>
>> Unsafe.get|setLong uses locking on those platforms.
>
> Even if it does, it is important whether "all" accesses to this 64bit
> value are using locking and whether they are using the same lock. Aren't
> performance counters JVM native variables where just some of them happen
> to be updated from Java?
AFAICS PerfCounters have no thread-safety properties or guarantees. So
it is up to the user of each counter to use it in an appropriate way. I
think the serviceability folk would have to chime in here as to how
PerfCounters are supposed to be used.
>>> The PerfCounter methods that access the 64bit value are synchronized,
>>> using PerfCounter instance as a lock. But how is this 64bit value
>>> accessed for example in the jstat utility? Is it possible that jstat can
>>> see one half of the 64bit value before the update and the other half
>>> after the update?
>>
>> Does jstat access these values directly or only via the synchronized
>> methods? If the latter then the value can't be "torn" that way. The
>> sync method will store the value in 2 32-bit registers, and the
>> variable load in jstat will take two instructions, but nothing can
>> touch those registers.
>
> I'm not saying that the value could be corrupted in any way, just that
> the unsynchronized observer (like jstat) could see it "torn" sometimes.
If the value is initially read in a sync block and all updates are also
synchronized, then I don't think it can. But you need to look at actual
code to determine this.
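
A minimal sketch of that condition, not the actual PerfCounter or jstat
code: the class name TwoHalfCounter and its fields are hypothetical, and
the 64-bit value is modeled as two 32-bit ints so the two-store update is
explicit. Because every read and every update holds the same monitor, a
reader going through get() cannot observe one half from before an update
and the other half from after it.

```java
/** Hypothetical model of a 64-bit counter stored as two 32-bit words. */
final class TwoHalfCounter {
    private int hi; // upper 32 bits, guarded by "this"
    private int lo; // lower 32 bits, guarded by "this"

    /** Two separate 32-bit stores; no other synchronized method can interleave. */
    synchronized void set(long v) {
        hi = (int) (v >>> 32);
        lo = (int) v;
    }

    /** Read-modify-write under the same monitor (monitors are reentrant). */
    synchronized void add(long delta) {
        set(get() + delta);
    }

    /** Two separate 32-bit loads, consistent because writers hold the same lock. */
    synchronized long get() {
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }
}
```

An observer that reads the two words without taking this monitor gets no
such guarantee, which is the case Peter is asking about.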
David
> Regards, Peter
>
>>
>> David
>> -----
>>
>>> If this is true and it is not that important, then instead of a
>>> synchronized update of 64bit counter, a 32bit CAS could be used,
>>> optionally (rarely) followed by a second 32bit CAS, like for example:
>>>
>>> http://dl.dropbox.com/u/101777488/jdk8-tl/PerfCounter/webrev.01/index.html
>>>
>>>
>>> I tried this on ARM v6 and it works much better than synchronized
>>> access, but I don't know if it's acceptable. It guarantees eventual
>>> correctness of summed value if the only operation performed is add() (no
>>> set() intermingled) and has the same possibility of incorrect half-half
>>> reads by observers as current PerfCounter has for unsynchronized
>>> observers.
>>>
>>> Here's the comparison of unpatched/patched PerfCounter.increment()
>>> micro-benchmark on single-core ARM v6 (Raspberry Pi):
>>>
>>> *** Original PerfCounter, ARM v6
>>>
>>> #
>>> # PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 1
>>> #
>>> 1 threads, Tavg = 269.34 ns/op (σ = 0.00 ns/op) [269.34]
>>> 2 threads, Tavg = 7,170.48 ns/op (σ = 410.77 ns/op) [6,783.73, 7,603.95]
>>> 3 threads, Tavg = 12,034.82 ns/op (σ = 418.99 ns/op) [11,792.33, 11,714.67, 12,639.26]
>>> 4 threads, Tavg = 16,029.76 ns/op (σ = 1,411.44 ns/op) [15,592.04, 18,511.52, 15,642.52, 14,818.16]
>>>
>>>
>>> *** Patched PerfCounter, ARM v6
>>>
>>> #
>>> # PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 1
>>> #
>>> 1 threads, Tavg = 166.21 ns/op (σ = 0.00 ns/op) [166.21]
>>> 2 threads, Tavg = 332.58 ns/op (σ = 0.12 ns/op) [332.45, 332.70]
>>> 3 threads, Tavg = 500.30 ns/op (σ = 0.22 ns/op) [500.04, 500.29, 500.58]
>>> 4 threads, Tavg = 667.95 ns/op (σ = 2.11 ns/op) [665.22, 667.18, 668.40, 671.04]
>>>
>>>
>>> Regards, Peter
>>>
>>>
>>> On 02/24/2013 11:31 AM, David Holmes wrote:
>>>> On 24/02/2013 6:50 PM, Peter Levart wrote:
>>>>> Hi David,
>>>>>
>>>>> I thought it was ok to pass null, but I don't know the "portability"
>>>>> issues in-depth. The javadoc for Unsafe says:
>>>>>
>>>>> /"This method refers to a variable by means of two parameters, and
>>>>> so it
>>>>> provides (in effect) a double-register addressing mode for Java
>>>>> variables. When the object reference is null, this method uses its
>>>>> offset as an absolute address. This is similar in operation to methods
>>>>> such as getInt(long), which provide (in effect) a single-register
>>>>> addressing mode for non-Java variables. However, because Java
>>>>> variables
>>>>> may have a different layout in memory from non-Java variables,
>>>>> programmers should not assume that these two addressing modes are ever
>>>>> equivalent. Also, programmers should remember that offsets from the
>>>>> double-register addressing mode cannot be portably confused with longs
>>>>> used in the single-register addressing mode."/
>>>>
>>>> That is the doc for getXXX but not for getAndAddXXX or
>>>> compareAndSwapXXX. You can't have null here:
>>>>
>>>> UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapLong(JNIEnv *env, jobject
>>>> unsafe, jobject obj, jlong offset, jlong e, jlong x))
>>>>   UnsafeWrapper("Unsafe_CompareAndSwapLong");
>>>>   Handle p (THREAD, JNIHandles::resolve(obj));
>>>>   jlong* addr = (jlong*)(index_oop_from_field_offset_long(p(),
>>>>       offset));
>>>>   if (VM_Version::supports_cx8())
>>>>     return (jlong)(Atomic::cmpxchg(x, addr, e)) == e;
>>>>   else {
>>>>     jboolean success = false;
>>>>     ObjectLocker ol(p, THREAD);
>>>>     if (*addr == e) { *addr = x; success = true; }
>>>>     return success;
>>>>   }
>>>> UNSAFE_END
>>>>
>>>> David
>>>> -----
>>>>
>>>>
>>>>> Does anybody know the in-depth interpretation of the above? Is it only
>>>>> the particular Java/native type differences (for example, endianness of
>>>>> variables) that these two addressing modes might interpret differently
>>>>> or something else too?
>>>>>
>>>>> Regards, Peter
>>>>>
>>>>>
>>>>> On 02/24/2013 12:39 AM, David Holmes wrote:
>>>>>> Peter,
>>>>>>
>>>>>> In your use of Unsafe you pass "null" as the object. I'm pretty
>>>>>> certain you can't pass null here. Unsafe operates on fields or array
>>>>>> elements.
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On 24/02/2013 5:39 AM, Peter Levart wrote:
>>>>>>> Hi Nils,
>>>>>>>
>>>>>>> If the counters are updated frequently from multiple threads, there
>>>>>>> might be contention/scalability issues. Instead of synchronization on
>>>>>>> updates, you might consider using atomic updates provided by
>>>>>>> sun.misc.Unsafe, like for example:
>>>>>>>
>>>>>>>
>>>>>>> Index: jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>> ===================================================================
>>>>>>> --- jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>> +++ jdk/src/share/classes/sun/misc/PerfCounter.java
>>>>>>> @@ -25,6 +25,8 @@
>>>>>>>
>>>>>>> package sun.misc;
>>>>>>>
>>>>>>> +import sun.nio.ch.DirectBuffer;
>>>>>>> +
>>>>>>> import java.nio.ByteBuffer;
>>>>>>> import java.nio.ByteOrder;
>>>>>>> import java.nio.LongBuffer;
>>>>>>> @@ -50,6 +52,8 @@
>>>>>>> public class PerfCounter {
>>>>>>> private static final Perf perf =
>>>>>>> AccessController.doPrivileged(new Perf.GetPerfAction());
>>>>>>> + private static final Unsafe unsafe =
>>>>>>> + Unsafe.getUnsafe();
>>>>>>>
>>>>>>> // Must match values defined in
>>>>>>> hotspot/src/share/vm/runtime/perfdata.hpp
>>>>>>> private final static int V_Constant = 1;
>>>>>>> @@ -59,12 +63,14 @@
>>>>>>>
>>>>>>> private final String name;
>>>>>>> private final LongBuffer lb;
>>>>>>> + private final DirectBuffer db;
>>>>>>>
>>>>>>> private PerfCounter(String name, int type) {
>>>>>>> this.name = name;
>>>>>>> ByteBuffer bb = perf.createLong(name, U_None, type, 0L);
>>>>>>> bb.order(ByteOrder.nativeOrder());
>>>>>>> this.lb = bb.asLongBuffer();
>>>>>>> + this.db = bb instanceof DirectBuffer ? (DirectBuffer) bb :
>>>>>>> null;
>>>>>>> }
>>>>>>>
>>>>>>> static PerfCounter newPerfCounter(String name) {
>>>>>>> @@ -79,23 +85,44 @@
>>>>>>> /**
>>>>>>> * Returns the current value of the perf counter.
>>>>>>> */
>>>>>>> - public synchronized long get() {
>>>>>>> + public long get() {
>>>>>>> + if (db != null) {
>>>>>>> + return unsafe.getLongVolatile(null, db.address());
>>>>>>> + }
>>>>>>> + else {
>>>>>>> + synchronized (this) {
>>>>>>> - return lb.get(0);
>>>>>>> - }
>>>>>>> + return lb.get(0);
>>>>>>> + }
>>>>>>> + }
>>>>>>> + }
>>>>>>>
>>>>>>> /**
>>>>>>> * Sets the value of the perf counter to the given newValue.
>>>>>>> */
>>>>>>> - public synchronized void set(long newValue) {
>>>>>>> + public void set(long newValue) {
>>>>>>> + if (db != null) {
>>>>>>> + unsafe.putOrderedLong(null, db.address(), newValue);
>>>>>>> + }
>>>>>>> + else {
>>>>>>> + synchronized (this) {
>>>>>>> - lb.put(0, newValue);
>>>>>>> - }
>>>>>>> + lb.put(0, newValue);
>>>>>>> + }
>>>>>>> + }
>>>>>>> + }
>>>>>>>
>>>>>>> /**
>>>>>>> * Adds the given value to the perf counter.
>>>>>>> */
>>>>>>> - public synchronized void add(long value) {
>>>>>>> - long res = get() + value;
>>>>>>> + public void add(long value) {
>>>>>>> + if (db != null) {
>>>>>>> + unsafe.getAndAddLong(null, db.address(), value);
>>>>>>> + }
>>>>>>> + else {
>>>>>>> + synchronized (this) {
>>>>>>> + long res = lb.get(0) + value;
>>>>>>> - lb.put(0, res);
>>>>>>> + lb.put(0, res);
>>>>>>> + }
>>>>>>> + }
>>>>>>> }
>>>>>>>
>>>>>>> /**
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Testing the PerfCounter.increment() method in a loop on multiple threads
>>>>>>> sharing the same PerfCounter instance, for example, on a 4-core Intel i7
>>>>>>> machine produces the following results:
>>>>>>>
>>>>>>> #
>>>>>>> # PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 8
>>>>>>> #
>>>>>>> 1 threads, Tavg = 19.02 ns/op (σ = 0.00 ns/op)
>>>>>>> 2 threads, Tavg = 109.93 ns/op (σ = 6.17 ns/op)
>>>>>>> 3 threads, Tavg = 136.64 ns/op (σ = 2.99 ns/op)
>>>>>>> 4 threads, Tavg = 293.26 ns/op (σ = 5.30 ns/op)
>>>>>>> 5 threads, Tavg = 316.94 ns/op (σ = 6.28 ns/op)
>>>>>>> 6 threads, Tavg = 686.96 ns/op (σ = 7.09 ns/op)
>>>>>>> 7 threads, Tavg = 793.28 ns/op (σ = 10.57 ns/op)
>>>>>>> 8 threads, Tavg = 898.15 ns/op (σ = 14.63 ns/op)
>>>>>>>
>>>>>>>
>>>>>>> With the presented patch, the results are a little better:
>>>>>>>
>>>>>>> #
>>>>>>> # PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 8
>>>>>>> #
>>>>>>> # Measure:
>>>>>>> 1 threads, Tavg = 5.22 ns/op (σ = 0.00 ns/op)
>>>>>>> 2 threads, Tavg = 34.51 ns/op (σ = 0.60 ns/op)
>>>>>>> 3 threads, Tavg = 54.85 ns/op (σ = 1.42 ns/op)
>>>>>>> 4 threads, Tavg = 74.67 ns/op (σ = 1.71 ns/op)
>>>>>>> 5 threads, Tavg = 94.71 ns/op (σ = 41.68 ns/op)
>>>>>>> 6 threads, Tavg = 114.80 ns/op (σ = 32.10 ns/op)
>>>>>>> 7 threads, Tavg = 136.70 ns/op (σ = 26.80 ns/op)
>>>>>>> 8 threads, Tavg = 158.48 ns/op (σ = 9.93 ns/op)
>>>>>>>
>>>>>>>
>>>>>>> The scalability is not much better, but the raw speed is, so it might
>>>>>>> present less contention when used in real-world code. If you wanted even
>>>>>>> better scalability, there is a new class in JDK8, the
>>>>>>> java.util.concurrent.LongAdder. But that doesn't buy atomic "set()" -
>>>>>>> only "add()". And it can't update native-memory variables, so it could
>>>>>>> only be used for add-only counters and in conjunction with a background
>>>>>>> thread that would periodically flush the sum to the native memory....
>>>>>>>
>>>>>>> Regards, Peter
>>>>>>>
>>>>>>>
>>>>>>> On 02/08/2013 06:10 PM, Nils Loodin wrote:
>>>>>>>> It would be interesting to know the number of thrown throwables in the
>>>>>>>> JVM, to be able to do some high level application diagnostics /
>>>>>>>> statistics. A good way to put this number would be a performance
>>>>>>>> counter, since it is accessible both from Java and from the VM.
>>>>>>>>
>>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007806
>>>>>>>> http://cr.openjdk.java.net/~nloodin/8007806/webrev.00/
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Nils Loodin
>>>>>>>
>>>>>
>>>
>
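
For reference, the 32-bit CAS idea from Peter's quoted message above (a CAS
on the low word, rarely followed by a second CAS on the high word) might
look roughly like the sketch below. It is not the code from his webrev:
SplitAddCounter is a hypothetical name, and two AtomicInteger fields stand
in for the two halves of the native 64-bit slot. As he notes, the sum is
only eventually correct, and only if add() is the sole mutator; a reader
running between the two CAS steps can see a half-updated value.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical add-only 64-bit counter built from two 32-bit CAS words. */
final class SplitAddCounter {
    private final AtomicInteger lo = new AtomicInteger(); // low 32 bits
    private final AtomicInteger hi = new AtomicInteger(); // high 32 bits

    void add(long value) {
        int vlo = (int) value;          // low half of the addend
        int vhi = (int) (value >>> 32); // high half of the addend

        // First CAS loop: add the low halves modulo 2^32.
        int oldLo;
        do {
            oldLo = lo.get();
        } while (!lo.compareAndSet(oldLo, oldLo + vlo));

        // Unsigned carry out of the low word (0 or 1).
        long sum = (oldLo & 0xFFFFFFFFL) + (vlo & 0xFFFFFFFFL);
        int hiDelta = vhi + (int) (sum >>> 32);

        // Second CAS loop, needed only when the high word changes.
        if (hiDelta != 0) {
            int oldHi;
            do {
                oldHi = hi.get();
            } while (!hi.compareAndSet(oldHi, oldHi + hiDelta));
        }
    }

    /** Unsynchronized read: exact once all add() calls have completed. */
    long get() {
        return ((long) hi.get() << 32) | (lo.get() & 0xFFFFFFFFL);
    }
}
```

For increments of 1 the second loop runs once every 2^32 calls, which is
the "rarely" in the description above.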