Review request for 6977034 Thread.getState() very slow

Sun Dec 5 20:22:42 UTC 2010

Its worth noting that the real performance offender here is that thread states 
are enums (and therefore object references).  If they were integers, then the 
entire thing could most likely be done in a way that is guaranteed branch-free 
and dcache-miss-free.

Agree with Eamonn that the current sequence-of-tests is probably good enough. 
  Might be worth doing a little profiling to choose optimal order of tests; I 
would guess that the right order is RUNNABLE, WAITING, TIMED_WAITING, BLOCKED, 
NEW, TERMINATED, but that's just an OOMH (out-of-my-hat) guess.  There are 
probably not enough bits here to justify a binary search (which trades off 
best-case against average/worst-case time, but might turn up a combination of 
bits that has better branch prediction characteristics.)

On 12/5/2010 2:00 PM, Eamonn McManus wrote:
> Yeah, it was a bit blithe of me to write that the sequence of tests was
> faster. In the table-lookup version, if you get rid of the initial test for
> RUNNABLE, and if you use Integer.numberOfLeadingZeros, and if the JIT compiler
> intrinsifies that to a native processor instruction, and if the lookup table
> is in L1 cache, then the table-lookup version will run in constant time and be
> better than the worst case of the sequence-of-tests version, and probably
> better than the average case too. But, as you say, that last /if/ (the cache
> hit) will usually not be true, and in that case I would not be surprised if
> the sequence of tests were faster even in its worst case.
>
> Anyway the sequence-of-tests version is unquestionably simpler, and I would
> venture that the best solution is probably to go with that, plus a new method
> in the API that explicitly tests whether a thread is runnable. That's trivial
> to implement now that Mandy has pulled the knowledge of state bits into the
> Java code rather than being hidden in the bowels of the VM; and its
> implementation will be faster than (Thread.getState() == RUNNABLE) regardless
> of the implementation of the latter.
>
> Éamonn
>
>
> On 5/12/10 8:27 AM, Brian Goetz wrote:
>> As Eamonn writes it, it will never cache miss but may frequently branch
>> mispredict (possibly multiple times).  If you do a shift + mask + index into
>> a small table, it will cache miss most the time but never branch mispredict.
>> (In a real program it will cache miss frequently since thread state calls
>> are infrequent and the lookup table will fall out of cache; in a
>> microbenchmark it will almost never cache miss as the lookup table will be
>> hot.)
>>
>> On 12/4/2010 7:22 AM, Eamonn McManus wrote:
>>> Hi Mandy,
>>>
>>> This test:
>>>
>>>          if ((threadStatus&  JVMTI_THREAD_STATE_RUNNABLE) == 1) {
>>>
>>> is always false, since JVMTI_THREAD_STATE_RUNNABLE is 4. (NetBeans 7.0
>>> helpfully flags this; I'm not sure if earlier versions do.)
>>>
>>> But, once corrected, I think you could use this idea further to write a much
>>> simpler and faster method, on these lines:
>>>
>>>      public static Thread.State toThreadState(int threadStatus) {
>>>          if ((threadStatus&  JVMTI_THREAD_STATE_RUNNABLE)*!= 0*) {
>>>              return RUNNABLE;
>>>          } else if ((threadStatus&
>>> JVMTI_THREAD_STATE_BLOCKED_ON_MONITOR_ENTER) != 0) {
>>>              return BLOCKED;
>>>          } else if ((threadStatus&
>>> JVMTI_THREAD_STATE_WAITING_WITH_TIMEOUT) != 0) {
>>>              return TIMED_WAITING;
>>>          } else if ((threadStatus&
>>> JVMTI_THREAD_STATE_WAITING_INDEFINITELY) != 0) {
>>>              return WAITING;
>>>          } else if ((threadStatus&  JVMTI_THREAD_STATE_TERMINATED) != 0) {
>>>              return TERMINATED;
>>>          } else {
>>>              return NEW;
>>>          }
>>>      }
>>>
>>> You could tweak the order of the tests based on what might be the relative
>>> frequency of the different states but it probably isn't worth it.
>>>
>>> Regards,
>>>
>>> Éamonn
>>>
>>>
>>> On 3/12/10 11:52 PM, Mandy Chung wrote:
>>>> Fix for 6977034: Thread.getState() very slow
>>>>
>>>> Webrev at:
>>>> http://cr.openjdk.java.net/~mchung/6977034/webrev.00/
>>>>
>>>> This is an improvement to map a Thread's threadStatus field to Thread.State.
>>>> The VM updates the Thread.threadStatus field directly at state transition
>>>> with the value as defined in JVM TI [1]. The java.lang.Thread.getState()
>>>> implementation can directly access the threadStatus value and do a direct
>>>> lookup from an array of Thread.State. The threadStatus value is a bit vector
>>>> and we would have to create an array of a minimum of 1061 (0x425) elements
>>>> to do direct mapping. I took the approach to use the first highest order bit
>>>> set to 1 in the masked threadStatus value as the index to the Thread.State
>>>> element and only caches 32 elements (could be fewer). I wrote a
>>>> micro-benchmark measuring the Thread.getState of a thread in different state
>>>> that shows 1.7X to 6X speedup (see below). There is possibly some issue with
>>>> my micro-benchmark that I didn't observe the 14X speed up as Doug did in his
>>>> experiment. However, I'd like to get this reviewed and pushed to the
>>>> repository so that anyone can do more experiment on the performance
>>>> measurement.
>>>>
>>>> Thanks
>>>> Mandy
>>>> P.S. The discussion on this thread can be found at [2] [3].
>>>>
>>>> [1]
>>>> http://download.java.net/jdk7/docs/platform/jvmti/jvmti.html#GetThreadState
>>>> [2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2010-July/004567.html
>>>> [3]
>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2010-August/004721.html
>>>>
>>>>
>>>>     JDK 7 b120 (in ms)    With fix (in ms)    Speed up
>>>> main        46465            22772            2.04
>>>> NEW        50676        29921            1.69
>>>> RUNNABLE    42202        14690            2.87
>>>> BLOCKED        72773        12296            5.92
>>>> WAITING        48811        13041            3.74
>>>> TIMED_WAITING    45737        12849            3.56
>>>> TERMINATED    40314        16376            2.46
>>>>