Review request for 6977034 Thread.getState() very slow

Sun Dec 5 22:58:48 UTC 2010

Brian Goetz said the following on 12/06/10 06:22:
> Its worth noting that the real performance offender here is that thread 
> states are enums (and therefore object references).  If they were 
> integers, then the entire thing could most likely be done in a way that 
> is guaranteed branch-free and dcache-miss-free.
> 
> Agree with Eamonn that the current sequence-of-tests is probably good 
> enough.  Might be worth doing a little profiling to choose optimal order 
> of tests; I would guess that the right order is RUNNABLE, WAITING, 
> TIMED_WAITING, BLOCKED, NEW, TERMINATED, but that's just an OOMH 
> (out-of-my-hat) guess.

I had already had this discussion with Mandy, including the idea that a 
lookup table would be faster - if there were a simple way to construct 
it. The current order of tests is based on our discussion. No doubt 
Runnable should be first, and new/terminated last. The rest of somewhat 
subjective. As sync is used more than wait/notify then BLOCKED would be 
next most likely. For waiting vs timed-waiting it depends on how 
defensively people code their waits :)

Anyway, this was all premised on slow performance that Doug observed as 
part of ForkJoin mechanics. It would be good to hear back from him on 
how this updated approach performs. If the FJ code has changed such that 
this is no longer an issue then I would suggest that Mandy's changes are 
"good enough" and we let her move on.

Cheers,
David

  There are probably not enough bits here to
> justify a binary search (which trades off best-case against 
> average/worst-case time, but might turn up a combination of bits that 
> has better branch prediction characteristics.)
> 
> 
> On 12/5/2010 2:00 PM, Eamonn McManus wrote:
>> Yeah, it was a bit blithe of me to write that the sequence of tests was
>> faster. In the table-lookup version, if you get rid of the initial 
>> test for
>> RUNNABLE, and if you use Integer.numberOfLeadingZeros, and if the JIT 
>> compiler
>> intrinsifies that to a native processor instruction, and if the lookup 
>> table
>> is in L1 cache, then the table-lookup version will run in constant 
>> time and be
>> better than the worst case of the sequence-of-tests version, and probably
>> better than the average case too. But, as you say, that last /if/ (the 
>> cache
>> hit) will usually not be true, and in that case I would not be 
>> surprised if
>> the sequence of tests were faster even in its worst case.
>>
>> Anyway the sequence-of-tests version is unquestionably simpler, and I 
>> would
>> venture that the best solution is probably to go with that, plus a new 
>> method
>> in the API that explicitly tests whether a thread is runnable. That's 
>> trivial
>> to implement now that Mandy has pulled the knowledge of state bits 
>> into the
>> Java code rather than being hidden in the bowels of the VM; and its
>> implementation will be faster than (Thread.getState() == RUNNABLE) 
>> regardless
>> of the implementation of the latter.
>>
>> Éamonn
>>
>>
>> On 5/12/10 8:27 AM, Brian Goetz wrote:
>>> As Eamonn writes it, it will never cache miss but may frequently branch
>>> mispredict (possibly multiple times).  If you do a shift + mask + 
>>> index into
>>> a small table, it will cache miss most the time but never branch 
>>> mispredict.
>>> (In a real program it will cache miss frequently since thread state 
>>> calls
>>> are infrequent and the lookup table will fall out of cache; in a
>>> microbenchmark it will almost never cache miss as the lookup table 
>>> will be
>>> hot.)
>>>
>>> On 12/4/2010 7:22 AM, Eamonn McManus wrote:
>>>> Hi Mandy,
>>>>
>>>> This test:
>>>>
>>>>          if ((threadStatus&  JVMTI_THREAD_STATE_RUNNABLE) == 1) {
>>>>
>>>> is always false, since JVMTI_THREAD_STATE_RUNNABLE is 4. (NetBeans 7.0
>>>> helpfully flags this; I'm not sure if earlier versions do.)
>>>>
>>>> But, once corrected, I think you could use this idea further to 
>>>> write a much
>>>> simpler and faster method, on these lines:
>>>>
>>>>      public static Thread.State toThreadState(int threadStatus) {
>>>>          if ((threadStatus&  JVMTI_THREAD_STATE_RUNNABLE)*!= 0*) {
>>>>              return RUNNABLE;
>>>>          } else if ((threadStatus&
>>>> JVMTI_THREAD_STATE_BLOCKED_ON_MONITOR_ENTER) != 0) {
>>>>              return BLOCKED;
>>>>          } else if ((threadStatus&
>>>> JVMTI_THREAD_STATE_WAITING_WITH_TIMEOUT) != 0) {
>>>>              return TIMED_WAITING;
>>>>          } else if ((threadStatus&
>>>> JVMTI_THREAD_STATE_WAITING_INDEFINITELY) != 0) {
>>>>              return WAITING;
>>>>          } else if ((threadStatus&  JVMTI_THREAD_STATE_TERMINATED) 
>>>> != 0) {
>>>>              return TERMINATED;
>>>>          } else {
>>>>              return NEW;
>>>>          }
>>>>      }
>>>>
>>>> You could tweak the order of the tests based on what might be the 
>>>> relative
>>>> frequency of the different states but it probably isn't worth it.
>>>>
>>>> Regards,
>>>>
>>>> Éamonn
>>>>
>>>>
>>>> On 3/12/10 11:52 PM, Mandy Chung wrote:
>>>>> Fix for 6977034: Thread.getState() very slow
>>>>>
>>>>> Webrev at:
>>>>> http://cr.openjdk.java.net/~mchung/6977034/webrev.00/
>>>>>
>>>>> This is an improvement to map a Thread's threadStatus field to 
>>>>> Thread.State.
>>>>> The VM updates the Thread.threadStatus field directly at state 
>>>>> transition
>>>>> with the value as defined in JVM TI [1]. The 
>>>>> java.lang.Thread.getState()
>>>>> implementation can directly access the threadStatus value and do a 
>>>>> direct
>>>>> lookup from an array of Thread.State. The threadStatus value is a 
>>>>> bit vector
>>>>> and we would have to create an array of a minimum of 1061 (0x425) 
>>>>> elements
>>>>> to do direct mapping. I took the approach to use the first highest 
>>>>> order bit
>>>>> set to 1 in the masked threadStatus value as the index to the 
>>>>> Thread.State
>>>>> element and only caches 32 elements (could be fewer). I wrote a
>>>>> micro-benchmark measuring the Thread.getState of a thread in 
>>>>> different state
>>>>> that shows 1.7X to 6X speedup (see below). There is possibly some 
>>>>> issue with
>>>>> my micro-benchmark that I didn't observe the 14X speed up as Doug 
>>>>> did in his
>>>>> experiment. However, I'd like to get this reviewed and pushed to the
>>>>> repository so that anyone can do more experiment on the performance
>>>>> measurement.
>>>>>
>>>>> Thanks
>>>>> Mandy
>>>>> P.S. The discussion on this thread can be found at [2] [3].
>>>>>
>>>>> [1]
>>>>> http://download.java.net/jdk7/docs/platform/jvmti/jvmti.html#GetThreadState 
>>>>>
>>>>> [2] 
>>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2010-July/004567.html 
>>>>>
>>>>> [3]
>>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2010-August/004721.html 
>>>>>
>>>>>
>>>>>
>>>>>     JDK 7 b120 (in ms)    With fix (in ms)    Speed up
>>>>> main        46465            22772            2.04
>>>>> NEW        50676        29921            1.69
>>>>> RUNNABLE    42202        14690            2.87
>>>>> BLOCKED        72773        12296            5.92
>>>>> WAITING        48811        13041            3.74
>>>>> TIMED_WAITING    45737        12849            3.56
>>>>> TERMINATED    40314        16376            2.46
>>>>>