Strange interaction with hyperthreading on Intel hybrid CPU
Robert Engels
rengels at ix.netcom.com
Sun Oct 15 11:55:39 UTC 2023
In my HFT experience we never used HT cores. It was almost always slower.
Here’s why. The kernel scheduler’s job is to allocate work to cores. The more cores there are, the more management it has to do (context management). Usually this is fine, because the increased number of cores runs more work.
The latter point may not hold, depending on the workload. The OS has no visibility into the work profile of a particular thread - so if it schedules essentially identical workloads (e.g. all integer or all floating point) on two logical cores of the same physical core, the physical core can’t fully parallelize them, since they contend for the same execution units. SMT typically pays off when one logical core is blocked waiting on memory while the other runs a computation, etc.
The end result is that the OS spends extra effort managing the work with no gain = slower.
My suggestion is to always turn off HT.
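For what it’s worth, here is a minimal sketch of how to check from Java whether SMT is active and which logical CPUs share a physical core. It assumes Linux and the standard sysfs layout; the class name is illustrative:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Minimal sketch (Linux only): report SMT status and which logical
    // CPUs are siblings on the same physical core, via sysfs.
    public class SmtTopology {
        public static void main(String[] args) throws IOException {
            Path smt = Path.of("/sys/devices/system/cpu/smt/active");
            if (Files.exists(smt)) {
                // "1" means SMT is active, "0" means it is off (e.g. booted with nosmt)
                System.out.println("SMT active: " + Files.readString(smt).trim());
            }
            int cpus = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < cpus; i++) {
                Path siblings = Path.of("/sys/devices/system/cpu/cpu" + i
                        + "/topology/thread_siblings_list");
                if (Files.exists(siblings)) {
                    // e.g. "0,8": logical CPUs 0 and 8 share one physical core
                    System.out.println("cpu" + i + ": " + Files.readString(siblings).trim());
                }
            }
        }
    }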
Note that hardware multithreading is very different on some architectures, e.g. simple RISC designs, where the uniform instructions and pipeline make it easier to parallelize by shifting between hardware threads.
> On Oct 15, 2023, at 5:11 AM, Francesco Nigro <nigro.fra at gmail.com> wrote:
>
>
> I suggest using a profiler that can show more than the Java side here: async-profiler.
> But please be aware of the suggestion https://github.com/async-profiler/async-profiler/issues/779#issuecomment-1651104553 from one of the Loom team's members.
>
> On Wed, Oct 11, 2023 at 18:54, Michael van Acken <michael.van.acken at gmail.com> wrote:
>> Given the huge difference - a factor of 2 in user time between the default and the nosmt setup - I tried to use jfr to find some metric that differs markedly between the two. The workload is the same: the very same task is executed, leading to the expected result. This time it's 300 back-to-back compilations within a single java process. Using the threadId() of a final virtual thread as a proxy, ~570k threads seem to be utilized overall.
>>
>> "jfr view hot-methods" does not show any significant difference, with the top entry being ForkJoinPool.awaitWork() at around 5.5% in both cases.
>>
>> But "jfr view latencies-by-type" shows a large difference in its Total column for "Java Thread Park". Could this be a clue where the user time accumulates?
>>
>> ### with "nosmt"
>>
>> real 77.67
>> user 468.16
>> sys 13.48
>>
>> jfr view latencies-by-type recording.jfr
>> Latencies by Type
>>
>> Event Type Count Average P 99 Longest Total
>> -------------------------------------- ------ ------- ------- ------- ---------
>> Java Thread Park 18.651 36,9 ms 310 ms 2,88 s 11 m 43 s
>> File Write 2 11,7 ms 12,6 ms 12,6 ms 23,4 ms
>>
>> ### default (without "nosmt")
>>
>> real 93.60
>> user 824.12
>> sys 23.08
>>
>> jfr view latencies-by-type recording.jfr
>> Latencies by Type
>>
>> Event Type Count Average P 99 Longest Total
>> --------------------------------------- ------ ------- ------- ------- --------
>> Java Thread Park 30.263 45,7 ms 256 ms 504 ms 23 m 2 s
>> File Read 1 10,9 ms 10,9 ms 10,9 ms 10,9 ms
>>
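Regarding the "Java Thread Park" totals above: one way to cross-check them outside the jfr view is to read the jdk.ThreadPark events from the recording with the JFR consumer API. A minimal sketch, assuming the file is named recording.jfr as in the commands above (readAllEvents loads the whole recording into memory, so it only suits small files):

    import java.nio.file.Path;
    import java.time.Duration;
    import jdk.jfr.consumer.RecordedEvent;
    import jdk.jfr.consumer.RecordingFile;

    // Minimal sketch: sum the durations of all jdk.ThreadPark events,
    // which should roughly match the Total column of
    // "jfr view latencies-by-type".
    public class ParkTotal {
        public static void main(String[] args) throws Exception {
            long count = 0;
            Duration total = Duration.ZERO;
            for (RecordedEvent e : RecordingFile.readAllEvents(Path.of("recording.jfr"))) {
                if (e.getEventType().getName().equals("jdk.ThreadPark")) {
                    count++;
                    total = total.plus(e.getDuration());
                }
            }
            System.out.println("Java Thread Park: count=" + count + ", total=" + total);
        }
    }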