Strange interaction with hyperthreading on Intel hybrid CPU
Francesco Nigro
nigro.fra at gmail.com
Sun Oct 15 12:08:59 UTC 2023
To echo what @Robert Engels <rengels at ix.netcom.com> said,
https://www.moreno.marzolla.name/teaching/HPC/vol6iss1_art01.pdf is a
bit old, but still relevant.

From my understanding, in workloads where cache misses are a factor, HT
can be beneficial, because the CPU can keep feeding its frontend (or
experience fewer L3 transitions, since the two SMT threads can share the
same data). Sadly, both cache-miss-heavy and computationally intensive
tasks are lumped together as "CPU-bound", while in truth they can be
frontend- or backend-bound instead; and although not I/O intensive, a
backend-bound virtual-thread workload can be boosted by HT, simply
because of the nature of the work.

That's why I suggest (for exploration) using a proper profiler which can
report cache misses or specific CPU events.
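For example, a quick way to get those counters on Linux (the pid is a
placeholder for the JVM process, and the stalled-cycles events are only
available if the CPU exposes them):

  # hardware-counter summary over 30 seconds for a running JVM
  perf stat -e cycles,instructions,cache-references,cache-misses \
      -e stalled-cycles-frontend,stalled-cycles-backend \
      -p <jvm-pid> -- sleep 30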
On Sun, Oct 15, 2023 at 13:55 Robert Engels <rengels at ix.netcom.com> wrote:
> In my HFT experience we never used HT cores. It was almost always slower.
>
> Here’s why. The kernel scheduler’s job is to allocate work to cores. The
> more cores there are, the more management it has to do (context
> management). Usually this is OK because the increased number of cores
> runs more work.
>
> The latter point may not hold, depending on the workload. The OS does not
> have visibility into what the work profile of a particular thread is - so
> if it schedules essentially identical workloads (e.g. all integer or all
> floating point) on two logical cores (same physical core), the physical
> core can’t fully parallelize them, since they use the same components
> (typically, if one core is blocked waiting on memory, the other core can
> run a computation, etc.)
>
> The end result is that the OS spends extra effort managing the work with
> no gain = slower.
>
> My suggestion is to always turn off HT.
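> On Linux, a sketch of how this is typically done (the sysfs knob below is
> the standard kernel SMT control; verify it exists on your kernel, or use
> the "nosmt" boot parameter mentioned elsewhere in this thread):
>
>   # disable SMT siblings at runtime (requires root); re-enable with "on"
>   echo off > /sys/devices/system/cpu/smt/control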
>
> Note that HT is very different on some architectures like RISC, where the
> simple instructions and the pipeline make it easier to parallelize via
> shifting.
>
> On Oct 15, 2023, at 5:11 AM, Francesco Nigro <nigro.fra at gmail.com> wrote:
>
>
> I suggest using a profiler which can show more than the Java side here:
> async-profiler.
> But please be aware of the suggestion in
> https://github.com/async-profiler/async-profiler/issues/779#issuecomment-1651104553
> from one of the Loom team's members.
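> For reference, a typical invocation might look like this (the pid and
> output path are placeholders; check the issue linked above first):
>
>   # wall-clock profile of a running JVM for 30 seconds
>   ./profiler.sh -e wall -d 30 -f profile.html <jvm-pid>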
>
> On Wed, Oct 11, 2023 at 18:54 Michael van Acken <michael.van.acken at gmail.com>
> wrote:
>
>> Given the huge factor-of-2 difference in user time between the default
>> and the nosmt setup, I tried to use jfr to find some metric that differs
>> markedly between the two. The workload is the same: the very same task is
>> executed, leading to the expected result. This time it's 300 back-to-back
>> compilations within a single Java process. Using the threadId() of a
>> final virtual thread as a proxy, ~570k threads seem to be used overall.
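>> (The recordings below were captured along these lines; this is an assumed
>> sketch, since the exact flags of the runs are not shown here.)
>>
>>   # start a JFR recording at launch, then inspect it with the jfr tool
>>   java -XX:StartFlightRecording=filename=recording.jfr,settings=profile ...
>>   jfr view latencies-by-type recording.jfr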
>>
>> "jfr view hot-methods" does not show any significant difference, with the
>> top entry being ForkJoinPool.awaitWork() at around 5.5% in both cases.
>>
>> But "jfr view latencies-by-type" shows a large difference in its Total
>> column for "Java Thread Park". Could this be a clue where the user time
>> accumulates?
>>
>> ### with "nosmt"
>>
>> real 77.67
>> user 468.16
>> sys 13.48
>>
>> jfr view latencies-by-type recording.jfr
>> Latencies by Type
>>
>> Event Type         Count  Average     P 99  Longest      Total
>> ----------------  ------  -------  -------  -------  ---------
>> Java Thread Park  18.651  36,9 ms   310 ms   2,88 s  11 m 43 s
>> File Write             2  11,7 ms  12,6 ms  12,6 ms    23,4 ms
>>
>> ### default (without "nosmt")
>>
>> real 93.60
>> user 824.12
>> sys 23.08
>>
>> jfr view latencies-by-type recording.jfr
>> Latencies by Type
>>
>> Event Type         Count  Average     P 99  Longest      Total
>> ----------------  ------  -------  -------  -------  ---------
>> Java Thread Park  30.263  45,7 ms   256 ms   504 ms   23 m 2 s
>> File Read              1  10,9 ms  10,9 ms  10,9 ms    10,9 ms
>>
>>