Strange interaction with hyperthreading on Intel hybrid CPU

Francesco Nigro nigro.fra at gmail.com
Sun Oct 15 14:31:13 UTC 2023


For HT I am not that negative, but given that x86 cannot extract enough ILP
and has to perform all sorts of tricks to keep the CPU pipeline busy (not to
mention the variable-sized instructions forcing all sorts of caching to be
added), SMT looks like the solution to an architectural problem
(oversubscription to achieve full utilization, which is fun?), a way to
squeeze the most out of it.
Having a separate interrupt controller per logical core also means being able
to handle interrupts separately, which is not a bad thing for I/O-driven
workloads either (but I have to refresh my memory on this, I could be very
wrong).

What you have discovered is interesting and valuable: it is the same problem
as the so-called CPU-usage metrics (broken regardless), especially because
they count a logical core on par with a full-fat one. Similarly, the JVM
heuristics base their assumptions on all cores being "equal", and funnily
enough, with the new P/E cores that assumption is even more invalid than it
already was for HT. Thanks for sharing this :)
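
As a quick way to see the "all cores are equal" assumption in action, here
is a minimal sketch (plain Java, nothing exotic assumed): as far as I
remember, HotSpot derives its ergonomics, including the C1/C2 compiler
thread count, from the logical CPU count, which makes no distinction
between physical cores, SMT siblings or P/E cores.

    public class CpuCount {
        public static void main(String[] args) {
            // On an 8-core/16-thread part this prints 16; ergonomics
            // (compiler threads, GC parallelism, common pool size) are
            // sized from this number, treating every logical CPU as a
            // full core.
            System.out.println("availableProcessors = "
                    + Runtime.getRuntime().availableProcessors());
        }
    }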

I can reiterate: if you collect the CPU profiling data with async-profiler
per thread (-t), you would spot it in one go :P
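
For reference, a typical invocation would be something along these lines
(paths and PID are placeholders, adjust to your install):

    ./profiler.sh -e cpu -t -d 30 -f profile.html <pid>

or, attaching the agent at startup:

    java -agentpath:/path/to/libasyncProfiler.so=start,event=cpu,threads,file=profile.html ...

The same tool can also record hardware events (e.g. -e cache-misses) on
machines where perf events are available, which is the "proper profiler"
suggestion from the quoted message below.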

On Sun, Oct 15, 2023, 16:09 Michael van Acken <michael.van.acken at gmail.com>
wrote:

> On Sun, Oct 15, 2023 at 14:09, Francesco Nigro <
> nigro.fra at gmail.com> wrote:
>
>> To echo what @robert engels <rengels at ix.netcom.com> said,
>> https://www.moreno.marzolla.name/teaching/HPC/vol6iss1_art01.pdf which
>> is a bit old, but relevant enough..
>> From my understanding, in workloads where cache misses are a factor HT
>> can be beneficial, because the CPU can keep on feeding the frontend (or
>> experience fewer L3 transitions, because 2 SMT threads can actually share
>> the same data). Sadly, both cache-miss-heavy and computationally intensive
>> tasks are considered CPU-bound scenarios, while tbh they can be
>> frontend/backend bound instead; and although not I/O intensive, a
>> backend-bound VT workload can be boosted by HT, but just due to the nature
>> of the workload...
>> That's why I suggest (for exploration) using a proper profiler which can
>> report cache misses or specific CPU events.
>>
>
> I've seen HT described as a cheap means (with regard to transistor count
> or chip area) to push some workloads/benchmarks higher.  Such a
> cost/benefit analysis probably makes sense for a consumer CPU like this one.
>
> For this reason I'm not surprised that HT does not deliver much upside or
> downside for my workload.  Memory access patterns of compilers tend to be
> irregular anyway.  What brought this situation to my attention was the
> considerable additional resource consumption when HT is enabled.  But this
> seems to be only the result of a c1/c2 compilation ergonomics decision on
> JVM startup, caused by counting "HT cores" as "full cores".
>
> The CPU events recorded by async-profiler support this story:
>
> 8+0+0: 41750 samples total, with 23160 (55.47%) under
> CompileBroker::compiler_thread_loop()
> 8+8+0: 78431 samples total, with 54435 (69.40%)
> under CompileBroker::compiler_thread_loop()
>
> The additional resources spent on compilation do not pay off here, neither
> in the first iteration nor over 200 of them, and they even cannibalize the
> work virtual threads could be doing for the application instead.
>
> -- mva
>
>
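
P.S. On the ergonomics side, it should be easy both to confirm that decision
and to take it away from the heuristic with standard HotSpot flags; for
example, something along the lines of

    java -XX:+PrintFlagsFinal -version | grep CICompilerCount

shows how many C1/C2 threads the VM picked, and -XX:CICompilerCount=<n>
(or -XX:ActiveProcessorCount=<n>, which caps the CPU count the whole
ergonomics pipeline sees) pins it explicitly. Worth double-checking against
your exact JDK build, of course.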