<div dir="auto"><div>To echo what <span class="gmail_chip gmail_plusreply" dir="auto"><a href="mailto:rengels@ix.netcom.com" style="color:#15c;text-decoration:underline" rel="noreferrer noreferrer" target="_blank">@robert engels</a></span><span> said, </span><a href="https://www.moreno.marzolla.name/teaching/HPC/vol6iss1_art01.pdf" target="_blank" rel="noreferrer">https://www.moreno.marzolla.name/teaching/HPC/vol6iss1_art01.pdf</a> which is a bit old, but relevant enough..</div><div dir="auto">From my understanding, the workload where cache misses are a factor, HT can be beneficial, because the CPU can keep on feeding the CPU frontend (or experience less L3 transitions because 2 SMP thread can share the same data, actually). Sadly both cache misses and computational intensive tasks are both considered CPU-bound scenarios, while tbh they can be frontend/backend bound instead, and although not I/O intensive, if backend-bound, HT can boost VT workload, but just because due to the nature of workload...</div><div dir="auto">That's why I suggest (for exploration) to use a proper profiler which can report cache misses or specific CPU events. <br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">Il dom 15 ott 2023, 13:55 Robert Engels <<a href="mailto:rengels@ix.netcom.com" rel="noreferrer noreferrer" target="_blank">rengels@ix.netcom.com</a>> ha scritto:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div dir="ltr"></div><div dir="ltr">In my HFT experience we never used HT cores. It was almost always slower. </div><div dir="ltr"><br></div><div dir="ltr">Here’s why. The kernel scheduler job is to allocate work to cores. The more cores the more management is has to do (context management) Usually this is ok because the increased number of cores runs more work. </div><div dir="ltr"><br></div><div dir="ltr">The latter point may not hold based on workload. The OS does not have visibility into what the work profile for a particular thread is - so if it scheduler essentially identical workloads (e.g all integer or all floating point) on two logical cores (same physical core) the physical can’t fully parallelize them (since they use the same components - typically is one core is blocked waiting on memory the other core can run a computation, etc)</div><div dir="ltr"><br></div><div dir="ltr">The end result is that the OS spends extra effort managing the work with no gain = slower. </div><div dir="ltr"><br></div><div dir="ltr">My suggestion is to always turn off HT. </div><div dir="ltr"><br></div><div dir="ltr">Note that HT is very different on some architectures like RISC where the simple instructions and the pipeline make it easier to parallelize via shifting. 
</div><div dir="ltr"><br><blockquote type="cite">On Oct 15, 2023, at 5:11 AM, Francesco Nigro <<a href="mailto:nigro.fra@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">nigro.fra@gmail.com</a>> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="auto">I suggest to use a profiler which can show more than the java side here, async profiler.<div dir="auto">But please beware the suggestion <a href="https://github.com/async-profiler/async-profiler/issues/779#issuecomment-1651104553" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/async-profiler/async-profiler/issues/779#issuecomment-1651104553</a> from one of the Loom team's member.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Il mer 11 ott 2023, 18:54 Michael van Acken <<a href="mailto:michael.van.acken@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">michael.van.acken@gmail.com</a>> ha scritto:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div dir="ltr">Given the huge difference of a factor of 2 in user time between the default and the nosmt setup, I tried to use jfr to find some metric that differs markedly between the two. The workload is the same: the very same task is executed leading to the expected result. This time it's 300 back to back compilations within a single java process. Using the threadId() of a final virtual thread as proxy, ~570k threads seem to be utilized overall.<br><br>"jfr view hot-methods" does not show any significant difference, with the top entry being ForkJoinPool.awaitWork() at around 5.5% in both cases.<br><br>But "jfr view latencies-by-type" shows a large difference in its Total column for "Java Thread Park". Could this be a clue where the user time accumulates?<br><br>### with "nosmt"<br><br>real 77.67<br>user 468.16<br>sys 13.48<br><br>jfr view latencies-by-type recording.jfr<br> Latencies by Type<br><br>Event Type Count Average P 99 Longest Total<br>-------------------------------------- ------ ------- ------- ------- ---------<br>Java Thread Park 18.651 36,9 ms 310 ms 2,88 s 11 m 43 s<br>File Write 2 11,7 ms 12,6 ms 12,6 ms 23,4 ms<br><br>### default (without "nosmt")<br><br>real 93.60<br>user 824.12<br>sys 23.08<br><br>jfr view latencies-by-type recording.jfr<br> Latencies by Type<br><br>Event Type Count Average P 99 Longest Total<br>--------------------------------------- ------ ------- ------- ------- --------<br>Java Thread Park 30.263 45,7 ms 256 ms 504 ms 23 m 2 s<br>File Read 1 10,9 ms 10,9 ms 10,9 ms 10,9 ms</div><div dir="ltr"><br></div></div>