Performance Questions and Poller Implementation in Project Loom
Francesco Nigro
nigro.fra at gmail.com
Tue Oct 31 19:51:44 UTC 2023
Hi Ilya,
I am one of the developers who improved Netty/Vertx/Quarkus (by more than
30-50%) for exactly that benchmark (and not just that one: the whole
HTTP 1.1 stack), and the problem is that people should start profiling
before drawing conclusions about which component to blame (no pun intended;
sadly, the same goes for those who have wrongly used that same benchmark as
the main proof that Loom is good). TechEmpower plaintext is highly
pipelined (in the worst way, because it is HTTP 1.1 and NOT HTTP 2, which
is designed for that) and CPU bound, due to HTTP encoding/decoding,
especially if the framework is a "proper" one that materializes the headers
correctly (see my rant at
https://github.com/TechEmpower/FrameworkBenchmarks/discussions/7984); which
means that an improvement in that part can be what delivers better numbers
in TechEmpower. If the framework is "smart" enough (e.g. it cheats by not
decoding the received headers), the bottleneck can then move to syscall
cost (which I improved in Netty by using io_uring or by replacing
read/write with recv/send), but even then you still hit the physical limits
of the NIC, which cap the maximum achievable throughput at ~7 M req/sec,
making all high-level frameworks look the same (again: without profiling
CPU usage, they look the same).
Helidon, like Quarkus/Vertx/Netty and Undertow (whose internals I know
fairly well: it is non-blocking for that test and has very efficient HTTP
decoding/encoding, better than Netty's out of the box), is maxing out the
CPU, and there is very little Loom in the profiling data, hence I would
look elsewhere. You can profile it fairly easily on a single thread too
(taking care to disable JVMTI thread-state notifications, which otherwise
severely affect Loom) and verify my comments, in case.
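For example, a hypothetical setup along those lines (flag and property
names are taken from recent JDK builds and async-profiler; this is a
sketch, not the exact setup used, so verify the names against your
versions):

```shell
# Sketch: profile a Loom-based server on a single carrier thread.
# -Djdk.virtualThreadScheduler.parallelism=1 caps the virtual-thread
#   scheduler to one carrier thread, making single-threaded profiling easy.
# -XX:-DoJVMTIVirtualThreadTransitions turns off JVMTI virtual-thread
#   mount/unmount notifications, which can severely distort Loom profiles
#   when a JVMTI agent (like a profiler) is attached.
java -Djdk.virtualThreadScheduler.parallelism=1 \
     -XX:-DoJVMTIVirtualThreadTransitions \
     -jar your-server.jar &

# Then attach async-profiler in CPU mode to the running JVM
# (binary name and options as in async-profiler 3.x):
./asprof -d 30 -f profile.html <pid>
```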
Hope this helps,
Franz
On Tue, Oct 31, 2023 at 20:15 Ilya Starchenko <st.ilya.101 at gmail.com>
wrote:
> Hello loom-dev team,
>
>
> I would like to express my gratitude for the work being done on Project
> Loom. It's an exciting project with a lot of potential. I have some
> questions related to its performance that I would appreciate some
> clarification on.
>
>
> I recently came across a benchmark presentation by Alan Bateman at Devoxx
> <https://youtu.be/XF4XZlPZc_c?si=-Qp2PampTbNGj3a5>, where the Helidon Nima
> framework demonstrated better performance results compared to a reactive
> framework. However, when I examined the Plaintext benchmark (specifically
> focusing on Netty and Undertow, which only benchmark plaintext), I noticed
> that Nima, which operates entirely on virtual threads, failed to outperform
> even the blocking Undertow. Additionally, I conducted tests with Tomcat and
> Jetty using Loom's executor, and they also did not exhibit significant
> improvements compared to a reactive stack. Perhaps I should ask the Helidon
> team, but my question is, is this the expected performance level for
> Project Loom, or can we anticipate better performance in the future?
>
>
> Furthermore, I took a closer look at the Poller implementation
> <https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/nio/ch/Poller.java#L436>,
> and I noticed that it utilizes only one thread (by default) for both read
> and write polling. I'm curious why there's only one thread, and wouldn't it
> be more efficient to have pollers matching the number of CPU cores for
> optimal performance?
>
>
> I look forward to your insights and guidance regarding these performance
> concerns. Your expertise and feedback would be greatly appreciated.
>
>
> - Ilya
>
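For reference on the poller question: the number of read/write pollers can
be tuned via internal system properties (a sketch; `jdk.readPollers` and
`jdk.writePollers` are unsupported, internal knobs read by
`sun.nio.ch.Poller`, and their names and semantics may change between JDK
versions). A minimal program whose blocking socket I/O on virtual threads
goes through those pollers:

```java
import java.io.*;
import java.net.*;
import java.util.concurrent.*;

public class PollerDemo {
    public static void main(String[] args) throws Exception {
        // Blocking socket reads on virtual threads are multiplexed by the
        // JDK's internal pollers. Launching with e.g.
        //   java -Djdk.readPollers=4 -Djdk.writePollers=2 PollerDemo
        // changes the poller counts (internal, unsupported properties; the
        // current implementation expects a power of two).
        try (ServerSocket server = new ServerSocket(0);
             ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            int port = server.getLocalPort();
            // Server side: one virtual thread accepts a connection and replies.
            exec.submit(() -> {
                try (Socket s = server.accept();
                     OutputStream out = s.getOutputStream()) {
                    out.write("pong".getBytes());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            // Client side: a blocking read that parks the virtual thread
            // until the poller signals readability.
            try (Socket client = new Socket("127.0.0.1", port);
                 InputStream in = client.getInputStream()) {
                System.out.println(new String(in.readAllBytes())); // prints "pong"
            }
        }
    }
}
```

Whether more pollers help depends on the workload; as the discussion above
suggests, profiling should decide, not assumptions.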