Performance Questions and Poller Implementation in Project Loom
Francesco Nigro
nigro.fra at gmail.com
Wed Nov 1 21:21:35 UTC 2023
Thanks Ilya, I sadly agree with your observation: nowadays it is difficult
to find anything better for an apples-to-apples comparison...
In Quarkus, my colleague Eric De Andrea is going to provide a "quarkus
super heroes" benchmark (in collaboration with my team - the Performance
MW team), but it is still Quarkus-only, although very realistic in terms
of the mix of technologies used (and it will have a reactive vs blocking
part too). OT finished, I swear.
Returning to the lack of cache-friendly behaviour while directly
interacting with sockets, I can give my 2c...
I have very mixed feelings about thread-per-core architectures (with
shared-nothing approaches) vs what Loom offers; these are my points,
related to your fair observation:
- a thread-per-core approach (a la Netty, let's say, but Hazelcast is the
same, or, in the C++ world, the Seastar framework) allows every local
access, including accesses to socket file descriptors, to be NUMA-friendly
- by setting the affinity of a specific event loop thread to a specific
NUMA node & core - which is awesome 👍 and tremendously effective for HFT
or wherever tail latency really matters (a minimal sketch of this kind of
pinning follows after this list)
- BUT, a thread-per-core approach requires, to work at its best, a fair
distribution of resources/load, and such "rebalancing" isn't automatic:
Netty, for example, doesn't allow file descriptors to be "moved" across
event loops when some other event loop has free cycles to spare and could
pick up that work :"/ Additionally, the thread-confined lifecycle of
sockets means that connection pooling with reactive database drivers
cannot give every event loop access to all the available database
connections (which are scattered and statically partitioned among the
event loops) with the same performance: if an event loop serving a client
request needs a database connection but has exhausted the ones belonging
to it, it can use one owned by another event loop, but that costs 2 thread
handoffs, back and forth - a penalty which Loom doesn't have: every
virtual thread can hit a cache miss, but there is no thread handoff while
interacting with a borrowed db connection.
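
To make the affinity point above concrete, here is a minimal sketch - an
illustration under my own assumptions, not real Netty/Hazelcast/Seastar
code - of "one pinned event loop per core" using Netty's ThreadFactory
hook; the pinCurrentThreadTo helper is hypothetical, since the JDK exposes
no affinity API and real code would delegate to a native call or a
thread-affinity library:

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public final class PinnedEventLoops {

    // Hypothetical: the JDK has no affinity API, so this would call
    // sched_setaffinity via JNI/Panama or a thread-affinity library.
    static void pinCurrentThreadTo(int cpu) {
    }

    // One event loop per core; each backing thread pins itself to "its"
    // core before the loop starts, so every socket registered with that
    // loop is always served from the same core/NUMA node.
    public static EventLoopGroup newPinnedGroup(int cores) {
        AtomicInteger next = new AtomicInteger();
        ThreadFactory pinning = r -> {
            int cpu = next.getAndIncrement() % cores;
            return new Thread(() -> {
                pinCurrentThreadTo(cpu);
                r.run(); // the event loop runs for the thread's whole lifetime
            }, "event-loop-cpu-" + cpu);
        };
        return new NioEventLoopGroup(cores, pinning);
    }
}

Note that nothing in this setup gets rebalanced afterwards, which is
exactly the limitation of the second point.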
The connection-pooling limitation is very specific to Netty; Hazelcast,
afaik, does allow migrating file descriptors across event loops, but that
is still a periodic operation and never as natural as just acquiring a
resource and using it.
If I have to weigh 2 thread handoffs against the cache-unfriendly "cold"
access to a socket, I will probably pick the second one.
What's interesting is that with the 1-2 full cores available in this new
container world, both effects probably just fade away, or at least matter
less.
On Wed, 1 Nov 2023 at 21:22, Ilya Starchenko <st.ilya.101 at gmail.com> wrote:
> On 1 Nov 2023 at 01:51:44, Francesco Nigro <nigro.fra at gmail.com> wrote:
>
>> Techempower plaintext is highly pipelined (in the worst way, because it
>> is HTTP 1.1 and NOT HTTP 2, which is designed for that) and CPU bound,
>> due to http encoding/decoding, especially if the framework is a "proper"
>> one (see my rant at
>> https://github.com/TechEmpower/FrameworkBenchmarks/discussions/7984) and
>> materializes the headers properly; which means that an improvement in
>> that part can be responsible for achieving better numbers in techempower
>
>
> Franz,
>
>
> Thank you for the clarification. I've already noticed that some of the
> Techempower benchmarks don't accurately represent real-world scenarios, but
> I haven't found another benchmark that would be more representative. I'll
> try profiling and perhaps look for alternative benchmarks (I've heard that
> the Quarkus team is working on some benchmarks).
>