Jetty and Loom
Greg Wilkins
gregw at webtide.com
Mon Jan 4 17:55:39 UTC 2021
Hey Alan,
this is going to be a bit of a monster reply on several topics... but there's
no easier way than just going for it....
see inline below:
On Mon, 4 Jan 2021 at 16:04, Alan Bateman <Alan.Bateman at oracle.com> wrote:
> The experiment with the CometD chat is interesting. Would it be possible
> for upcoming part 3 to include the code, or links, for both the
> async and simpler synchronous examples? Loom is all about code that is
> simpler to maintain, debug, and profile and will be interesting to see
> if that is so without sacrificing scalability compared to the async code
> in this example. If I read the output correctly in part 2 then the
> duration of the 1000 client run is about 10s (do I have that right?) and you
> might need to run for longer to ensure that everything is warmed up.
>
Currently we have just been using stock CometD, since it already had the
option to pick between blocking IO and async IO transports. However, as I
posted on Reddit, we now realize that in order to get the best benefit from
Loom we need to write a new transport for CometD that doesn't even use
async servlets, so there will be no async at all in that application.
That code will of course be available and hopefully will just become part
of a standard release.
However, I'm not so sure that it will be a great example of code that is
simplified by blocking vs async. The nature of the app is that it is
already event based anyway, so it is a good fit for async, and either way
we've already abstracted out all the differences behind a Transport
abstraction. So I don't think it will really showcase the full possible
horror show that writing async applications can be. That said, I'm
sure the transports themselves will be a reasonable example, as we will
have: a blocking servlet with blocking IO; an async servlet with blocking IO;
and an async servlet with async IO, all implementing the same Transport
abstraction. So I think the simplicity of blocking vs async will still be on
show.
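To give a feel for the difference, here is a rough sketch of what the
blocking read and the async read look like (process() is a hypothetical
placeholder, and the surrounding HttpServlet subclass and javax.servlet
imports are elided; this is not the actual CometD transport code):

    // Blocking servlet + blocking IO: read the body on the request thread.
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException
    {
        ServletInputStream in = req.getInputStream();
        byte[] buffer = new byte[4096];
        int len;
        while ((len = in.read(buffer)) != -1)
            process(buffer, len);          // hypothetical message handling
        resp.setStatus(200);
    }

    // Async servlet + async IO (Servlet 3.1): the same logic spread over callbacks.
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException
    {
        AsyncContext async = req.startAsync();
        ServletInputStream in = req.getInputStream();
        in.setReadListener(new ReadListener()
        {
            public void onDataAvailable() throws IOException
            {
                byte[] buffer = new byte[4096];
                while (in.isReady())
                {
                    int len = in.read(buffer);
                    if (len == -1)
                        return;
                    process(buffer, len);  // hypothetical message handling
                }
            }

            public void onAllDataRead()
            {
                resp.setStatus(200);
                async.complete();
            }

            public void onError(Throwable t)
            {
                async.complete();
            }
        });
    }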
Note that with the chat server application we are using for the benchmark,
there is no actual server side application. All the semantics needed for a
chat room app can be done client side using the core parts of CometD on the
server. In reality that would never be the case: real business logic
is also implemented server side, and that is where I think Loom can bring
simplicity.
> The experiment with call stacks with 1000+ frames does demonstrate that
> virtual threads can run existing code but it does beg the question as to
> whether some of these bloated libraries are important when working in
> the small. The same goes for thread locals that are caching large graphs
> of objects (I assume the "latency statistics" TL that you mention must
> be large). There is exploratory work under way to provide alternative
> solutions for some of today's uses of TLs and the _nonBlocking TL in the
> linked code is an example of an idiom that the scope variables will help
> with. TLs are a very general mechanism and there will be cases where the
> usage needs to be re-examined by the maintainer of the code (Loom will
> have diagnostics options to help identify TL usages in virtual threads).
>
Oh, having provided an application container for over 25 years, I can
definitely tell you that those bloated libraries and stupidly deep call
stacks are a thing, regardless of the importance or need for them. There
is just a certain class of development process for which no problem cannot
be solved by adding a few more dependencies. Needless bloat is a thing
that is no more going to be solved by Loom than it was by async or
reactive or dependency injection or any other buzzword coding style. The bloat
examples just show that Loom is not a silver bullet and cannot magic away
deep stacks.... but then nothing can. It's not a criticism of Loom, just
an observation of reality... i.e. a bit of temperance on the "just spawn
another thread" meme.
> Using thread pools to limit concurrency is a discussion in itself. It
> works for the JDBC example because it is limited to 100 connections in
> the test but it may not be the right level of granularity for other
> usages, esp. fan-out to other services where the number of network
> connections can significantly outnumber any limit you put on the number
> of threads. Also thread pools are not without their issues. The cost of
> starting threads that you discuss is one, but there is also the issue of
> TLs outliving tasks, and the subtle issue of trying to cancel/interrupt at
> around the time that a task completes and a thread is borrowed for
> another task.
>
Again, I'm not saying that thread pools are silver bullets that solve all
problems. I'm just saying that there are other reasons for using them
beyond kernel threads being slow to start. In the right
circumstance, a thread pool (or an executor that limits concurrent
executions) is a good abstraction. My stress on them is a reaction to
the "forget about thread pools" meme, which I think would be better if it
sounded a little less like a shiny projectile.
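To make the distinction concrete, the concurrency limit can live in either
place; very roughly something like this (runQuery() is a hypothetical
placeholder, and the virtual thread API spelling has shifted between Loom
builds):

    // (a) A classic bounded pool: the pool is both the execution resource
    //     and the concurrency limit (e.g. at most 100 concurrent JDBC calls).
    ExecutorService pool = Executors.newFixedThreadPool(100);
    pool.execute(() -> runQuery());

    // (b) Thread-per-task with virtual threads: the limit has to be expressed
    //     separately, e.g. with a semaphore guarding the scarce resource.
    Semaphore jdbcPermits = new Semaphore(100);
    Thread.startVirtualThread(() ->
    {
        jdbcPermits.acquireUninterruptibly();
        try
        {
            runQuery();
        }
        finally
        {
            jdbcPermits.release();
        }
    });

Either way the limit on the scarce resource is explicit; the bounded pool
just happens to bundle it with the threads themselves.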
As a tangent, I am intrigued by some of the claims from Loom regarding
cancellation and interrupt. The ability to cancel tasks or impose
deadlines is definitely a very nice feature to have, but I can't see how
cancelling/interrupting a virtual thread is any easier than
cancelling/interrupting a kernel thread. You still have all the finally
blocks to unwind and applications that catch and ignore ThreadDeath
exceptions. It didn't work for kernel threads, so I can't see what is
different for virtual threads.... but I REALLY hope I'm wrong and that Loom
has discovered a safe way to cancel a task.
> Can you expand a bit more on the "eat what you kill scheduling" issue? I
> can't quite tell if you ran into an issue or not. For now at least, if a
> virtual thread invokes a selection operation then it will pin the
> carrier thread (this is due to the way that selection operations are
> specified rather than anything else) but it does use the FJP
> managedBlocker mechanism to increase parallelism during the operation. I
> can't quite tell if the "postponed indefinitely" is a reference to the
> first thread that invokes the selection operation (before consuming one
> selection key) or followers that consume the selected keys not handled
> by the first thread.
>
When Jetty implemented HTTP/2 we discovered some forms of both deadlock and
livelock on high performance, high volume systems. These problems also
exist in HTTP/1 to some extent, but HTTP/2 exacerbated them by moving flow
control into user space. Let me see if I can describe a simple example in
HTTP/1 land for a server that is totally implemented with virtual threads -
ones that block reading from a connection and then handle the request that
was read. Let's say we've got 1000 active virtual threads being serviced by
16 kernel threads.
If an application has lots of requests waiting for some special message
to arrive, but uses synchronized/wait/notify for that waiting, then
that's not supported by Loom, so you'd only need 16 requests to the
application to wait like that before all 16 of your kernel threads are
blocked in something Loom doesn't support. So even though the special
message arrives that would unblock them all AND there is a virtual thread
blocked in a read for that message, there is no kernel thread available to
progress that read, so everybody remains blocked, the message is never
received and the server is deadlocked. So sure, the application
should not use synchronized if it is using Loom... but containers are not in
control of how their applications are written. Even applications are not
in control of what other applications might be deployed on the same
server.
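A toy illustration of the shape of that deadlock (not Jetty code; behaviour
as in the current Loom builds, where blocking while holding a monitor pins
the carrier thread):

    Object lock = new Object();
    Runnable waiter = () ->
    {
        synchronized (lock)            // monitor held: the carrier thread is pinned
        {
            try
            {
                lock.wait();           // blocks the kernel (carrier) thread as well
            }
            catch (InterruptedException ignored)
            {
            }
        }
    };

    // With 16 carrier threads, 16 of these waiters are enough...
    for (int i = 0; i < 16; i++)
        Thread.startVirtualThread(waiter);

    // ...so the virtual thread blocked reading the "special message" can never
    // be scheduled: its data arrives, but there is no free carrier to run it.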
However, I assume that synchronized will ultimately be supported, so that
particular deadlock will not be a problem. But even then, without preemption,
you'd still only need 16 CPU-bound threads in one application to take all
the available kernel threads and deny (or at least delay) vital messages being
read for all other virtual threads, even those in different applications.
The lack of preemption also dramatically increases the potential for
livelock, as fewer tasks can dominate.
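The CPU-bound variant of the same problem, again just a toy sketch rather
than anything from Jetty:

    // 16 virtual threads doing pure computation never reach a blocking call,
    // so they are never unmounted; with 16 carriers, nothing else runs until
    // they finish, however many runnable virtual threads are waiting.
    for (int i = 0; i < 16; i++)
    {
        Thread.startVirtualThread(() ->
        {
            long acc = 0;
            for (long n = 0; n < 10_000_000_000L; n++)   // no yield points
                acc += n;
            System.out.println(acc);
        });
    }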
Jetty has put a lot of effort into our scheduler so that we can identify
high priority tasks like unblocking IO-blocked threads or parsing and
handling HTTP/2 flow control. But it is not sufficient just to have these
things done in their own kernel thread and everything else done in a
virtual thread, because by the time you discover what kind of task it is -
i.e. flow control vs a new request - it can be too late to efficiently hand
it off to another thread of any kind: your CPU cache is now hot with
that request, and any hand-off carries a high risk of the task being
handled by a CPU with a cold cache (see
https://webtide.com/avoiding-parallel-slowdown-in-jetty-9/).
It may well be that eventually virtual threads improve, and that we think of
new techniques, so that a server like Jetty could be implemented entirely in
virtual threads. However, I think we are a long way from that today.
In short, I think that adding yet another level of "virtual" between tasks
and CPU cores just makes it even harder to do the mechanically sympathetic
optimizations that are needed for high performance servers. But then I'm
not sure the innards of a server like Jetty are the primary target for Loom
anyway; it is more targeted at allowing business logic to be simply
written and to take advantage of all the smarts in the server without the
complexity of async APIs.
> The approach to off-load from the Jetty core threads to virtual threads
> that execute the user's code seems reasonable, at least from a distance.
> It will at least be expedient as you already have an async core that you
> want to keep. Helidon MP have something similar where it can run in a
> mode that creates a virtual thread to execute the user/application code.
> The important thing is that the application isn't forced into using
> callbacks or splitting up code into stages in order to avoid blocking
> operations.
>
Indeed. Currently any application that wants to scale to large levels of
concurrency has to: a) go async; b) manage a lot more memory and other
resource limitations; c) fix all the bugs/races in their async
implementation; d) discover new resources they didn't even know they were
using that need to be managed; e) refix all the bugs/races in their async
code; f) be surprised when 5 years later a new async bug/race is discovered!
Even if Loom only ever applies to the application, getting rid of a, c, e &
f is a good thing.
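For what it's worth, the off-load shape is roughly this (hypothetical names
like Request, Response, jdbcLookup and remoteCall, not the actual Jetty or
Helidon code): the async core stays as it is, and the user's code is
dispatched onto a virtual thread so it can simply block.

    void handle(Request request, Response response)
    {
        // server internals stay async/event driven; only the application
        // code runs on a virtual thread and is free to block
        Thread.startVirtualThread(() ->
        {
            String user = jdbcLookup(request);   // may block on JDBC
            String page = remoteCall(user);      // may block on an HTTP client
            response.send(page);                 // hypothetical response API
        });
    }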
cheers
--
Greg Wilkins <gregw at webtide.com> CTO http://webtide.com