Experience using virtual threads in EA 23-loom+4-102
Viktor Klang
viktor.klang at oracle.com
Tue Jun 25 09:31:05 UTC 2024
It's important to remember that all of this is capped by Little's Law, queuing theory, and Amdahl's Law (or rather, the Universal Scalability Law). There's always going to be a tension between prioritizing finishing already-started work and starting to process new work, and the less information there is about what the work entails, the less sophistication can be applied to achieve optimality.
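As a rough illustration using Matthew's numbers below: Little's Law says L = λ × W, so sustaining L = 1200 concurrent requests at λ ≈ 220K requests/s implies an average time-in-system of W ≈ 1200 / 220,000 s ≈ 5.5 ms. Any queuing ahead of the bottleneck raises W at the same λ, and that is exactly what shows up as tail latency.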
As with all scalability testing, it is important to be able to control tail latencies (i.e. to avoid optimizing for p50 at the expense of p90), which is only possible by either doing less work (load shedding, graceful degradation, or equivalent) or diverting work (dynamic scaling, nearest-gateway routing, or equivalent).
At the end of the day, optimization tends to improve efficiency, but there is always a bottleneck that defines the ceiling for what is currently possible. Identifying that bottleneck (remember that it can move, depending on where the most pressure is applied) tends to yield the best information about where to spend the most effort.
Cheers,
√
Viktor Klang
Software Architect, Java Platform Group
Oracle
________________________________
From: loom-dev <loom-dev-retn at openjdk.org> on behalf of Robert Engels <robaho at icloud.com>
Sent: Monday, 24 June 2024 22:12
To: Matthew Swift <matthew.swift at gmail.com>
Cc: loom-dev at openjdk.org <loom-dev at openjdk.org>
Subject: Re: Experience using virtual threads in EA 23-loom+4-102
I still think it might be helpful to use virtual threads for all connections (simplicity!) - but when you need to perform CPU-intensive work like hashing, submit a callable/future to a "CPU-only" executor with a capped number of platform threads and join(). It should be a trivial refactor of the code.
The problem with using VT for everything is that a VT is not time-sliced, so CPU-bound work could quickly consume all of the carrier threads, leaving no progress on the IO (fan-out) requests - which is especially bad if they are simply calling out to other servers (less bad if doing lots of local disk IO).
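A minimal sketch of that split, assuming hypothetical names (hashPassword, handleRequest) and an illustrative pool size - not a prescribed design:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SplitExecutors {
        // Cap CPU-bound work at the number of available cores so it
        // cannot monopolize the virtual-thread carriers.
        private static final ExecutorService CPU_POOL =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        static byte[] hashPassword(char[] password) {
            return new byte[0]; // stand-in for an expensive PBKDF2/Argon2 computation
        }

        // Runs on a virtual thread, one per connection.
        static void handleRequest(char[] password) throws Exception {
            Future<byte[]> hash = CPU_POOL.submit(() -> hashPassword(password));
            byte[] result = hash.get(); // joining parks the virtual thread,
                                        // freeing its carrier for other work
            // ... continue with IO-bound work on the virtual thread ...
        }
    }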
On Jun 24, 2024, at 12:05 PM, Matthew Swift <matthew.swift at gmail.com> wrote:
Thanks Robert.
The main issue we face with our application is that the client load can vary substantially over time. For example, we might experience a lot of CPU-intensive authentication traffic (e.g. PBKDF2 hashing) in the morning, but a lot of IO-bound traffic at other times. It's hard to find the ideal number of worker threads: many threads work well for IO-bound traffic, as you say, but sacrifice performance when the load is more CPU-bound. On my 10-core (20 hyperthreads) laptop, I observe nearly a 15-20% drop in throughput when subjecting the server to 1200 concurrent CPU-bound requests, but a much smaller drop when using virtual threads:
* 10 platform threads: ~260K requests/s (note: this is too few threads for more IO-bound traffic)
* 40 platform threads: ~220K requests/s
* 1200 platform threads: ~220K requests/s (note: this is the equivalent of one platform thread per request)
* virtual threads: ~252K requests/s (note: the FJ pool defaults to 20 carrier threads on my laptop - I didn't try disabling hyperthreading)
I find the "one size fits all" provided by virtual threads to be much easier for developers and our users alike. I don't have to worry about complex architectures involving split thread pools (one for CPU, one for IO), etc. We also have to deal with slow misbehaving clients, which has meant use of async IO and hard to debug call-back hell :-) All of this goes away with virtual threads as it will allow us to use simpler blocking network IO and a simple one thread per request design that is much more tolerant to heterogeneous traffic patterns.
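For illustration, a minimal sketch of that one-thread-per-request design over plain blocking sockets (the port and structure are made up; real code would add protocol handling and error reporting):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThreadPerRequestServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(8080);
                 ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                while (true) {
                    Socket socket = server.accept();
                    // One cheap virtual thread per connection; blocking
                    // reads/writes park the thread, not a carrier.
                    executor.submit(() -> handle(socket));
                }
            }
        }

        static void handle(Socket socket) {
            try (socket) {
                // blocking request/response handling goes here
            } catch (IOException e) {
                // log and drop the connection
            }
        }
    }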
It also opens up the possibility of future enhancements that would definitely shine with virtual threads, as you suggest. For example, modern hashing algorithms such as Argon2 can take hundreds of milliseconds of computation, which is simply too costly to scale horizontally in the data layer. We want to offload this to an external elastic compute service, but with response times this high we could very quickly have thousands of blocked platform threads.
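A hedged sketch of what that offload might look like - the endpoint and request shape are hypothetical, the point being that thousands of in-flight calls cost thousands of parked virtual threads rather than blocked platform threads:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RemoteHashing {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        // Called on a virtual thread; send() blocks, but blocking only
        // parks the virtual thread - no platform thread is held.
        static String hashRemotely(String password) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("https://hashing.example.com/argon2")) // hypothetical service
                    .POST(HttpRequest.BodyPublishers.ofString(password))
                    .build();
            return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
        }
    }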
Cheers,
Matt
On Fri, 21 Jun 2024 at 19:29, robert engels <rengels at ix.netcom.com> wrote:
Hi,
Just an FYI: until you get into the order of 1k, 10k, etc. concurrent clients, I would expect platform threads to outperform virtual threads by quite a bit (best case, perform the same). Modern OSes routinely handle thousands of active threads. (My OS X desktop with 4 true cores has nearly 5k threads running.)
Also, if you can saturate your CPUs or local IO bus, adding more threads isn't going to help. Virtual threads shine when the request handler is fanning out to multiple remote services, as in the sketch below.
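A minimal fan-out sketch, assuming hypothetical callServiceA/callServiceB helpers that each block on a remote service:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class FanOut {
        static String callServiceA() { return "a"; } // hypothetical remote call
        static String callServiceB() { return "b"; } // hypothetical remote call

        static List<String> handle() throws Exception {
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                // invokeAll blocks until every subtask completes; each
                // subtask parks its own virtual thread while waiting.
                List<Future<String>> futures = executor.invokeAll(
                        List.<Callable<String>>of(FanOut::callServiceA, FanOut::callServiceB));
                List<String> results = new ArrayList<>();
                for (Future<String> f : futures) {
                    results.add(f.get());
                }
                return results;
            }
        }
    }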
Regards,
Robert