RFR: Support yield on virtual thread on EPollSelector select() [v2]

Alan Bateman alanb at openjdk.java.net
Thu Apr 28 05:19:59 UTC 2022


On Thu, 28 Apr 2022 02:55:41 GMT, joeyleeeeeee97 <duke at openjdk.java.net> wrote:

>> Proposal: support selector poll in loom
>> Hi, supporting selector poll in loom could enhance compatibility and improve performance, based on my experiments with Spring benchmarks.
>> Many frameworks use reactor-based networking. For their blocking `Selector.select()` calls, loom's current implementation blocks the current ForkJoin worker until the call returns. Virtual threads currently can't yield here for two reasons: first, Selector uses synchronized; second, epoll_wait must enter the kernel. By replacing synchronized with Java locks and delegating blocking `Selector.select()` calls to background threads, this patch supports yielding on `EPollSelector`'s `epoll_wait`. (Similar solutions for coroutine `Selector.select()` have been widely verified in production at Alibaba.)
>> Hopefully my benchmarks can be easily reproduced and modified for verification. For Spring, I had to hack Thread.start() to start the bundled workers as virtual threads.
>> 
>> In summary, my observation is that supporting yield in `Selector.select()` on virtual threads could boost performance when
>> 1. Selector threads (in charge of Selector.select()) are likely to block
>> 2. there are a large number of selector threads
>> 
>> Numbers:
>> What does each column mean?
>> ● spring(no suffix) -> the default platform thread mode
>> ● spring-virtualbiz -> all executor threads are converted to virtual threads
>> ● spring-virtual -> all IO threads and executor threads are converted to virtual threads
>> ● spring-virtualoptimized -> all IO threads and executor threads are converted to virtual threads and enable this patch
>> 
>> 
>> Spring: 
>> 
>> Configuration:
>> 1. The number of Selector threads equals the number of CPUs; each request's handler is submitted to an executor pool.
>> 2. Using "wrk" to add pressure, measuring latencyAvg and totalRequests on each run.
>> 3. Database-related workloads were skipped for Spring, since they might currently introduce frequent pinning due to synchronized blocks.
>> 
>> Summary: 
>> Converting all executor threads to virtual threads gains more performance than the default, but degradation happens when all threads including IO threads are converted to virtual threads. 
>> 
>> For total requests (throughput) in a fixed time, this patch brings **3-7%** improvement on plaintext and **6-14%** improvement on JSON. Average latency improves by **4-8%** on plaintext and about 15% on JSON.
>> 
>> 1. We are measuring requests and latency at the same time, so for each run latency and throughput both improved.
>> 2. The percentages are computed by comparing the optimized version with the current best mode (virtualBiz).
>> 
>> +------------------------------------------------------------------------------------+
>> |                       Type: plaintext, Result: totalRequests                       |
>> +----------+---------+-------------------+-----------------+-------------------------+
>> | pipeline |  spring | spring-virtualbiz | spring-virtual  | spring-virtualoptimized |
>> +----------+---------+-------------------+-----------------+-------------------------+
>> |    4     | 1105401 |      1146776      |     1219481     |         1233398         |
>> |    8     | 1304546 |      1313621      |     1234331     |         1406274         |
>> |    16    | 1465894 |      1474328      |     1245648     |         1520790         |
>> +----------+---------+-------------------+-----------------+-------------------------+
>> +--------------------------------------------------------------------------------------------+
>> |                             Type: json, Result: totalRequests                              |
>> +------------------+---------+-------------------+-----------------+-------------------------+
>> | concurrencyLevel |  spring | spring-virtualbiz | spring-virtual  | spring-virtualoptimized |
>> +------------------+---------+-------------------+-----------------+-------------------------+
>> |        4         |  592597 |       655380      |      698312     |          622674         |
>> |        8         |  906588 |      1014115      |     1060094     |         1016332         |
>> |        16        | 1051602 |      1050818      |     1079971     |         1140643         |
>> |        32        | 1116081 |      1088898      |     1109657     |         1280262         |
>> |        64        | 1184019 |      1148817      |     1096397     |         1360887         |
>> |       128        | 1293541 |      1358471      |     1119223     |         1440368         |
>> +------------------+---------+-------------------+-----------------+-------------------------+
>> 
>> 
>> 
>> How to reproduce:
>> Environment
>> Two 4C8G ECS machines in the cloud: one runs the Java application while the other runs the wrk load client.
>> Based on the [upstream](https://github.com/TechEmpower/FrameworkBenchmarks) FrameworkBenchmarks, I did some work to adapt it to loom. On [the loom](https://github.com/joeyleeeeeee97/FrameworkBenchmarks/) branch, a benchmark run can now be set up easily to verify this.
>> ● Quickstart (runs on a docker-machine)
>> 
>> 
>> git clone https://github.com/joeyleeeeeee97/FrameworkBenchmarks.git
>> cd FrameworkBenchmarks
>> ./tfb --test spring-optimized --duration 60
>> 
>> 
>> Notice
>> ● If there are socket errors in wrk, the current stress level may be too high
>> ● I was using separate machines as client and server following this [guide](https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Benchmarking-Getting-Started)
>> ● Use configuration to adjust test duration and concurrency level
>> --concurrency-levels=[4, 8, 16, 32, 64, 128] --pipeline-concurrency-levels=[128, 256, 512, 1024] --duration 30
>> Modifications
>> ● Most of my modifications are open now except for the process of building docker image and the hacking part to run spring on loom
>> 
>> Tests
>> ● jtreg java/nio/channels/Selector/ to make sure refactoring Selector synchronized didn't break tests
>> ● jtreg test/jdk/java/lang/Thread/virtual/Selectors.java to verify Selector behavior on virtual threads
>> 
>> 
>>     @Test
>>     public void testSelectorMounted() throws Exception {
>>         var selectorThread =  Thread.ofVirtual().start(() -> {
>>             try {
>>                 Selector selector = Selector.open();
>>                 selector.select();
>>             } catch (Exception ignored) {
>>             }
>>         });
>>         Thread.sleep(200);
>>         // With jdk.useRecursivePoll=true the virtual thread is expected to be WAITING in select();
>>         // otherwise it is expected to stay RUNNABLE.
>>         assertEquals(selectorThread.getState(),
>>                 (Boolean.parseBoolean(System.getProperty("jdk.useRecursivePoll")) ? Thread.State.WAITING : Thread.State.RUNNABLE));
>>         selectorThread.interrupt();
>>         selectorThread.join();
>>     }
>
> joeyleeeeeee97 has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
> 
>   Support yield on virtual thread on EPollSelector select()

Selection operations are specified to synchronize on the Selector, so we can't change the locking without incompatible changes to the specification. So yes, a selection operation on a virtual thread pins the carrier. This can be re-examined once the restriction on Java monitors is addressed.
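
As a rough illustration of why the locking is observable to applications, the following sketch holds the Selector's monitor on one thread while a second platform thread calls `selectNow()`. The class and method names (`SelectorMonitorDemo`, `stateWhileMonitorHeld`) are hypothetical and are not part of the PR or the JDK; only standard `java.nio.channels.Selector` and `Thread` calls are used. Code that relies on this behavior could break if selection operations stopped synchronizing on the Selector.

```java
import java.io.IOException;
import java.nio.channels.Selector;

class SelectorMonitorDemo {

    /**
     * Starts a thread that calls selectNow() while the caller holds the
     * Selector's monitor, and returns the state observed for that thread.
     * (Hypothetical helper, for illustration only.)
     */
    public static Thread.State stateWhileMonitorHeld() throws IOException, InterruptedException {
        try (Selector selector = Selector.open()) {
            Thread t = new Thread(() -> {
                try {
                    // Specified to synchronize on the selector first.
                    selector.selectNow();
                } catch (IOException ignored) {
                    // Not relevant to this illustration.
                }
            });
            Thread.State observed;
            synchronized (selector) {
                t.start();
                // Poll for up to 5 seconds until the thread is blocked on the monitor.
                long deadline = System.nanoTime() + 5_000_000_000L;
                while (t.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                    Thread.sleep(10);
                }
                observed = t.getState();
            }
            // Monitor released: selectNow() can now proceed and return.
            t.join();
            return observed;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stateWhileMonitorHeld());
    }
}
```

On an implementation that follows the specified locking, the second thread should be reported as `BLOCKED` until the monitor is released. Replacing the monitor with a `java.util.concurrent` lock inside the Selector would change this outcome, which is the kind of incompatibility referred to above.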

-------------

PR: https://git.openjdk.java.net/loom/pull/166

