[External] : Re: Custom scheduler: Using loom for deterministic simulation testing

Tue Jul 8 08:12:01 UTC 2025

Race detectors are great but struggle to find problems that span multiple services in a distributed system.

FoundationDB and Antithesis do seem to have had success with deterministic testing, although the way Antithesis works doesn't require language runtime support (I think they use a custom hypervisor). FDB used a coroutine framework to do it, and supposedly they did find plenty of bugs in sim.

The big question is not whether these techniques find bugs (they do), but to what extent they should be implemented at the language runtime level versus the operating system level. The latter can flush out more bugs, but the former can be easier to deploy and might have better observability.
________________________________
From: Robert Engels <robaho at me.com>
Sent: Monday, July 7, 2025 9:16 PM
To: Mike Hearn <michael.hearn at oracle.com>
Cc: Ryan Yeats <ryeats at gmail.com>; loom-dev at openjdk.org <loom-dev at openjdk.org>; aoli at cs.cmu.edu <aoli at cs.cmu.edu>; quinn.klassen at temporal.io <quinn.klassen at temporal.io>
Subject: [External] : Re: Custom scheduler: Using loom for deterministic simulation testing

If you follow the “JDK bug” you’ll see it’s not a bug…

Personally I think the whole idea of synchronous testing of asynchronous systems is a fool’s errand… but nevertheless, I think something like the race detector in Go would be far more useful than a sync testing framework.

On Jul 7, 2025, at 1:37 PM, Mike Hearn <michael.hearn at oracle.com> wrote:

Ryan, what I'd suggest is just clicking fork on GitHub and making a JDK spin with the changes needed. If you've never done it before, building a fork of the JDK is much easier than you'd expect. It's an easy codebase to work on, especially if the only changes you need are on the Java side. Then you can just directly alter the visibility of the various things needed, figure out what a good API looks like and submit a PR. Even if it's never merged, it would significantly refine the discussion around what's required.

For example, rather than patching the different time APIs with an agent it might make sense to load an InstantSource using an SPI and modify the implementations to route all time queries to that. A lot of software is sensitive to how fast System.nanoTime or currentTimeMillis executes, so that needs a bit of thought (maybe the compiler can be forced to always inline through the default implementation if no SPI is in use).
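
A minimal sketch of the InstantSource idea, under stated assumptions: SimulatedTime and lookup() are hypothetical names, and the JDK does not currently route System.nanoTime or currentTimeMillis through any such provider. It only shows what a test-controlled time source and a ServiceLoader-based lookup could look like.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.InstantSource;
import java.util.ServiceLoader;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical test clock: time only moves when the test advances it. */
final class SimulatedTime implements InstantSource {
    private final AtomicReference<Instant> now;

    SimulatedTime(Instant start) {
        this.now = new AtomicReference<>(start);
    }

    @Override
    public Instant instant() {
        return now.get();
    }

    /** Moves the simulated clock forward by the given amount. */
    void advance(Duration step) {
        now.updateAndGet(t -> t.plus(step));
    }

    /**
     * Hypothetical lookup: use the first InstantSource registered as a
     * service provider, else fall back to the real system clock.
     * This is an assumption for illustration; the JDK does not do this today.
     */
    static InstantSource lookup() {
        return ServiceLoader.load(InstantSource.class)
                .findFirst()
                .orElseGet(InstantSource::system);
    }
}
```

A test would construct SimulatedTime directly and call advance() between steps, while production code would see the system clock through lookup() when no provider is registered.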

For deterministic IO, I don't think you want to operate at the level of sockets or file handles. That's too low level to make sense to developers in most cases. If you want to control when a reply comes back in from a database it'd be better to just use the DI framework to intercept the DataSource, from there intercept Connection, and then use regular synchronization tools to choose when the ResultSet is released back to the thread. If you try and do it at the level of individual reads/writes on a socket the test code will end up over-fitted to particular (undocumented) network protocols.
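
As a rough illustration of "use regular synchronization tools to choose when the result is released", here is a sketch with a hypothetical GatedSupplier. It stands in for whatever wrapper the DI framework would place around the DataSource/Connection; the JDBC types are deliberately not modelled, and the name and shape are assumptions rather than part of any existing framework.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

/**
 * Hypothetical wrapper: the real call runs immediately, but its result is
 * only handed back to the calling thread once the test opens the gate.
 */
final class GatedSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final CountDownLatch gate = new CountDownLatch(1);

    GatedSupplier(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T get() {
        T result = delegate.get();   // e.g. the real query executes here
        try {
            gate.await();            // block until the test decides to release
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while gated", e);
        }
        return result;
    }

    /** Called by the test to let the pending (or any future) result through. */
    void release() {
        gate.countDown();
    }
}
```

The test decides the interleaving by choosing when to call release(), without knowing anything about the wire protocol underneath.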

I haven't done that in my own deterministic testing framework because in most cases database access is fully synchronous and the concurrency problems arise inside the database itself. It might be interesting to play with H2 modifications to expose those.

________________________________
From: loom-dev <loom-dev-retn at openjdk.org> on behalf of Ryan Yeats <ryeats at gmail.com>
Sent: Saturday, July 5, 2025 10:25 PM
To: loom-dev at openjdk.org <loom-dev at openjdk.org>
Cc: aoli at cs.cmu.edu <aoli at cs.cmu.edu>; quinn.klassen at temporal.io <quinn.klassen at temporal.io>
Subject: Custom scheduler: Using loom for deterministic simulation testing

Hi,
I have also been looking into deterministic simulation on the side and belatedly saw some interest in it on the loom-dev mailing list in the "Custom scheduler: Customize current time and timed waits" thread. I wanted to briefly expand on the topic of using Loom to make deterministic simulation testing more easily available to the Java community.

BLUF: Deterministic simulation testing makes multithreaded logic single-threaded and fuzz tests the execution order, greatly simplifying the discovery of race conditions and deadlocks. See this blog post by Ao Li<https://aoli.al/blogs/jdk-bug/> for an example of doing this in Java, and the paper behind it, Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM<https://arxiv.org/abs/2501.12618>.

The above explanation omits how hard it is to actually make Java, or any language, deterministic. This is where James Baker had a clever idea<https://jbaker.io/2022/05/09/project-loom-for-distributed-systems/>: use Loom. If we replace the virtual thread scheduler with a deterministic one, Loom does most of the work for us without any extra frameworks. This works brilliantly; however, there are three main caveats:

  1.  It's not easy to replace the virtual thread scheduler.
  2.  The aforementioned scheduler has no understanding of delays, so any Thread.sleep or Object.wait introduces continuations back into the scheduled execution pool non-deterministically. This affects the execution order, so bugs can't always be reproduced.
  3.  IO also introduces continuations back to the scheduler execution pool non-deterministically.
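
To make the core idea concrete, here is a minimal sketch of a deterministic scheduler, assuming nothing about Loom internals: SeededScheduler is a hypothetical Executor that queues tasks and runs them on the calling thread in an order chosen by a seeded Random. Wiring such an executor in as the virtual thread scheduler is exactly the part that has no public API, so the sketch does not attempt it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executor;

/** Hypothetical sketch: single-threaded executor whose run order depends only on the seed. */
final class SeededScheduler implements Executor {
    private final List<Runnable> ready = new ArrayList<>();
    private final Random random;

    SeededScheduler(long seed) {
        this.random = new Random(seed);
    }

    @Override
    public void execute(Runnable task) {
        ready.add(task);   // never runs inline; only queues
    }

    /** Runs queued tasks one at a time, picking the next one pseudo-randomly. */
    void runUntilIdle() {
        while (!ready.isEmpty()) {
            Runnable next = ready.remove(random.nextInt(ready.size()));
            next.run();    // a task may call execute() and enqueue more work
        }
    }

    /** Returns the order in which three labelled tasks ran for the given seed. */
    static List<String> demo(long seed) {
        SeededScheduler scheduler = new SeededScheduler(seed);
        List<String> trace = new ArrayList<>();
        for (String name : List.of("A", "B", "C")) {
            scheduler.execute(() -> trace.add(name));
        }
        scheduler.runUntilIdle();
        return trace;
    }
}
```

Running demo with the same seed twice yields the same trace, which is what makes a failing interleaving reproducible; varying the seed is the fuzzing step. Caveats 2 and 3 are the cases where work re-enters the queue from outside this loop, so the order is no longer a function of the seed alone.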

The first issue can currently be solved by using reflection to make the virtual thread constructor public<https://github.com/ryeats/loom-dst/blob/main/src/main/java/org/example/SchedulableVirtualThreadFactory.java>. The second issue can currently be solved by controlling system time, replacing the bytecode for all calls to System.nanoTime(), System.currentTimeMillis() and Instant.now() using an agent at runtime<https://github.com/cmu-pasta/fray/blob/main/instrumentation/base/src/main/kotlin/org/pastalab/fray/instrumentation/base/visitors/TimeInstrumenter.kt>. The third likely has no general solution, but I am interested in hearing ideas.

Even if there is no solution to IO non-determinism and developers have to stub out all IO, deterministic simulation is still an incredibly promising tool for making concurrent and distributed systems programming much simpler and safer. Because of this use case, I think there would be a lot of benefit if the Loom API allowed instrumenting the scheduler, and even more if we could access the DELAYED_TASK_SCHEDULERS, which understand delays.

Thank you for your time. I have been super excited to use virtual threads and am amazed by the ingenuity that brought them to Java.

Ryan
