RFR: 8280468: Crashes in getConfigColormap, getConfigVisualId, XVisualIDFromVisual on Linux

Maxim Kartashev duke at openjdk.java.net
Tue Feb 8 05:48:04 UTC 2022


On Fri, 21 Jan 2022 17:02:38 GMT, Maxim Kartashev <duke at openjdk.java.net> wrote:

> These crashes were not reproducible, so the fix is based on a hypothesis that there are two possible reasons for them:
> 1. `makeDefaultConfig()` returning `NULL`.
> 2. A race condition when the number of screens changes.
> The race scenario: `X11GraphisDevice.makeDefaultConfiguration()` is called on EDT so the call can race with `X11GraphisDevice.invalidate()` that re-sets the screen number of the device; the latter is invoked on the `AWT-XAWT` thread from `X11GraphicsEnvironment.rebuildDevices()`. So by the time `makeDefaultConfiguration()` makes a native call with the device's current screen number, the `x11Screens` array maintained by the native code could have become shorter. And the native methods like `Java_sun_awt_X11GraphicsDevice_getConfigColormap()` assume that the number passed to them is always current and valid. The AWT lock only protects against the changes during the native methods invocation and does not protect against them being called with an outdated screen number. With a larger screen number, those methods read past the end of the `x11Screens` array.
> 
> The fix for (1) is to die gracefully instead of crashing in an attempt to de-reference a `NULL` pointer, which might happen upon returning from `makeDefaultConfig()` in `getAllConfigs()`.
> 
> The fix for (2) is to eliminate the race by protecting `X11GraphisDevice.screen` with the AWT lock such that it doesn't change when the native methods working with it are active.
> 
> We've been shipping JetBrains Runtime with this fix for a few months now and there were no crash reports with those specific patterns against the versions with the fix.

What's preventing me from going down the path of returning some default instead of locking are these two considerations:
1. By returning a default value when the screen number "suddenly" goes out range, the function will essentially lie. And it's going to take quite a while for me to prove that this lie will not have a long-lasting adverse effect.
2. The idea of testing if the fix actually cures the crash suggested by @mrserb will work, but only to prove that one particular hypothesis about the crash is correct. It won't help to prove that the whole process remains in a stable state and continues functioning correctly. But our our users kind of proved that already and I'm very reluctant to replaces what I believe is already working with what will probably work.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7182



More information about the client-libs-dev mailing list