[foreign-memaccess] RFR: 8253063: ScopedAccessError is sometimes thrown spuriously

Fri Sep 11 18:20:46 UTC 2020

On Fri, 11 Sep 2020 17:49:53 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> After running the test for shared segment handshake during unrelated work, I noted that the test exhibits two kinds of
> failure modes:
> 1. First, the test can sometimes fail with a raw ScopedAccessError
> 2. More rarely, the test sometimes even crashes (this only happens in the vectorized mismatch test)
> 
> After some investigation carried out by @fisk, it turns out that (2) was caused by the fact that when we exit from
> deoptimization we don't check for pending async exceptions (while we check for regular ones). This condition was not
> triggered in other tests because, it seems, plain unsafe accesses are effectively atomic once intrinsified - e.g. there
> can be no safepoint between check and access. But the vectorized mismatch routine is a biggie Java routine which gets
> some initial compilation (by C1) - and this is when the bug hit.  The problem (1) was more difficult to pinpoint -
> basically, there are some routines in the VM which are hostile w.r.t. async exceptions - and are marked with the
> special `JRT_ENTRY_NO_ASYNC` entry. One of those routines happens to be the one that throws exceptions, and this
> routine can safepoint too. Now, since the scope memory access can also throw (if the scope is not alive) it is possible
> to end up in a situation where we try to install an async exception while the VM code was already in the middle of
> installing one. We thought we covered this case by checking for *pending exceptions* - but the check we had was too
> weak. After trying different approaches we agreed to address this more explicitly - by having the exception code
> set a flag on the running thread - so that we can precisely determine, when we do the handshake, whether the
> async exception should be installed or not.  With these fixes, the test passes and no more spurious failures can be
> observed (I left it running for a long time ;-)). I've beefed up the test by always having it use max number of
> available processors, and also by lowering the delay time associated with the closing thread. The longer the wait, the
> less frequently transient issues such as (1) can be observed.
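
To make the flag-based approach described above concrete, here is a minimal sketch in plain Java. It is only a model, not the HotSpot change: `ThreadState`, `inExceptionThrow`, `pendingAsync` and `tryInstallAsync` are hypothetical names invented for illustration, and what the closing thread does when installation is skipped is not stated in the description, so the sketch simply reports the outcome.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AsyncInstallSketch {

    /** Hypothetical stand-in for per-thread VM state; not a real HotSpot structure. */
    static final class ThreadState {
        // Set while the thread is inside exception-throwing code
        // (models the "flag on the running thread" from the description).
        final AtomicBoolean inExceptionThrow = new AtomicBoolean(false);
        // Holds the async exception once one has been installed.
        volatile Throwable pendingAsync;
    }

    /**
     * Models the decision taken during the handshake: install the async
     * exception only if the target thread is not already throwing one.
     * Returns true if the exception was installed.
     */
    static boolean tryInstallAsync(ThreadState target, Throwable asyncException) {
        if (target.inExceptionThrow.get()) {
            return false; // target is mid-throw: do not install on top of it
        }
        target.pendingAsync = asyncException;
        return true;
    }

    public static void main(String[] args) {
        ThreadState state = new ThreadState();

        // Case 1: the target is in the middle of throwing, so installation is skipped.
        state.inExceptionThrow.set(true);
        boolean whileThrowing = tryInstallAsync(state, new IllegalStateException("scope closed"));

        // Case 2: the flag has been cleared, so installation goes ahead.
        state.inExceptionThrow.set(false);
        boolean afterThrowing = tryInstallAsync(state, new IllegalStateException("scope closed"));

        System.out.println("installed while throwing: " + whileThrowing);
        System.out.println("installed after throwing: " + afterThrowing);
    }
}
```

The point of the explicit flag, as opposed to checking for a pending exception, is that it covers the window in which the throwing code has started but no exception is yet visible as pending.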

This looks good. Sorry I didn't catch this earlier.

-------------

Marked as reviewed by eosterlund (no project role).

PR: https://git.openjdk.java.net/panama-foreign/pull/323