RFR: 8294003: Don't handle si_addr == 0 && si_code == SI_KERNEL SIGSEGVs

Wed Sep 21 13:59:42 UTC 2022

On Mon, 19 Sep 2022 12:33:44 GMT, Stefan Karlsson <stefank at openjdk.org> wrote:

> We have this code in our signal handler:
> 
> 
> #ifndef AMD64
>     // Halt if SI_KERNEL before more crashes get misdiagnosed as Java bugs
>     // This can happen in any running code (currently more frequently in
>     // interpreter code but has been seen in compiled code)
>     if (sig == SIGSEGV && info->si_addr == 0 && info->si_code == SI_KERNEL) {
>       fatal("An irrecoverable SI_KERNEL SIGSEGV has occurred due "
>             "to unstable signal handling in this distribution.");
>     }
> #endif // AMD64
> 
> 
> This bug added that change:
> https://bugs.openjdk.java.net/browse/JDK-8004124
> 
> In the Generational ZGC we hit the exact same condition whenever we try to (incorrectly) dereference one of our colored pointers. From the bug above:
> 
> "A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL"
> 
> That is, if we have set high-order bits (past the TASK_SIZE limit), we get these kinds of SIGSEGVs.
> 
> As the signal handling code is written today, we don't "stop" this signal, and instead try to handle it as an implicit null check. This causes hard-to-debug error messages and crashes in code that incorrectly tries to deoptimize the faulty code.
> 
> I propose that we short-cut the signal handling code, and let this problematic SIGSEGV get passed to VMError::report_and_die.
> 
> We've been running with this patch in the Generational ZGC repository for over a year, without any problems.

I think x86_32 can and should do the same, because faulting on a bona fide incorrect address currently produces a misleading error, see below. From reading JDK-8015837, JDK-8004124 and related issues, it looks like this code was added for x86_32 to better handle a kernel bug with exec-shield emulation on hardware without the NX bit. But even then, "better handle" seems to mean only crashing with a more precise message.

I think only ancient hardware runs without NX, and most kernels where this bug would otherwise appear are long dead. So I think we should favor faulting with a proper error over saying (potentially misleading) things about "unstable signal handling".

$ lscpu
Model name:                      Intel(R) Atom(TM) CPU Z530   @ 1.60GHz

$ cat /etc/debian_version 
11.5

$ jdk/bin/java -version
openjdk version "20-testing" 2023-03-21
OpenJDK Runtime Environment (build 20-testing-builds.shipilev.net-openjdk-jdk-b210-20220919)
OpenJDK Server VM (build 20-testing-builds.shipilev.net-openjdk-jdk-b210-20220919, mixed mode, sharing)

$ cat Crash.java 
import java.lang.reflect.*;
import sun.misc.Unsafe;

public class Crash {
  public static void main(String... args) throws Exception {
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe u = (Unsafe) f.get(null);
    u.getInt(-1L); // 0xF....F
  }
}

$ jdk/bin/java Crash.java
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (os_linux_x86.cpp:227), pid=1033, tid=1034
#  fatal error: An irrecoverable SI_KERNEL SIGSEGV has occurred due to unstable signal handling in this distribution.
#
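
For reference, the condition under discussion can be isolated into a small predicate. This is only a sketch, not the code in the PR: `is_si_kernel_null_segv`, `Action` and `classify` are hypothetical names made up for illustration, and `Action::ReportAndDie` merely stands in for the path that ends in VMError::report_and_die. It assumes Linux, where `SI_KERNEL` is defined by `<signal.h>`.

```cpp
#include <signal.h>  // siginfo_t, SIGSEGV, SI_KERNEL (Linux-specific)

// Hypothetical helper, not a HotSpot function: true for the SIGSEGV flavor
// discussed in this thread (si_code == SI_KERNEL and si_addr == 0).
bool is_si_kernel_null_segv(int sig, const siginfo_t* info) {
  return sig == SIGSEGV &&
         info != nullptr &&
         info->si_addr == nullptr &&
         info->si_code == SI_KERNEL;
}

// Hypothetical outcome of the early check in a signal handler.
enum class Action {
  TryHandle,    // continue with normal handling (e.g. implicit null checks)
  ReportAndDie  // stands in for passing the signal to VMError::report_and_die
};

// Short-cut: such signals are never treated as implicit null checks.
Action classify(int sig, const siginfo_t* info) {
  if (is_si_kernel_null_segv(sig, info)) {
    return Action::ReportAndDie;
  }
  return Action::TryHandle;
}
```

With a check like this at the top of the handler, a fault such as the `u.getInt(-1L)` above would presumably be reported as an ordinary SIGSEGV crash rather than as "unstable signal handling".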

-------------

PR: https://git.openjdk.org/jdk/pull/10340