Request for review: Potential race condition in GetPrimitiveArrayCritical causing data corruption (JDK 17u / ParallelGC)
Yuming Wang
yumwang at apache.org
Tue Feb 24 06:46:30 UTC 2026
Dear HotSpot GC and Updates Team,
I am writing to request a review of a potential root cause we have
identified for intermittent data corruption issues in our JDK 17 production
environments.
*The Symptom*
We run Apache Spark workloads on JDK 17 using *ParallelGC*. We observe
intermittent java.io.IOException: FAILED_TO_UNCOMPRESS(5) errors coming
from the Snappy native library. The stack traces indicate the failure
occurs during VectorizedColumnReader operations, where
GetPrimitiveArrayCritical is used to access on-heap byte arrays.
*Hypothesis & Analysis*
We suspect a race condition in jni_GetPrimitiveArrayCritical (specifically
in jni.cpp / lock_gc_or_pin_object).
In the non-pinning path (used by G1 and ParallelGC), the code currently
does:
// Current implementation
GCLocker::lock_critical(thread);
return JNIHandles::resolve_non_null(obj);
Our analysis suggests that lock_critical only prevents a new GC from
starting and does not wait for one that is already in progress. If that is
correct, a GC cycle (specifically the compaction phase of ParallelGC) could
move the object after JNIHandles has resolved it but before the critical
section takes effect for the native caller. The native code would then
receive a stale pointer to the object's old memory address.
*Proposed Fix*
We have tested a patch that aligns the object locking mechanism with
jni_GetStringCritical by using a Handle to protect the object pointer:
// Proposed fix
Handle h(thread, JNIHandles::resolve_non_null(obj));
GCLocker::lock_critical(thread);
return h();
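
For readers less familiar with moving collectors, the general hazard can be
illustrated with a toy model. The sketch below is not HotSpot code: ToyHeap,
register_handle, move_object, and simulate_move_between_resolve_and_use are
hypothetical names invented for illustration, and the model has no GCLocker,
safepoints, or threads. It only shows that an address captured before a move
goes stale, while a slot the collector knows about is updated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: none of these types or functions exist in HotSpot.
constexpr int kPoison = -1;  // marks memory the "collector" has vacated

class ToyHeap {
public:
    explicit ToyHeap(std::size_t size) : cells_(size, kPoison) {}

    // Place a one-cell "object" at a fixed address and return that address.
    std::size_t allocate(std::size_t addr, int value) {
        cells_.at(addr) = value;
        return addr;
    }

    // Register a slot that the toy collector updates when it moves an object.
    void register_handle(std::size_t* slot) { handles_.push_back(slot); }

    // Simulated compaction step: move one object, poison its old location,
    // and fix up every registered handle slot that pointed at it.
    void move_object(std::size_t from, std::size_t to) {
        assert(from != to);
        cells_.at(to) = cells_.at(from);
        cells_.at(from) = kPoison;
        for (std::size_t* slot : handles_) {
            if (*slot == from) *slot = to;
        }
    }

    int read(std::size_t addr) const { return cells_.at(addr); }

private:
    std::vector<int> cells_;
    std::vector<std::size_t*> handles_;
};

struct Outcome {
    int via_raw;     // read through the address captured before the move
    int via_handle;  // read through the collector-updated handle slot
};

Outcome simulate_move_between_resolve_and_use() {
    ToyHeap heap(8);
    const std::size_t raw = heap.allocate(2, 42);  // raw address, unknown to the collector
    std::size_t handle = raw;                      // handle slot, initially the same address
    heap.register_handle(&handle);

    heap.move_object(2, 5);                        // the collector relocates the object

    return Outcome{heap.read(raw), heap.read(handle)};
}
```

In this model via_raw reads the poisoned old location, while via_handle
reads the moved value. Whether such a move can actually happen at this point
in jni_GetPrimitiveArrayCritical is the question this message asks reviewers
to confirm.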
*Implementation & Verification*
We have implemented this fix and added a comprehensive test suite that
reproduces the issue. The commit can be reviewed here:
https://github.com/wangyum/jdk17u-dev/commit/bf7f679587683d6035e26af31a61b6b789e5a9bd
The test suite includes a native stress test that specifically triggers GC
within this critical window.
1. Without the fix: We can reproduce data corruption and detect object
address changes during the critical section using ParallelGC.
2. With the fix: The corruption disappears, and the native code
consistently accesses valid data.
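
As an illustration of how corruption can be detected on the native side, the
following is a minimal sketch of a pattern check. It is not taken from the
linked commit: expected_byte, fill_pattern, and first_mismatch are
hypothetical helpers, and the pattern itself is an arbitrary choice. The
idea is that the Java side fills an array with a known pattern and the
native side verifies it through the pointer it was given.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pattern: the byte expected at offset i for a given seed.
inline std::uint8_t expected_byte(std::size_t i, std::uint8_t seed) {
    return static_cast<std::uint8_t>(seed + i * 31u);
}

// Fill a buffer with the pattern (stands in for the Java-side setup).
inline void fill_pattern(std::uint8_t* data, std::size_t len, std::uint8_t seed) {
    assert(data != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        data[i] = expected_byte(i, seed);
    }
}

// Return the offset of the first byte that does not match the pattern,
// or len if the whole buffer is intact.
inline std::size_t first_mismatch(const std::uint8_t* data, std::size_t len,
                                  std::uint8_t seed) {
    assert(data != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] != expected_byte(i, seed)) return i;
    }
    return len;
}
```

A native stress test could call first_mismatch on the pointer returned by
GetPrimitiveArrayCritical and report any offset smaller than the array
length as corruption.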
*Request*
Could a VM expert please review this analysis and the proposed fix? We
would like to confirm:
1. Is our understanding of the lock_critical vs. object movement race
correct for ParallelGC?
2. Is the proposed Handle-based fix safe and appropriate for inclusion in
JDK 17u?
Thank you for your time and guidance.
More information about the jdk-updates-dev
mailing list