GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

Sun Mar 4 19:24:54 UTC 2018

Hi,

we've been encountering poor performance with -Xcheck:jni, for the
following example the performance is 140x to 510x slower with the flag
enabled:

>>>>
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;

public final class CheckJniTest {
 static void deflateBytesPerformance() throws IOException {
   byte[] inflated = new byte[1 << 23];
   new Random(71917367).nextBytes(inflated);
   ByteArrayOutputStream deflated = new ByteArrayOutputStream();
   try (DeflaterOutputStream dout = new DeflaterOutputStream(deflated)) {
     dout.write(inflated, 0, inflated.length);
   }
   if (8391174 != deflated.size()) {
     throw new AssertionError();
   }
 }

 public static void main(String args[]) throws IOException {
   int n = 5;
   if (args.length > 0) {
     n = Integer.parseInt(args[0]);
   }
   for (int i = 0; i < n; i++) {
     long startTime = System.currentTimeMillis();
     deflateBytesPerformance();
     long endTime = System.currentTimeMillis();
     System.out.println("Round " + i + " took " + (endTime - startTime) +
                        "ms");
   }
 }
}
<<<<

The issue is in the libzip Deflater.c JNI code:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libzip/Deflater.c#l131

The test has an 8MB array to deflate/compress. The DeflaterOutputStream has
an buffer of size 512 bytes:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/classes/java/util/zip/DeflaterOutputStream.java#l128

To compress the array, 16,384 native calls are made that use the 8MB input
array and the 512 byte output array. These arrays are accessed using
GetPrimitiveArrayCritical that with -Xcheck:jni copies the array:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/hotspot/share/prims/jniCheck.cpp#l1862
The copying of the arrays leads to 128GB of copying which dominates
execution time.

One approach to fix the problem is to rewrite libzip in Java. GNU Classpath
has such an implementation:
http://cvs.savannah.gnu.org/viewvc/classpath/classpath/java/util/zip/Deflater.java?view=markup#l417

A different approach is to use Get/SetByteArrayRegion (using
Get/SetByteArrayElements would be no different than the current approach
accept potentially causing more copies). I've included a patch and
performance data for this approach below where regions of the arrays are
copied onto a 512 byte buffer on the stack. The non -Xcheck:jni performance
is roughly equivalent before and after the patch, the -Xcheck:jni
performance is now similar to the non -Xcheck:jni performance.

The choice to go from a using GetPrimitiveArrayCritical to
GetByteArrayRegion is a controversial one, as engineers have many different
expectations about what critical means and does. GetPrimitiveArrayCritical
may have similar overhead to GetByteArrayElements if primitive arrays
(possibly of a certain size) are known to be non-moving. There may be a
cost to pin critical arrays or regions they exist within. There may be a
global or region lock that is in play that can cause interactions with the
garbage collector - such interactions may cause tail latency issues in
production environments. GetByteArrayRegion loses compared to
GetPrimitiveArrayCritical as it must always copy a region of the array for
access. Given these issues it is hard to develop a benchmark of
GetPrimitiveArrayCritical
vs GetByteArrayRegion that can take into account the GC interactions. Most
benchmarks will see that avoiding a copy can be good for performance.

For more background, Aleksey has a good post on GetPrimitiveArrayCritical
here:
https://shipilev.net/jvm-anatomy-park/9-jni-critical-gclocker/

A different solution to the performance problem presented here is to change
the check JNI implementation to do less copying of arrays. This would be
motivated if GetPrimitiveArrayCritical were expected to be used more widely
than GetByteArrayRegion in performance sensitive, commonly used code. Given
the range of problems possible with GetPrimitiveArrayCritical I'd
personally prefer GetByteArrayRegion to be more commonly used, as I'm yet
to see a performance problem that made GetPrimitiveArrayCritical so
compelling. For example, ObjectOutputStream has burnt me previously:
http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libjava/ObjectOutputStream.c#l69
and these trivial copy operations, should really be a call to fix the
JIT/AOT compilers.

Next steps: it'd be great to get this turned in to a bug although its not
clear to me whether this is a JDK issue (as it uses
GetPrimitiveArrayCritical) or a HotSpot performance issue related to
-Xcheck:jni (hence the cross post). We're happy to contribute the attached
patch but there should be greater consistency in libzip, were it applied,
as there are uses of GetPrimitiveArrayCritical in the
inflation/uncompressing code and elsewhere.

Thanks,
Ian Rogers - Google

Current non -Xcheck:jni performance:
Round 0 took 382ms
Round 1 took 340ms
Round 2 took 303ms
Round 3 took 256ms
Round 4 took 258ms
Round 5 took 255ms
Round 6 took 260ms
Round 7 took 257ms
Round 8 took 253ms
Round 9 took 246ms
Round 10 took 246ms
Round 11 took 247ms
Round 12 took 245ms
Round 13 took 245ms
Round 14 took 246ms
Round 15 took 244ms
Round 16 took 268ms
Round 17 took 248ms
Round 18 took 247ms
Round 19 took 247ms

Current -Xcheck:jni performance:
Round 0 took 99223ms
Round 1 took 56829ms
Round 2 took 55620ms
Round 3 took 55558ms
Round 4 took 55893ms
Round 5 took 55779ms
Round 6 took 56039ms
Round 7 took 54443ms
Round 8 took 126490ms
Round 9 took 124516ms
Round 10 took 125497ms
Round 11 took 125022ms
Round 12 took 126630ms
Round 13 took 122399ms
Round 14 took 125623ms
Round 15 took 125574ms
Round 16 took 123621ms
Round 17 took 125374ms
Round 18 took 124973ms
Round 19 took 125248ms

Patched non -Xcheck:jni performance:
Round 0 took 363ms
Round 1 took 382ms
Round 2 took 354ms
Round 3 took 274ms
Round 4 took 265ms
Round 5 took 264ms
Round 6 took 266ms
Round 7 took 255ms
Round 8 took 259ms
Round 9 took 245ms
Round 10 took 245ms
Round 11 took 245ms
Round 12 took 245ms
Round 13 took 244ms
Round 14 took 244ms
Round 15 took 245ms
Round 16 took 358ms
Round 17 took 318ms
Round 18 took 311ms
Round 19 took 251ms

Patched -Xcheck:jni performance:
Round 0 took 385ms
Round 1 took 373ms
Round 2 took 366ms
Round 3 took 297ms
Round 4 took 301ms
Round 5 took 303ms
Round 6 took 304ms
Round 7 took 308ms
Round 8 took 298ms
Round 9 took 287ms
Round 10 took 290ms
Round 11 took 396ms
Round 12 took 368ms
Round 13 took 328ms
Round 14 took 283ms
Round 15 took 282ms
Round 16 took 296ms
Round 17 took 283ms
Round 18 took 283ms
Round 19 took 281ms