GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream

Mon Mar 5 06:33:20 UTC 2018

<swapped jdk-dev for core-libs-dev>

Hi Ian,

Do you run with -Xcheck:jni in production mode because you load unknown 
native code libraries? It's mainly intended as a diagnostic option to 
turn on if you encounter a possible JNI problem.

I'll leave the debate on your actual patch proposal to others more 
familiar with the library code.

Thanks,
David

On 5/03/2018 5:24 AM, Ian Rogers wrote:
> Hi,
> 
> we've been encountering poor performance with -Xcheck:jni, for the
> following example the performance is 140x to 510x slower with the flag
> enabled:
> 
>>>>>
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.util.Random;
> import java.util.zip.DeflaterOutputStream;
> 
> public final class CheckJniTest {
>   static void deflateBytesPerformance() throws IOException {
>     byte[] inflated = new byte[1 << 23];
>     new Random(71917367).nextBytes(inflated);
>     ByteArrayOutputStream deflated = new ByteArrayOutputStream();
>     try (DeflaterOutputStream dout = new DeflaterOutputStream(deflated)) {
>       dout.write(inflated, 0, inflated.length);
>     }
>     if (8391174 != deflated.size()) {
>       throw new AssertionError();
>     }
>   }
> 
>   public static void main(String args[]) throws IOException {
>     int n = 5;
>     if (args.length > 0) {
>       n = Integer.parseInt(args[0]);
>     }
>     for (int i = 0; i < n; i++) {
>       long startTime = System.currentTimeMillis();
>       deflateBytesPerformance();
>       long endTime = System.currentTimeMillis();
>       System.out.println("Round " + i + " took " + (endTime - startTime) +
>                          "ms");
>     }
>   }
> }
> <<<<
> 
> The issue is in the libzip Deflater.c JNI code:
> http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libzip/Deflater.c#l131
> 
> The test has an 8MB array to deflate/compress. The DeflaterOutputStream has
> an buffer of size 512 bytes:
> http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/classes/java/util/zip/DeflaterOutputStream.java#l128
> 
> To compress the array, 16,384 native calls are made that use the 8MB input
> array and the 512 byte output array. These arrays are accessed using
> GetPrimitiveArrayCritical that with -Xcheck:jni copies the array:
> http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/hotspot/share/prims/jniCheck.cpp#l1862
> The copying of the arrays leads to 128GB of copying which dominates
> execution time.
> 
> One approach to fix the problem is to rewrite libzip in Java. GNU Classpath
> has such an implementation:
> http://cvs.savannah.gnu.org/viewvc/classpath/classpath/java/util/zip/Deflater.java?view=markup#l417
> 
> A different approach is to use Get/SetByteArrayRegion (using
> Get/SetByteArrayElements would be no different than the current approach
> accept potentially causing more copies). I've included a patch and
> performance data for this approach below where regions of the arrays are
> copied onto a 512 byte buffer on the stack. The non -Xcheck:jni performance
> is roughly equivalent before and after the patch, the -Xcheck:jni
> performance is now similar to the non -Xcheck:jni performance.
> 
> The choice to go from a using GetPrimitiveArrayCritical to
> GetByteArrayRegion is a controversial one, as engineers have many different
> expectations about what critical means and does. GetPrimitiveArrayCritical
> may have similar overhead to GetByteArrayElements if primitive arrays
> (possibly of a certain size) are known to be non-moving. There may be a
> cost to pin critical arrays or regions they exist within. There may be a
> global or region lock that is in play that can cause interactions with the
> garbage collector - such interactions may cause tail latency issues in
> production environments. GetByteArrayRegion loses compared to
> GetPrimitiveArrayCritical as it must always copy a region of the array for
> access. Given these issues it is hard to develop a benchmark of
> GetPrimitiveArrayCritical
> vs GetByteArrayRegion that can take into account the GC interactions. Most
> benchmarks will see that avoiding a copy can be good for performance.
> 
> For more background, Aleksey has a good post on GetPrimitiveArrayCritical
> here:
> https://shipilev.net/jvm-anatomy-park/9-jni-critical-gclocker/
> 
> A different solution to the performance problem presented here is to change
> the check JNI implementation to do less copying of arrays. This would be
> motivated if GetPrimitiveArrayCritical were expected to be used more widely
> than GetByteArrayRegion in performance sensitive, commonly used code. Given
> the range of problems possible with GetPrimitiveArrayCritical I'd
> personally prefer GetByteArrayRegion to be more commonly used, as I'm yet
> to see a performance problem that made GetPrimitiveArrayCritical so
> compelling. For example, ObjectOutputStream has burnt me previously:
> http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libjava/ObjectOutputStream.c#l69
> and these trivial copy operations, should really be a call to fix the
> JIT/AOT compilers.
> 
> Next steps: it'd be great to get this turned in to a bug although its not
> clear to me whether this is a JDK issue (as it uses
> GetPrimitiveArrayCritical) or a HotSpot performance issue related to
> -Xcheck:jni (hence the cross post). We're happy to contribute the attached
> patch but there should be greater consistency in libzip, were it applied,
> as there are uses of GetPrimitiveArrayCritical in the
> inflation/uncompressing code and elsewhere.
> 
> Thanks,
> Ian Rogers - Google
> 
> Current non -Xcheck:jni performance:
> Round 0 took 382ms
> Round 1 took 340ms
> Round 2 took 303ms
> Round 3 took 256ms
> Round 4 took 258ms
> Round 5 took 255ms
> Round 6 took 260ms
> Round 7 took 257ms
> Round 8 took 253ms
> Round 9 took 246ms
> Round 10 took 246ms
> Round 11 took 247ms
> Round 12 took 245ms
> Round 13 took 245ms
> Round 14 took 246ms
> Round 15 took 244ms
> Round 16 took 268ms
> Round 17 took 248ms
> Round 18 took 247ms
> Round 19 took 247ms
> 
> Current -Xcheck:jni performance:
> Round 0 took 99223ms
> Round 1 took 56829ms
> Round 2 took 55620ms
> Round 3 took 55558ms
> Round 4 took 55893ms
> Round 5 took 55779ms
> Round 6 took 56039ms
> Round 7 took 54443ms
> Round 8 took 126490ms
> Round 9 took 124516ms
> Round 10 took 125497ms
> Round 11 took 125022ms
> Round 12 took 126630ms
> Round 13 took 122399ms
> Round 14 took 125623ms
> Round 15 took 125574ms
> Round 16 took 123621ms
> Round 17 took 125374ms
> Round 18 took 124973ms
> Round 19 took 125248ms
> 
> Patched non -Xcheck:jni performance:
> Round 0 took 363ms
> Round 1 took 382ms
> Round 2 took 354ms
> Round 3 took 274ms
> Round 4 took 265ms
> Round 5 took 264ms
> Round 6 took 266ms
> Round 7 took 255ms
> Round 8 took 259ms
> Round 9 took 245ms
> Round 10 took 245ms
> Round 11 took 245ms
> Round 12 took 245ms
> Round 13 took 244ms
> Round 14 took 244ms
> Round 15 took 245ms
> Round 16 took 358ms
> Round 17 took 318ms
> Round 18 took 311ms
> Round 19 took 251ms
> 
> Patched -Xcheck:jni performance:
> Round 0 took 385ms
> Round 1 took 373ms
> Round 2 took 366ms
> Round 3 took 297ms
> Round 4 took 301ms
> Round 5 took 303ms
> Round 6 took 304ms
> Round 7 took 308ms
> Round 8 took 298ms
> Round 9 took 287ms
> Round 10 took 290ms
> Round 11 took 396ms
> Round 12 took 368ms
> Round 13 took 328ms
> Round 14 took 283ms
> Round 15 took 282ms
> Round 16 took 296ms
> Round 17 took 283ms
> Round 18 took 283ms
> Round 19 took 281ms
>