[OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste

Wed Apr 10 11:20:23 UTC 2013

Peter,

1/ I modified your TestRunner class to print total ops and perform explicit
GC before runTests:
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

2/ I applied your advice but it does not change much:

             -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
             >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
             #-------------------------------------------------------------
             # ContextGetInt4: run duration: 10 000 ms
             #
             # Warm up:
             #           4 threads, Tavg =     13,84 ns/op (σ =   0,23
ns/op), Total ops =   2889056179 [    13,93 (717199825),     13,87
(720665624),     13,48 (741390545),     14,09 (709800185)]
             #           4 threads, Tavg =     14,25 ns/op (σ =   0,57
ns/op), Total ops =   2811615084 [    15,21 (658351236),     14,18
(706254551),     13,94 (718202949),     13,74 (728806348)]
             cleanup (explicit Full GC) ...
             cleanup done.
             # Measure:
             1 threads, Tavg =      5,96 ns/op (σ =   0,00 ns/op), Total
ops =   1678357614 [     5,96 (1678357614)]
             2 threads, Tavg =      7,33 ns/op (σ =   0,03 ns/op), Total
ops =   2729723450 [     7,31 (1369694121),      7,36 (1360029329)]
             3 threads, Tavg =     10,65 ns/op (σ =   2,73 ns/op), Total
ops =   2817154340 [    13,24 (755190111),     13,23 (755920429),      7,66
(1306043800)]
             4 threads, Tavg =     15,44 ns/op (σ =   3,33 ns/op), Total
ops =   2589897733 [    17,05 (586353618),     19,23 (519345153),     17,88
(559401974),     10,81 (924796988)]

             -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:-TieredCompilation -XX:+UseCompressedKlassPointers
-XX:+UseCompressedOops -XX:+UseParallelGC
             >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
             #-------------------------------------------------------------
             # GetInt4: run duration: 10 000 ms
             #
             # Warm up:
             #           4 threads, Tavg =     31,56 ns/op (σ =   0,43
ns/op), Total ops =   1267295706 [    31,30 (319512554),     31,02
(322293333),     32,12 (311334550),     31,82 (314155269)]
             #           4 threads, Tavg =     30,75 ns/op (σ =   1,81
ns/op), Total ops =   1302123211 [    32,21 (310949394),     32,37
(309275124),     27,87 (359125007),     31,01 (322773686)]
             cleanup (explicit Full GC) ...
             cleanup done.
             # Measure:
             1 threads, Tavg =      8,36 ns/op (σ =   0,00 ns/op), Total
ops =   1196238323 [     8,36 (1196238323)]
             2 threads, Tavg =     14,95 ns/op (σ =   0,04 ns/op), Total
ops =   1337648720 [    15,00 (666813210),     14,91 (670835510)]
             3 threads, Tavg =     20,65 ns/op (σ =   0,99 ns/op), Total
ops =   1453119707 [    19,57 (511100480),     21,97 (455262170),     20,54
(486757057)]
             4 threads, Tavg =     30,76 ns/op (σ =   0,54 ns/op), Total
ops =   1301090278 [    31,51 (317527231),     30,79 (324921525),     30,78
(325063322),     29,99 (333578200)]
             #
             << JVM END

3/ I tried several heap settings: without Xms/Xmx ... but it has almost no
effect.

*Should I play with TLAB resize / initial size ? or different GC collector
(G1 ...) ?

Does anybody can explain me what PrintTLAB mean ?*

Laurent

2013/4/10 Peter Levart <peter.levart at gmail.com>

>  Hi Laurent,
>
> Could you disable tiered compilation for performance tests? Tiered
> compilation is usually a source of jitter in the results. Pass
> -XX:-TieredCompilation to the VM.
>
> Regards, Peter
>
>
>
> On 04/10/2013 10:58 AM, Laurent Bourgès wrote:
>
>  Dear Jim,
>
> 2013/4/9 Jim Graham <james.graham at oracle.com>
>
>>
>> The allocations will always show up on a heap profiler, I don't know of
>> any way of having them not show up if they are stack allocated, but I don't
>> think that stack allocation is the issue here - small allocations come out
>> of a fast generation that costs almost nothing to allocate from and nearly
>> nothing to clean up.  They are actually getting allocated and GC'd, but the
>> process is optimized.
>>
>> The only way to tell is to benchmark and see which changes make a
>> difference and which are in the noise (or, in some odd counter-intuitive
>> cases, counter-productive)...
>>
>>                         ...jim
>>
>
> I advocate I like GC because it avoids in Java dealing with pointers like
> C/C++ does; however, I prefer GC clean real garbage (application...) than
> wasted memory:
> I prefer not count on GC when I can avoid wasting memory that gives GC
> more work = reduce useless garbage (save the planet) !
>
> Moreover, GC and / or Thread local allocation (TLAB) seems to have more
> overhead than you think = "fast generation that costs almost nothing to
> allocate from and nearly nothing to clean up".
>
>  Here are my micro-benchmark results related to int[4] allocation where I
> mimic the AAShapePipe.fillParallelogram() method:
>   Patch Ref Gain  5,96 8,27 138,76%  7,31 14,96 204,65%  10,65 20,4
> 191,55%  15,44 29,83 193,20%
> Test environment:
> Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
> JVM settings:
> -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m  -Xmx128m
>
> Benchmark code (using Peter Levart microbench classes):
> http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/
>
> My conclusion is:  "nothing" > zero (allocation + cleanup) and it is very
> noticeable in multi threading tests.
>
> I advocate that I use a dirty int[4] array (no cleanup) but it is not
> necessary : maybe the performance gain come from that reason.
>
>
> Finally here is the output with  -XX:+PrintTLAB flag:
> TLAB: gc thread: 0x00007f105813d000 [id: 4053] desired_size: 1312KB slow
> allocs: 0  refill waste: 20992B alloc: 1,00000    65600KB refills: 20
> waste  1,2% gc: 323712B slow: 600B fast: 0B
> TLAB: gc thread: 0x00007f105813a800 [id: 4052] desired_size: 1312KB slow
> allocs: 0  refill waste: 20992B alloc: 1,00000    65600KB refills: 7 waste
> 7,9% gc: 745568B slow: 176B fast: 0B
> TLAB: gc thread: 0x00007f1058138800 [id: 4051] desired_size: 1312KB slow
> allocs: 0  refill waste: 20992B alloc: 1,00000    65600KB refills: 15
> waste  3,1% gc: 618464B slow: 448B fast: 0B
> TLAB: gc thread: 0x00007f1058136800 [id: 4050] desired_size: 1312KB slow
> allocs: 0  refill waste: 20992B alloc: 1,00000    65600KB refills: 7 waste
> 0,0% gc: 0B slow: 232B fast: 0B
> TLAB: gc thread: 0x00007f1058009000 [id: 4037] desired_size: 1312KB slow
> allocs: 0  refill waste: 20992B alloc: 1,00000    65600KB refills: 1 waste
> 27,5% gc: 369088B slow: 0B fast: 0B
> TLAB totals: thrds: 5  refills: 50 max: 20 slow allocs: 0 max 0 waste:
> 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B
>
> I would have expected that TLAB can recycle all useless int[4] arrays as
> fast as possible => waste = 100% ???
>
> *Is there any bug in TLAB (core-libs) ?
> Should I send such issue to hotspot team ?
> *
>
> *Test using ThreadLocal AAShapePipeContext:*
> {
>     AAShapePipeContext ctx = getThreadContext();
>     int abox[] = ctx.abox;
>
>     // use array:
>     // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,
> dx1, dy1, dx2, dy2, 0, 0, clip, abox);
>     abox[0] = 7;
>     abox[1] = 11;
>     abox[2] = 13;
>     abox[3] = 17;
>
>     // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
>     devNull1.yield(abox);
>
>     if (!useThreadLocal) {
>         restoreContext(ctx);
>     }
> }
>
> -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
> -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
> -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
> >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
> #-------------------------------------------------------------
> # ContextGetInt4: run duration: 10 000 ms
> #
> # Warm up:
> #           4 threads, Tavg =     13,84 ns/op (σ =   0,23 ns/op), Total
> ops =   2889056179 [    13,93 (717199825),     13,87 (720665624),     13,48
> (741390545),     14,09 (709800185)]
> #           4 threads, Tavg =     14,25 ns/op (σ =   0,57 ns/op), Total
> ops =   2811615084 [    15,21 (658351236),     14,18 (706254551),     13,94
> (718202949),     13,74 (728806348)]
> cleanup (explicit Full GC) ...
> cleanup done.
> # Measure:
> *1 threads, Tavg =      5,96 ns/op (σ =   0,00 ns/op), Total ops =
> 1678357614 [     5,96 (1678357614)]
> 2 threads, Tavg =      7,33 ns/op (σ =   0,03 ns/op), Total ops =
> 2729723450 [     7,31 (1369694121),      7,36 (1360029329)]
> 3 threads, Tavg =     10,65 ns/op (σ =   2,73 ns/op), Total ops =
> 2817154340 [    13,24 (755190111),     13,23 (755920429),      7,66
> (1306043800)]
> **4 threads, Tavg =     15,44 ns/op (σ =   3,33 ns/op), Total ops =
> 2589897733 [    17,05 (586353618),     19,23 (519345153),     17,88
> (559401974),     10,81 *(924796988)]
> #
> << JVM END
>
> *Test using standard int[4] allocation:*
> {
>     int abox[] = new int[4];
>
>       // use array:
>     // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,
> dx1, dy1, dx2, dy2, 0, 0, clip, abox);
>     abox[0] = 7;
>     abox[1] = 11;
>     abox[2] = 13;
>     abox[3] = 17;
>
>     // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
>     devNull1.yield(abox);
> }
>
> -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
> -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
> -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
> >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
> #-------------------------------------------------------------
> # GetInt4: run duration: 10 000 ms
> #
> # Warm up:
> #           4 threads, Tavg =     31,07 ns/op (σ =   0,60 ns/op), Total
> ops =   1287292142 [    30,26 (330475567),     31,92 (313328449),     31,27
> (319805520),     30,89 (323682606)]
> #           4 threads, Tavg =     30,94 ns/op (σ =   0,33 ns/op), Total
> ops =   1293000783 [    30,92 (323382193),     30,61 (326730340),     31,48
> (317621402),     30,74 (325266848)]
> cleanup (explicit Full GC) ...
> cleanup done.
> # Measure:
> *1 threads, Tavg =      8,27 ns/op (σ =   0,00 ns/op), Total ops =
> 1209213909 [     8,27 (1209213909)]
> 2 threads, Tavg =     14,96 ns/op (σ =   0,04 ns/op), Total ops =
> 1337024734 [    15,00 (666659967),     14,92 (670364767)]
> 3 threads, Tavg =     20,40 ns/op (σ =   1,03 ns/op), Total ops =
> 1470560922 [    21,21 (471592958),     19,00 (526302911),     21,16
> (472665053)]
> **4 threads, Tavg =     29,83 ns/op (σ =   1,82 ns/op), Total ops =
> 1340065128 [    31,17 (320806983),     31,58 (316358130),     26,94
> (370806790),     30,11 *(332093225)]
> #
> << JVM END
>
> Best regards,
> Laurent
>
>
>