[OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
Peter Levart
peter.levart at gmail.com
Wed Apr 10 10:17:58 UTC 2013
Hi Laurent,
Could you disable tiered compilation for performance tests? Tiered
compilation is usually a source of jitter in the results. Pass
-XX:-TieredCompilation to the VM.
Regards, Peter
On 04/10/2013 10:58 AM, Laurent Bourgès wrote:
> Dear Jim,
>
> 2013/4/9 Jim Graham <james.graham at oracle.com
> <mailto:james.graham at oracle.com>>
>
>
> The allocations will always show up on a heap profiler, I don't
> know of any way of having them not show up if they are stack
> allocated, but I don't think that stack allocation is the issue
> here - small allocations come out of a fast generation that costs
> almost nothing to allocate from and nearly nothing to clean up.
> They are actually getting allocated and GC'd, but the process is
> optimized.
>
> The only way to tell is to benchmark and see which changes make a
> difference and which are in the noise (or, in some odd
> counter-intuitive cases, counter-productive)...
>
> ...jim
>
>
> I advocate I like GC because it avoids in Java dealing with pointers
> like C/C++ does; however, I prefer GC clean real garbage
> (application...) than wasted memory:
> I prefer not count on GC when I can avoid wasting memory that gives GC
> more work = reduce useless garbage (save the planet) !
>
> Moreover, GC and / or Thread local allocation (TLAB) seems to have
> more overhead than you think = "fast generation that costs almost
> nothing to allocate from and nearly nothing to clean up".
>
> Here are my micro-benchmark results related to int[4] allocation where
> I mimic the AAShapePipe.fillParallelogram() method:
> Patch Ref Gain
> 5,96 8,27 138,76%
> 7,31 14,96 204,65%
> 10,65 20,4 191,55%
> 15,44 29,83 193,20%
>
>
> Test environment:
> Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
> JVM settings:
> -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m -Xmx128m
>
> Benchmark code (using Peter Levart microbench classes):
> http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/
> <http://jmmc.fr/%7Ebourgesl/share/AAShapePipe/microbench/>
>
> My conclusion is: "nothing" > zero (allocation + cleanup) and it is
> very noticeable in multi threading tests.
>
> I advocate that I use a dirty int[4] array (no cleanup) but it is not
> necessary : maybe the performance gain come from that reason.
>
>
> Finally here is the output with -XX:+PrintTLAB flag:
> TLAB: gc thread: 0x00007f105813d000 [id: 4053] desired_size: 1312KB
> slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills:
> 20 waste 1,2% gc: 323712B slow: 600B fast: 0B
> TLAB: gc thread: 0x00007f105813a800 [id: 4052] desired_size: 1312KB
> slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7
> waste 7,9% gc: 745568B slow: 176B fast: 0B
> TLAB: gc thread: 0x00007f1058138800 [id: 4051] desired_size: 1312KB
> slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills:
> 15 waste 3,1% gc: 618464B slow: 448B fast: 0B
> TLAB: gc thread: 0x00007f1058136800 [id: 4050] desired_size: 1312KB
> slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7
> waste 0,0% gc: 0B slow: 232B fast: 0B
> TLAB: gc thread: 0x00007f1058009000 [id: 4037] desired_size: 1312KB
> slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 1
> waste 27,5% gc: 369088B slow: 0B fast: 0B
> TLAB totals: thrds: 5 refills: 50 max: 20 slow allocs: 0 max 0
> waste: 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B
> max: 0B
>
> I would have expected that TLAB can recycle all useless int[4] arrays
> as fast as possible => waste = 100% ???
>
> *Is there any bug in TLAB (core-libs) ?
> Should I send such issue to hotspot team ?
> *
>
> *Test using ThreadLocal AAShapePipeContext:*
> {
> AAShapePipeContext ctx = getThreadContext();
> int abox[] = ctx.abox;
>
> // use array:
> // mimic: AATileGenerator aatg =
> renderengine.getAATileGenerator(x, y, dx1, dy1, dx2, dy2, 0, 0, clip,
> abox);
> abox[0] = 7;
> abox[1] = 11;
> abox[2] = 13;
> abox[3] = 17;
>
> // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg,
> abox);
> devNull1.yield(abox);
>
> if (!useThreadLocal) {
> restoreContext(ctx);
> }
> }
>
> -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
> -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags
> -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers
> -XX:+UseCompressedOops -XX:+UseParallelGC
> >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
> #-------------------------------------------------------------
> # ContextGetInt4: run duration: 10 000 ms
> #
> # Warm up:
> # 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op),
> Total ops = 2889056179 [ 13,93 (717199825), 13,87
> (720665624), 13,48 (741390545), 14,09 (709800185)]
> # 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op),
> Total ops = 2811615084 [ 15,21 (658351236), 14,18
> (706254551), 13,94 (718202949), 13,74 (728806348)]
> cleanup (explicit Full GC) ...
> cleanup done.
> # Measure:
> *1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops =
> 1678357614 [ 5,96 (1678357614)]
> 2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops =
> 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)]
> 3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops =
> 2817154340 [ 13,24 (755190111), 13,23 (755920429), 7,66
> (1306043800)]
> **4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops =
> 2589897733 [ 17,05 (586353618), 19,23 (519345153), 17,88
> (559401974), 10,81 *(924796988)]
> #
> << JVM END
>
> *Test using standard int[4] allocation:*
> {
> int abox[] = new int[4];
>
> // use array:
> // mimic: AATileGenerator aatg =
> renderengine.getAATileGenerator(x, y, dx1, dy1, dx2, dy2, 0, 0, clip,
> abox);
> abox[0] = 7;
> abox[1] = 11;
> abox[2] = 13;
> abox[3] = 17;
>
> // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg,
> abox);
> devNull1.yield(abox);
> }
>
> -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
> -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags
> -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers
> -XX:+UseCompressedOops -XX:+UseParallelGC
> >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
> #-------------------------------------------------------------
> # GetInt4: run duration: 10 000 ms
> #
> # Warm up:
> # 4 threads, Tavg = 31,07 ns/op (σ = 0,60 ns/op),
> Total ops = 1287292142 [ 30,26 (330475567), 31,92
> (313328449), 31,27 (319805520), 30,89 (323682606)]
> # 4 threads, Tavg = 30,94 ns/op (σ = 0,33 ns/op),
> Total ops = 1293000783 [ 30,92 (323382193), 30,61
> (326730340), 31,48 (317621402), 30,74 (325266848)]
> cleanup (explicit Full GC) ...
> cleanup done.
> # Measure:
> *1 threads, Tavg = 8,27 ns/op (σ = 0,00 ns/op), Total ops =
> 1209213909 [ 8,27 (1209213909)]
> 2 threads, Tavg = 14,96 ns/op (σ = 0,04 ns/op), Total ops =
> 1337024734 [ 15,00 (666659967), 14,92 (670364767)]
> 3 threads, Tavg = 20,40 ns/op (σ = 1,03 ns/op), Total ops =
> 1470560922 [ 21,21 (471592958), 19,00 (526302911), 21,16
> (472665053)]
> **4 threads, Tavg = 29,83 ns/op (σ = 1,82 ns/op), Total ops =
> 1340065128 [ 31,17 (320806983), 31,58 (316358130), 26,94
> (370806790), 30,11 *(332093225)]
> #
> << JVM END
>
> Best regards,
> Laurent
>
More information about the core-libs-dev
mailing list