From bourges.laurent at gmail.com Thu Sep 10 22:53:46 2015
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Fri, 11 Sep 2015 00:53:46 +0200
Subject: [OpenJDK Rasterizer] Marlin #4
Message-ID:

Jim,

Here is the first webrev improving copyAARow() on large shapes (pixel loops):
http://cr.openjdk.java.net/~lbourges/marlin/marlin-s4.0/

Note: I also incorporated a few changes that force cleanup when a runtime exception happens within pathTo(): see MarlinRenderingEngine, Stroker, Dasher.

I admit it is not yet completely ready (cleanup, log statements), but I wanted to show the new algorithm and its variants.

copyAARow now has 4 variants:
- RLE-encoded or uncompressed alpha values
- Both can use block flags to process only the small touched pixel blocks (like tiles, but only 1D), which boosts simple but large shapes!

To compare JDK8 vs OpenJDK9 performance, I added several properties to help me test the different combinations.

Please give me your first comments (overview).
I tested them with my regression tests and all variants are now OK.

Here are a few results on my machine:

Common settings below:
-Dsun.java2d.renderer.enableRLE=true
-Dsun.java2d.renderer.forceRLE=false
-Dsun.java2d.renderer.forceNoRLE=false
-Dsun.java2d.renderer.useTileFlags=true
-Dsun.java2d.renderer.useTileFlags.onlyRLE=true
-Dsun.java2d.renderer.useTileFlags.useHeuristics=false

JDK1.8-60:
Test                                      Threads  Ops   Med       Pct95     Avg       StdDev  Min       Max       TotalOps  [ms/op]
CircleTests.ser                           1        162   64.864    65.197    64.911    0.429   64.513    69.491    162
EllipseTests-fill-false.ser               1        33    317.290   319.177   317.712   1.343   317.125   324.763   33
EllipseTests-fill-true.ser                1        25    451.659   452.124   451.684   0.292   451.229   452.627   25
dc_boulder_2013-13-30-06-13-17.ser        1        114   92.479    92.893    92.450    0.293   91.707    93.299    114
dc_boulder_2013-13-30-06-13-20.ser        1        220   47.785    48.701    47.822    0.528   47.154    51.906    220
dc_shp_alllayers_2013-00-30-07-00-43.ser  1        256   40.947    41.317    40.918    0.482   40.344    47.078    256
dc_shp_alllayers_2013-00-30-07-00-47.ser  1        25    785.488   787.065   785.636   1.351   783.761   791.016   25
dc_spearfish_2013-11-30-06-11-15.ser      1        811   12.966    13.035    12.980    0.163   12.939    17.548    811
dc_spearfish_2013-11-30-06-11-19.ser      1        1607  6.541     6.608     6.552     0.202   6.524     14.592    1607
dc_topp:states_2013-11-30-06-11-06.ser    1        849   12.300    12.376    12.312    0.037   12.277    12.581    849
dc_topp:states_2013-11-30-06-11-07.ser    1        1180  7.474     7.644     7.509     0.062   7.451     7.855     1180
spiralTest-dash-false.ser                 1        25    1250.577  1256.383  1250.851  5.072   1242.596  1266.559  25
test_z_625k.ser                           1        64    162.223   163.661   162.438   0.598   161.734   164.103   64
Scores:
Tests    13       13
Threads  1        1
Pct95    251.245  251.245

OpenJDK9:
Test                                      Threads  Ops   Med       Pct95     Avg       StdDev  Min       Max       TotalOps  [ms/op]
CircleTests.ser                           1        163   64.128    64.353    64.149    0.290   63.757    67.442    163
EllipseTests-fill-false.ser               1        35    295.859   296.245   295.924   0.211   295.542   296.503   35
EllipseTests-fill-true.ser                1        25    491.937   492.165   491.936   0.193   491.662   492.591   25
dc_boulder_2013-13-30-06-13-17.ser        1        114   92.035    92.524    92.151    0.777   91.704    100.023   114
dc_boulder_2013-13-30-06-13-20.ser        1        219   47.851    48.228    47.893    0.202   47.447    48.767    219
dc_shp_alllayers_2013-00-30-07-00-43.ser  1        255   41.116    41.343    41.134    0.481   40.738    48.298    255
dc_shp_alllayers_2013-00-30-07-00-47.ser  1        25    800.950   802.572   800.765   1.206   797.561   803.335   25
dc_spearfish_2013-11-30-06-11-15.ser      1        801   13.130    13.262    13.149    0.055   13.105    13.592    801
dc_spearfish_2013-11-30-06-11-19.ser      1        1583  6.635     6.649     6.643     0.141   6.618     12.165    1583
dc_topp:states_2013-11-30-06-11-06.ser    1        845   12.452    12.569    12.472    0.042   12.428    12.671    845
dc_topp:states_2013-11-30-06-11-07.ser    1        1398  7.521     7.611     7.543     0.166   7.498     13.569    1398
spiralTest-dash-false.ser                 1        25    1256.571  1265.870  1258.259  5.417   1252.840  1277.458  25
test_z_625k.ser                           1        64    162.690   163.551   162.766   0.370   162.036   163.908   64
Scores:
Tests    13       13
Threads  1        1
Pct95    254.380  254.380

Best settings on JDK1.8.60:
-Dsun.java2d.renderer.enableRLE=true
-Dsun.java2d.renderer.forceRLE=false
-Dsun.java2d.renderer.forceNoRLE=false
-Dsun.java2d.renderer.useTileFlags=true
-Dsun.java2d.renderer.useTileFlags.onlyRLE=true
-Dsun.java2d.renderer.useTileFlags.useHeuristics=true

Test                                      Threads  Ops   Med       Pct95     Avg       StdDev  Min       Max       TotalOps  [ms/op]
CircleTests.ser                           1        159   66.061    66.302    66.084    0.170   65.741    67.157    159
EllipseTests-fill-false.ser               1        35    299.068   299.602   299.116   0.297   298.702   300.086   35
EllipseTests-fill-true.ser                1        25    434.568   437.110   434.871   0.875   434.375   437.897   25
dc_boulder_2013-13-30-06-13-17.ser        1        113   92.769    93.373    92.807    0.318   91.836    93.878    113
dc_boulder_2013-13-30-06-13-20.ser        1        218   48.027    48.466    48.074    0.617   47.302    56.302    218
dc_shp_alllayers_2013-00-30-07-00-43.ser  1        258   40.858    41.522    40.944    0.228   40.747    41.948    258
dc_shp_alllayers_2013-00-30-07-00-47.ser  1        25    805.037   810.135   804.379   4.719   795.737   815.035   25
dc_spearfish_2013-11-30-06-11-15.ser      1        799   13.031    13.149    13.053    0.049   12.994    13.297    799
dc_spearfish_2013-11-30-06-11-19.ser      1        1591  6.606     6.686     6.623     0.191   6.585     14.142    1591
dc_topp:states_2013-11-30-06-11-06.ser    1        835   12.584    12.675    12.575    0.075   12.375    12.788    835
dc_topp:states_2013-11-30-06-11-07.ser    1        1366  7.708     7.754     7.691     0.256   7.507     16.861    1366
spiralTest-dash-false.ser                 1        25    1282.635  1291.162  1284.327  5.325   1275.512  1301.505  25
test_z_625k.ser                           1        65    158.586   162.356   159.298   1.499   158.204   166.762   65
Scores:
Tests    13       13
Threads  1        1
Pct95    253.100  253.100

Ductus on JDK1.8.60:
Test                                      Threads  Ops   Med       Pct95     Avg       StdDev  Min       Max       TotalOps  [ms/op]
CircleTests.ser                           1        148   69.971    71.418    70.068    0.719   68.369    72.031    148
EllipseTests-fill-false.ser               1        35    297.560   299.328   297.480   1.093   295.417   299.590   35
EllipseTests-fill-true.ser                1        25    453.612   456.290   453.589   1.813   448.936   456.817   25
dc_boulder_2013-13-30-06-13-17.ser        1        93    112.865   113.419   112.880   0.277   112.377   113.459   93
dc_boulder_2013-13-30-06-13-20.ser        1        183   56.944    57.521    56.987    0.260   56.528    58.187    183
dc_shp_alllayers_2013-00-30-07-00-43.ser  1        220   47.955    48.555    47.975    0.346   47.223    49.203    220
dc_shp_alllayers_2013-00-30-07-00-47.ser  1        25    1056.025  1058.306  1056.215  1.079   1054.813  1058.515  25
dc_spearfish_2013-11-30-06-11-15.ser      1        628   16.798    17.095    16.837    0.125   16.633    17.343    628
dc_spearfish_2013-11-30-06-11-19.ser      1        1354  7.605     7.896     7.663     0.104   7.553     8.217     1354
dc_topp:states_2013-11-30-06-11-06.ser    1        616   16.988    17.097    16.980    0.086   16.737    17.513    616
dc_topp:states_2013-11-30-06-11-07.ser    1        931   11.319    11.397    11.304    0.066   11.052    11.479    931
spiralTest-dash-false.ser                 1        25    1391.196  1395.741  1391.765  2.908   1387.115  1400.945  25
test_z_625k.ser                           1        50    208.874   209.563   208.850   0.439   206.910   209.900   50
Scores:
Tests    13       13
Threads  1        1
Pct95    289.510  289.510

Marlin 0.7.0 on JDK1.8.60:
Test                                      Threads  Ops   Med       Pct95     Avg       StdDev  Min       Max       TotalOps  [ms/op]
CircleTests.ser                           1        158   66.573    66.946    66.610    0.230   66.231    67.986    158
EllipseTests-fill-false.ser               1        25    518.527   519.683   518.957   1.780   518.350   527.552   25
EllipseTests-fill-true.ser                1        25    910.986   911.630   911.034   0.439   910.128   912.556   25
dc_boulder_2013-13-30-06-13-17.ser        1        112   93.333    93.866    93.345    0.286   92.669    94.479    112
dc_boulder_2013-13-30-06-13-20.ser        1        216   48.277    48.668    48.248    0.291   47.472    49.297    216
dc_shp_alllayers_2013-00-30-07-00-43.ser  1        260   40.133    40.861    40.277    0.326   39.963    41.228    260
dc_shp_alllayers_2013-00-30-07-00-47.ser  1        25    809.985   816.815   810.538   3.590   803.921   817.493   25
dc_spearfish_2013-11-30-06-11-15.ser      1        801   13.077    13.184    13.092    0.045   13.051    13.419    801
dc_spearfish_2013-11-30-06-11-19.ser      1        1578  6.657     6.714     6.664     0.026   6.642     6.985     1578
dc_topp:states_2013-11-30-06-11-06.ser    1        832   12.641    12.654    12.631    0.035   12.405    12.663    832
dc_topp:states_2013-11-30-06-11-07.ser    1        1351  7.771     7.782     7.757     0.038   7.573     7.808     1351
spiralTest-dash-false.ser                 1        25    1262.443  1274.194  1263.859  4.385   1259.201  1276.358  25
test_z_625k.ser                           1        61    169.885   170.539   169.875   0.376   169.155   170.638   61
Scores:
Tests    13       13
Threads  1        1
Pct95    306.426  306.426

As you can see, OpenJDK9 seems to be a bit slower when filling huge ellipses (MaskFill changes?).

Anyway, the new patch (using RLE + block flags) provides performance comparable to Ductus.

Cheers,
Laurent

From james.graham at oracle.com Thu Sep 17 23:45:39 2015
From: james.graham at oracle.com (Jim Graham)
Date: Thu, 17 Sep 2015 16:45:39 -0700
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To:
References:
Message-ID: <55FB50A3.8070806@oracle.com>

Hi Laurent,

Sorry it took me so long to get around to this...

MarlinConst.java, line 86 - "+2 explain"?

MarlinProperties.java - indentation on && continuations should be 4 spaces and/or line up the operands (as in:
    return isEnableRLE() &&
           isSomethingElse...;
OR
    return isEnableRLE()
        && isSomethingElse...;
)

Renderer.java - edgeNewSize - why did this use to be long and why downgrade to int now?

Renderer.java - tosubpix() - did removing static help?

Renderer.java, line 1057,1163 - for the case of producing -1 or 0 from the sign bit, you could use:
    ((err >> 31) << 1)

Renderer.java, line 1288,1350 - don't you need to mark tile for (pix_x+1) in case it is a new tile?

Renderer.java, line 1308,1370 - don't you need to mark all tiles from x0 to x1?

TransformingPC2D - why make all the inner classes private? I'm wary of private for inner classes because sometimes it forces the compiler to insert internal accessor methods for methods which are "semantically accessible" according to the rules of inner classes, but the standard inter-class access was marked private.

I'm still reviewing the new RLE stuff, but wanted to get this much out there for now. A couple of inline comments below...

On 9/10/15 3:53 PM, Laurent Bourgès wrote:
> Jim,
>
> Here is the first webrev improving copyAARow() on large shapes (pixel
> loops):
> http://cr.openjdk.java.net/~lbourges/marlin/marlin-s4.0/
>
> I advocate it is not yet completly ready (cleanup, log statement) but I
> wanted to show the new algorithm & variants:
> copyAARow uses now 4 variants:
> - RLE encoding or uncompress alpha values
> - Both can use block flags to only process small touched pixel blocks
> (like tiles but only 1D) that boosts simple but large shapes !

Whoops, I guess I reviewed the wrong stuff (all the glue rather than the RLE algorithms themselves - D'oh!).

> Please give me your first comments (overview).
> I tested them with my regression tests and all variants are now OK.
>
> Here are few results on my machine:

Unfortunately, the giant tables of numbers came through on my end as just a mish-mosh of digits in confusing columns. Can you summarize or try to express it as a smaller ascii table or a real HTML ?

...jim

From bourges.laurent at gmail.com Mon Sep 21 21:15:47 2015
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Mon, 21 Sep 2015 23:15:47 +0200
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To: <55FB50A3.8070806@oracle.com>
References: <55FB50A3.8070806@oracle.com>
Message-ID:

Jim,

I would like your point of view on the new algorithms that store alpha values efficiently, and some advice on heuristics / metrics to make the adaptive approach more efficient and robust.

I hope to have some spare time soon to spend on improving this patch...

Here are a few more explanations that may help you:

1. I figured out that RLE encoding is only faster if run lengths are long (high repeats). That corresponds to few crossings but large pixel spans. For small shapes or highly complex ones, it is better to leave alpha values uncompressed.

From this assumption, I adopted an adaptive approach based on a simple heuristic: see MarlinCache.useRLE

    // Hack to switch RLE encoding ON or OFF:
    // sparse density:
    useRLE = (width >= 128) && (((maxy - miny) * width) >= (primCount << 5)); // larger than 64x64

It only enables RLE if the shape is at least 128 pixels wide and not too complex:
    width >= (32 * primitive count) / height

Of course, it is very rough, as primitive count / height is ~ mean(crossings)!
If you have a better idea for estimating the crossing density per pixel, please tell me.

2. Moreover, I implemented a new tricky approach: block flags, where 1 means the block has crossings and 0 means it has none.
It helps to traverse and test fewer pixels, i.e. only blocks with flag = 1 are really useful (the others are always full of zeros).
It lets me use an efficient Arrays.fill (noRLE variant) or larger run lengths (RLE variant), as nothing varies in blocks flagged 0.

Of course other heuristics are needed, as this only provides gains if blocks can be skipped (pixel steps > 32), and to reduce the overhead of flagging blocks in the scanline processing loop.
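
A minimal Java sketch of points 1 and 2 above, not the Marlin implementation. The useRLE test is copied from the message; the class name, the 32-pixel block size, the touch/expand helpers, and the representation of a row as per-pixel alpha deltas are assumptions made for illustration only.

```java
import java.util.Arrays;

class AdaptiveAlphaSketch {
    // Assumed block size: 32 pixels (2^5), matching the "pixel steps > 32" remark.
    static final int BLOCK_SIZE_LG = 5;
    static final int BLOCK_SIZE = 1 << BLOCK_SIZE_LG;

    // Heuristic quoted in the message: wide shape and low crossing density.
    static boolean useRLE(int width, int miny, int maxy, int primCount) {
        return (width >= 128) && (((maxy - miny) * width) >= (primCount << 5));
    }

    // Number of block flags needed to cover a row of the given width.
    static int blockCount(int width) {
        return (width + BLOCK_SIZE - 1) >> BLOCK_SIZE_LG;
    }

    // Records an alpha change at pixel x and flags the block that contains it.
    static void touch(int[] deltas, int[] blkFlags, int x, int delta) {
        deltas[x] += delta;
        blkFlags[x >> BLOCK_SIZE_LG] = 1;
    }

    // Expands per-pixel deltas into absolute alpha values (hypothetical noRLE path).
    // Blocks flagged 0 contain no change, so they are filled in one call.
    static int[] expand(int[] deltas, int[] blkFlags) {
        int[] out = new int[deltas.length];
        int alpha = 0;
        for (int b = 0; b < blkFlags.length; b++) {
            int from = b << BLOCK_SIZE_LG;
            int to = Math.min(from + BLOCK_SIZE, deltas.length);
            if (blkFlags[b] == 0) {
                Arrays.fill(out, from, to, alpha); // constant alpha over the block
            } else {
                for (int x = from; x < to; x++) {
                    alpha += deltas[x];
                    out[x] = alpha;
                }
            }
        }
        return out;
    }
}
```

With a 100-pixel row and changes at pixels 10 and 70, only blocks 0 and 2 are walked pixel by pixel; blocks 1 and 3 are filled in bulk.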

2.1: endRendering(), called once per shape: block flags can be enabled either based on the previous useRLE heuristic (= cache.useRLE) or if width >= 64 (at least 2 tiles), so maybe 1 tile may be empty?

    // Heuristics for using block flags:
    if (ENABLE_TILE_FLAGS) {
        if (ENABLE_TILE_FLAGS_ONLY_RLE) {
            enableTileFlags = this.cache.useRLE;
        } else {
            enableTileFlags = ((pmaxX - pminX) >= 64); // 64
            // TODO: check numCrossings (later) ?
        }
    }

2.2: endRendering(), per scanline (quick test): if enableTileFlags was set before, at every scanline I quickly check whether the pixel width is really > 64 and there are few crossings => a better probability of having large steps.

    // fast condition:
    useBlkFlags &= (numCrossings <= 10) && (pix_maxX - pix_minX) >= 512; // 64px ie 3 tiles

Ideas are welcome to refine such metrics.

Quick comments below:

> Sorry it took me so long to get around to this...

I hope you have some time soon to help me with that [re|over]view.

> MarlinConst.java, line 86 - "+2 explain"?

Sorry, cleanup needed.
In copyAARowNoRLE_WithTileFlags, I use the trick of filling the rowAAChunk array in multiples of 32 (tiles) so that the SuperWord optimization rocks!
However, I am adding +1 or +2 to indices that may exceed the array length. To be checked.

> MarlinProperties.java - indentation on && continuations should be 4 spaces and/or line up the operands (as in:
>     return isEnableRLE() &&
>            isSomethingElse...;
> OR
>     return isEnableRLE()
>         && isSomethingElse...;
> )

To be fixed.

> Renderer.java - edgeNewSize - why did this use to be long and why downgrade to int now?

To be fixed. However, edgebuckets stores indices as integers, so the off-heap array is still only accessible up to 2M even if it can store a lot more.

> Renderer.java - tosubpix() - did removing static help?

Not really, sorry.

> Renderer.java, line 1057,1163 - for the case of producing -1 or 0 from the sign bit, you could use:
> ((err >> 31) << 1)

I tried both solutions, but neither is faster.
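
For reference, a small Java sketch of what the shift expressions in this exchange evaluate to. The class and method names are hypothetical, and how Renderer.java actually consumes the value is not shown in this thread. In Java, `>>` is an arithmetic shift, so `err >> 31` is -1 for negative `err` and 0 otherwise; the extra `<< 1` turns that into -2 or 0.

```java
class SignBitSketch {
    // Arithmetic right shift by 31 replicates the sign bit: -1 if err < 0, else 0.
    static int signMask(int err) {
        return err >> 31;
    }

    // The expression quoted in the review: the mask shifted left by one, i.e. -2 or 0.
    static int shifted(int err) {
        return (err >> 31) << 1;
    }

    // Branching equivalent of shifted(), for comparison.
    static int branching(int err) {
        return (err < 0) ? -2 : 0;
    }
}
```

Both forms are branch-free candidates that a JIT may already compile to similar code, which would be consistent with the reply that neither was measurably faster.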

> Renderer.java, line 1288,1350 - don't you need to mark tile for (pix_x+1) in case it is a new tile?

No need, as I always traverse pixels [px0; px1] where px1 = (t << TILE_SIZE_LG) + 1, to ensure the last pixel is included.

> Renderer.java, line 1308,1370 - don't you need to mark all tiles from x0 to x1?

Not wanted: I only want to flag tiles that contain alpha variations; alpha is constant between ]x0+1, x1[.

> TransformingPC2D - why make all the inner classes private? I'm wary of private for inner classes because sometimes it forces the compiler to insert internal accessor methods for methods which are "semantically accessible" according to the rules of inner classes, but the standard inter-class access was marked private.

Sorry, to be fixed.

> I'm still reviewing the new RLE stuff, but wanted to get this much out there for now.

Thanks for the first pass; I left many typos / experiments from that intensive working session.

>> Here is the first webrev improving copyAARow() on large shapes (pixel
>> loops):
>> http://cr.openjdk.java.net/~lbourges/marlin/marlin-s4.0/
>>
>> I advocate it is not yet completly ready (cleanup, log statement) but I
>> wanted to show the new algorithm & variants:
>> copyAARow uses now 4 variants:
>> - RLE encoding or uncompress alpha values
>> - Both can use block flags to only process small touched pixel blocks
>> (like tiles but only 1D) that boosts simple but large shapes !
>
> Whoops, I guess I reviewed the wrong stuff (all the glue rather than the RLE algorithms themselves - D'oh!).

No problem, I am looking forward to your comments on the block traversal / RLE algorithms...

>> Please give me your first comments (overview).
>> I tested them with my regression tests and all variants are now OK.

It seems that using ONLY the variant [noRLE + tile flags] still produces artefacts: to be fixed asap.

>> Here are few results on my machine:
>
> Unfortunately, the giant tables of numbers came through on my end as just a mish-mosh of digits in confusing columns. Can you summarize or try to express it as a smaller ascii table or a real HTML ?

Here is a summary showing only my ellipse draw / fill tests (radius = 1 to 2000):

Marlin 0.7.0 on JDK1.8.60:
Test                         Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      TotalOps  [ms/op]
EllipseTests-fill-false.ser  1        25   518.527  519.683  518.957  1.780   518.350  527.552  25
EllipseTests-fill-true.ser   1        25   910.986  911.630  911.034  0.439   910.128  912.556  25

New patch Marlin 0.7.1, best settings on JDK1.8.60:
Test                         Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      TotalOps  [ms/op]
EllipseTests-fill-false.ser  1        35   299.068  299.602  299.116  0.297   298.702  300.086  35
EllipseTests-fill-true.ser   1        25   434.568  437.110  434.871  0.875   434.375  437.897  25

OpenJDK9:
Test                         Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      TotalOps  [ms/op]
EllipseTests-fill-false.ser  1        35   295.859  296.245  295.924  0.211   295.542  296.503  35
EllipseTests-fill-true.ser   1        25   491.937  492.165  491.936  0.193   491.662  492.591  25

Ductus on JDK1.8.60:
Test                         Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      TotalOps  [ms/op]
EllipseTests-fill-false.ser  1        35   297.560  299.328  297.480  1.093   295.417  299.590  35
EllipseTests-fill-true.ser   1        25   453.612  456.290  453.589  1.813   448.936  456.817  25

Conclusion:
The new patch seems promising, as it is very close to Ductus performance.
Filling ellipses seems slower on OpenJDK9 (492 / 437 = 12% slower)! Any MaskFill changes?

Regards,
Laurent

From mark.reinhold at oracle.com Wed Sep 23 17:10:50 2015
From: mark.reinhold at oracle.com (mark.reinhold at oracle.com)
Date: Wed, 23 Sep 2015 10:10:50 -0700 (PDT)
Subject: [OpenJDK Rasterizer] JEP 265: Marlin Graphics Renderer
Message-ID: <20150923171050.1DB5B7A203@eggemoggin.niobe.net>

New JEP Candidate: http://openjdk.java.net/jeps/265

- Mark

From james.graham at oracle.com Thu Sep 24 00:46:41 2015
From: james.graham at oracle.com (Jim Graham)
Date: Wed, 23 Sep 2015 17:46:41 -0700
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To:
References: <55FB50A3.8070806@oracle.com>
Message-ID: <560347F1.9030101@oracle.com>

Hi Laurent,

On 9/21/15 2:15 PM, Laurent Bourgès wrote:
> Here is a summary showing only my ellipse draw / fill tests (radius = 1
> to 2000):

As you can see below, the table is still mangled, but due to fewer columns I was able to piece things together.

> Marlin 0.7.0 on JDK1.8.60:
> Test Threads Ops Med
> *Pct95* Avg StdDev Min Max TotalOps [ms/op]
> EllipseTests-fill-false.ser 1 25 518.527
> *519.683* 518.957 1.780 518.350 527.552 25
> EllipseTests-fill-true.ser 1 25 910.986
> *911.630* 911.034 0.439 910.128 912.556 25

Is there some reason these runs were 25/25 ops rather than the 35/25 ops for the other 3 runs below?

> New patch Marlin 0.7.1:
> Best settings on JDK1.8.60:
> Test Threads Ops Med
> *Pct95* Avg StdDev Min Max TotalOps [ms/op]
> EllipseTests-fill-false.ser 1 35 299.068
> *299.602 * 299.116 0.297 298.702 300.086 35
> EllipseTests-fill-true.ser 1 25 434.568
> *437.110* 434.871 0.875 434.375 437.897 25
>
> OpenJDK9:
> Test Threads Ops Med
> *Pct95* Avg StdDev Min Max TotalOps [ms/op]
> EllipseTests-fill-false.ser 1 35 295.859
> *296.245* 295.924 0.211 295.542 296.503 35
> EllipseTests-fill-true.ser 1 25 491.937
> *492.165* 491.936 0.193 491.662 492.591 25
>
> Ductus on JDK1.8.60:
> Test Threads Ops Med *Pct95* Avg StdDev Min Max
> TotalOps [ms/op]
> EllipseTests-fill-false.ser 1 35 297.560
> *299.328 * 297.480 1.093 295.417 299.590 35
> EllipseTests-fill-true.ser 1 25 453.612
> *456.290 * 453.589 1.813 448.936 456.817 25
>
> Conclusion:
> The new patch seems promising as it is very close to ductus performance.
> Filling ellipse seems slower on OpenJDK9 (492 / 437 = 12% slower) ! Any
> MaskFill changes ?

Not that I'm aware of. Maybe Hotspot changes? Phil?

...jim

From james.graham at oracle.com Thu Sep 24 01:10:56 2015
From: james.graham at oracle.com (Jim Graham)
Date: Wed, 23 Sep 2015 18:10:56 -0700
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To:
References: <55FB50A3.8070806@oracle.com>
Message-ID: <56034DA0.5040001@oracle.com>

Some thoughts - we record some info on each scanline - mostly about the new edges that are added. Perhaps we could keep deltas of how many edges come and go per scanline and then sum them up at the start to figure out if any scanline has a lot of crossings?

One slight optimization in the non-tileflag version of CopyRLE.

    for (i, from, to) {
        int delta;
        if ((delta = alphaRow[i]) != 0) {
            cache.add(val);
            val += delta;
            // Range Check only needed here?
            runLen = 1;
            alphaRow[i] = 0; // Optional - avoids clear later?
        } else {
            runLen++;
        }
    }

It avoids prev and having to add "prev + delta" for the very common case of delta == 0.
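
A runnable Java sketch of the loop suggested above, under stated assumptions: the class name is hypothetical, the cache is replaced by a plain list of {alpha, runLength} pairs, and `cache.add(val)` is read as "flush the run that just ended together with its length". It is not the Marlin code.

```java
import java.util.ArrayList;
import java.util.List;

class RleDeltaSketch {
    /**
     * Encodes alphaRow[from, to) as {alpha, runLength} pairs, where alphaRow
     * holds per-pixel alpha deltas (assumed layout). The row is zeroed as it is read.
     */
    static List<int[]> encode(int[] alphaRow, int from, int to) {
        List<int[]> runs = new ArrayList<>();
        int val = 0;     // current accumulated alpha
        int runLen = 0;  // length of the run carrying val
        for (int i = from; i < to; i++) {
            int delta = alphaRow[i];
            if (delta != 0) {
                if (runLen > 0) {
                    runs.add(new int[] {val, runLen}); // flush the previous run
                }
                val += delta;    // only touched when the alpha actually changes
                runLen = 1;
                alphaRow[i] = 0; // avoids a separate clearing pass later
            } else {
                runLen++;        // common case: no addition, just extend the run
            }
        }
        if (runLen > 0) {
            runs.add(new int[] {val, runLen}); // flush the trailing run
        }
        return runs;
    }
}
```

For the delta row {0, 0, 3, 0, 0, -3, 0}, this produces the runs (0 x 2), (3 x 3), (0 x 2), and the run lengths add up to the row width.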

Also, RLE tends to be more useful if the index of the values is larger than your data storage unit - which is why Pisces used RLE when it was on embedded since they were using bytes to store the alpha caches, but shapes could be larger than 256 units. You seem to be using integers which means there is no run long enough to require having to break it up into multiple segments, you can just store the horizontal index of the next change of value and its value. This also means you don't have to sum up counts to figure out where a partial row starts, you just scan for the first index that is in range (remembering the previous alpha value to be used for the beginning of the range)...

...jim

From Sergey.Bylokhov at oracle.com Thu Sep 24 02:17:41 2015
From: Sergey.Bylokhov at oracle.com (Sergey Bylokhov)
Date: Thu, 24 Sep 2015 05:17:41 +0300
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To:
References: <55FB50A3.8070806@oracle.com>
Message-ID: <56035D45.9060909@oracle.com>

On 22.09.15 0:15, Laurent Bourgès wrote:
> Conclusion:
> The new patch seems promising as it is very close to ductus performance.
> Filling ellipse seems slower on OpenJDK9 (492 / 437 = 12% slower) ! Any
> MaskFill changes ?

For such checks I suggest using JMH + "prof perfasm". It provides really good information per Java method (before/after compilation), including the assembly, and the log also includes the native methods.
An example looks like this:
http://cr.openjdk.java.net/~shade/jmh/perfasm-sample.log

http://openjdk.java.net/projects/code-tools/jmh

It is really useful in java2d because it is sometimes unclear where the problem occurs (Java, native code, new objects, etc.), and any Java profiler can change the behavior of the application.

--
Best regards, Sergey.

From bourges.laurent at gmail.com Thu Sep 24 14:59:44 2015
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Thu, 24 Sep 2015 16:59:44 +0200
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To: <56035D45.9060909@oracle.com>
References: <55FB50A3.8070806@oracle.com> <56035D45.9060909@oracle.com>
Message-ID:

Sergey,

I managed to create a new benchmark with JMH + the perfasm profiler:
http://cr.openjdk.java.net/~lbourges/jmh/ellipse_fill/

See MyBenchMark.java, which fills an ellipse with radius in {"100", "500", "900", "1400"}.

I tested with both Oracle JDK8 and Oracle JDK9 EA b81, i.e. using the Ductus rendering engine:
http://cr.openjdk.java.net/~lbourges/jmh/ellipse_fill/bench_jdk8.log
http://cr.openjdk.java.net/~lbourges/jmh/ellipse_fill/bench_jdk9.log

JDK8:
Benchmark                (size)  Mode  Cnt   Score   Error  Units
MyBenchmark.fillEllipse     100  avgt    3   0,207 ± 0,034  ms/op
MyBenchmark.fillEllipse     500  avgt    3   1,931 ± 0,112  ms/op
MyBenchmark.fillEllipse     900  avgt    3   5,158 ± 0,346  ms/op
MyBenchmark.fillEllipse    1400  avgt    3   9,628 ± 1,321  ms/op

JDK9:
Benchmark                (size)  Mode  Cnt   Score   Error  Units
MyBenchmark.fillEllipse     100  avgt    3   0,223 ± 0,005  ms/op
MyBenchmark.fillEllipse     500  avgt    3   2,069 ± 0,044  ms/op
MyBenchmark.fillEllipse     900  avgt    3   5,393 ± 0,285  ms/op
MyBenchmark.fillEllipse    1400  avgt    3  12,305 ± 0,104  ms/op

JDK9 is slower ~ 10% in this test.

I tried to interpret the profiler info, but I just noticed that the hotspots are located in native code (libawt.so):

JDK8:
....[Hottest Regions]...............................................................................
 48,53%   51,78%  [0x7f78197f9ae1:0x7f78197f9b27] in IntArgbPreSrcMaskFill (libawt.so)
 11,27%   11,68%  [0x7f78197f9900:0x7f78197f9aa6] in IntArgbPreSrcMaskFill (libawt.so)
  9,91%   11,58%  [0x7f7813bc6527:0x7f7813bc65bd] in writeAlpha8 (libdcpr.so)
  6,51%    2,73%  [0x7f7813bc5471:0x7f7813bc560a] in processJumpBuffer; processSubBufferInTile (libdcpr.so)
  2,13%    2,16%  [0x7f7813bc6436:0x7f7813bc6506] in writeAlpha8 (libdcpr.so)

JDK9:
....[Hottest Regions]...............................................................................
 61,90%   66,72%  [0x7f71ae7f5678:0x7f71ae7f5837] in IntArgbPreSrcMaskFill (libawt.so)
 10,06%    5,40%  [0x7f71acb0aa77:0x7f71acb0afa9] in processJumpBuffer; processSubBufferInTile; reset.isra.4 (libdcpr.so)
  9,23%   10,45%  [0x7f71acb0bb68:0x7f71acb0bc7d] in writeAlpha8 (libdcpr.so)

So this test is using the software pixel loop [IntArgbPreSrcMaskFill].

I looked at the source code and compared libawt / java2d / loops / vis_IntArgbPre_Mask.c from openjdk8 and openjdk9, but they are the same!

Can it be a JNI issue or a compilation issue (gcc settings ...) with that native code?

Any idea, Sergey?

Thanks for the tips,
Laurent

From james.graham at oracle.com Thu Sep 24 17:26:34 2015
From: james.graham at oracle.com (Jim Graham)
Date: Thu, 24 Sep 2015 10:26:34 -0700
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To:
References: <55FB50A3.8070806@oracle.com> <56035D45.9060909@oracle.com>
Message-ID: <5604324A.3070307@oracle.com>

Hi Laurent,

You are looking at the wrong loop. It's tough to explain...

vis_*.c are only ever compiled or used on Solaris. They convince the compiler to emit Sparc's version of MMX instructions. They are not even compiled on any other build except for Solaris. You were probably confused because they look like the implementations of the functions you were looking for and you never saw any other implementation of that function.

That's because all of the software loops are actually constructed using a very complicated system of Macros. If you look at loops/IntArgbPre.c you will see a bunch of macro calls at the top which expand to declaring the functions such as "IntArgbPreSrcMaskFill". Then you will see a structure with a bunch of Macro invocations in it which expand to declaring a structure describing the loops, one per loop function. Then you will see a bunch more macro invocations, one per line, which surprisingly expand to entire functions for each one of them.

You'd have to do some serious tracing of macros to see what the code looks like, but most of the macros expand from either IntArgb.h or LoopMacros.h...

...jim

From james.graham at oracle.com Thu Sep 24 17:28:44 2015
From: james.graham at oracle.com (Jim Graham)
Date: Thu, 24 Sep 2015 10:28:44 -0700
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To:
References: <55FB50A3.8070806@oracle.com> <56035D45.9060909@oracle.com>
Message-ID: <560432CC.7060904@oracle.com>

As far as why the software loops are slower...

Did any command line options change for compiling IntArgbPre.c?
Touch the file and rebuild and verify if the compiler options are the same (and that both builds use the same compiler)...

...jim

From bourges.laurent at gmail.com Thu Sep 24 19:14:09 2015
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Thu, 24 Sep 2015 21:14:09 +0200
Subject: [OpenJDK Rasterizer] Marlin #4
In-Reply-To: <560432CC.7060904@oracle.com>
References: <55FB50A3.8070806@oracle.com> <56035D45.9060909@oracle.com> <560432CC.7060904@oracle.com>
Message-ID:

Jim,

> As far as why the software loops are slower...
>
> Did any command line options change for compiling IntArgbPre.c? Touch the file and rebuild and verify if the compiler options are the same (and that both builds use the same compiler)...

To avoid all possible side effects, I deliberately tested the Oracle JDK (EA binary builds). I hope both were compiled with the same gcc and options.

You're right: I guessed there were C macros, so I compared both java2d/ folders too (including header files).

I can try compiling both the openjdk 8 and 9 sources on my laptop, but it may take some time to test again.

PS: I think it is better for me to focus on improving patch 4.

Laurent