[OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

Fri May 10 23:13:11 UTC 2013

Hi Laurent,

On 5/9/13 11:50 PM, Laurent Bourgès wrote:
> Jim,
>
> I think that the ScanLineIterator class is no more useful and could be
> merged into Renderer directly: I try to optimize these 2 code paths
> (crossing / crossing -> alpha) but it seems quite difficult as I must
> understand hotspot optimizations (assembler code)...

I was originally skeptical about breaking that part of the process into 
an iterator class as well.

> For now I want to keep pisces in Java code as hotspot is efficient
> enough and probably the algorithm can be reworked a bit;
> few questions:
> - should edges be sorted by Ymax ONCE to avoid complete edges traversal
> to count crossings for each Y value:
>
>   156             if ((bucketcount & 0x1) != 0) {
>   157                 int newCount = 0;
>   158                 for (int i = 0, ecur; i < count; i++) {
>   159                     ecur = ptrs[i];
> *  160                     if (_edgesInt[ecur + YMAX] > cury) {
> *  161                         ptrs[newCount++] = ecur;
>   162                     }
>   163                 }
>   164                 count = newCount;
>   165             }

This does not traverse all edges, just the edges currently "in play" and 
it only does it for scanlines that had a recorded ymax on them (count is 
multiplied by 2 and then optionally the last bit is set if some edge 
ends on that scanline so we know whether or not to do the "remove 
expired edges" processing).

> - why multiply x2 and divide /2 the crossings (+ rounding issues) ?
>
>   202             for (int i = 0, ecur, j; i < count; i++) {
>   203                 ecur = ptrs[i];
>   204                 curx = _edges[ecur /* + CURX */];
>   205                 _edges[ecur /* + CURX */] = curx + _edges[ecur + SLOPE];
>   206
> *  207                 cross = ((int) curx) << 1;
> *  208                 if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
>   209                     cross |= 1;

The line above sets the bottom bit if the crossing is one orientation 
vs. the other so we know whether to add one or subtract one to the 
winding count.  The crossings can then be sorted and the orientation 
flag is carried along with the values as they are sorted.  The cost of 
this trick is having to shift the actual crossing coordinates by 1 to 
vacate the LSBit.

> - last x pixel processing: could you explain me ?
>   712                                 int pix_xmax = x1 >> SUBPIXEL_LG_POSITIONS_X;
>   713                                 int tmp = (x0 & SUBPIXEL_MASK_X);
>   714                                 alpha[pix_x] += SUBPIXEL_POSITIONS_X - tmp;
>   715                                 alpha[pix_x + 1] += tmp;
>   716                                 tmp = (x1 & SUBPIXEL_MASK_X);
>   717                                 alpha[pix_xmax] -= SUBPIXEL_POSITIONS_X - tmp;
>   718                                 alpha[pix_xmax + 1] -= tmp;

Are you referring to the 2 += and the 2 -= for each end of the span?  If 
an edge crosses in a given pixel at 5 subpixel positions after the start 
of that pixel, then it contributes a coverage of "SUBPIXEL_POS_X minus 
5" in that pixel.  But, starting with the following pixel, the total 
coverage it adds for those pixels until it reaches the right edge of the 
span is "SUBPIXEL_POSITIONS_X".  However, we are recording deltas and 
the previous pixels only bumped our total coverage by "S_P_X - 5".  So, 
we now need to bump the accumulated coverage by 5 in the following pixel 
so that the total added coverage is "S_P_X".

Basically the pair of += lines adds a total S_P_X to the coverage, but 
it splits that sum over two pixels - the one where the left edge first 
appeared and its following pixel.  Similarly, the two -= statements 
subtract a total of S_P_X from the coverage total, and do so spread 
across 2 pixels.  If the crossing happened right at the left edge of the 
pixel then tmp would be 0 and the second += or -= would be wasted, but 
that only happens 1 out of S_P_X times and the cost of testing is 
probably less than just adding tmp to the second pixel even if it is 0.

Also note that we need to have an array entry for alpha[max_x + 1] so 
that the second += and -= don't store off the end of the array.  We 
don't need to use that value since we will stop our alpha accumulations 
at the entry for max_x, but testing to see if the "second pixel delta 
value" is needed is more expensive than just accumulating it into an 
unused array entry.

> Finally, it seems that hotspot settings (CompileThreshold=1000 and
> -XX:aggressiveopts) are able to compile theses hotspots better ...

What about if we use the default settings as would most non-server apps?

> Thanks; probably the edgeBucket / edgeBucketCount arrays could be merged
> into a single one to improve cache affinity.

Interesting.

> FYI, I can write C/C++ code but I never practised JNI code.
> Does somebody could help us to port only these 2 hotspot methods ?

Port 2 Hotspot methods?  I'm not sure what you are referring to here?

> PS: I attend a conference next week (germany) so I will be less
> available to work on code but I will read my emails.

Thanks for the heads up - as long as you don't time out and switch off 
of this project permanently - that would be a shame...  :(

			...jim