[OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements
Laurent Bourgès
bourges.laurent at gmail.com
Wed May 29 09:01:45 UTC 2013
Andrea, Jim and java2d members,
I am getting back to my work on the java2d pisces improvements (scanline
rendering optimization).
Two comments:
- hotspot takes some time to optimize the very large
Renderer.endRendering() and ScanLineIterator.next() methods, probably
because these methods represent too much bytecode (many loops ...)
- I tried merging them, but the result seems slower (method too big?
or hotspot produces less optimized code according to its heuristics ...
warmup issue?)
- Do you think it would help hotspot if such methods were split into
smaller ones?
- Many float operations used to compute crossings could be replaced by a
DDA (integer ops) to improve performance (see the sketch after this list) ...
- does anybody have other ideas to improve algorithm efficiency, e.g.
winding rule handling (<< 1, >> 1, bit mask), edge eviction from the
active edge list, or float rounding operations?
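To illustrate the DDA idea, here is a minimal sketch assuming a 16.16
fixed-point representation; the names (FixedPointDDA, fixX, fixSlope) are
illustrative, not actual Pisces fields:

    final class FixedPointDDA {
        static final int FIX_SHIFT = 16;
        static final float FIX_ONE = 1 << FIX_SHIFT;

        // Compute the integer x crossing of one edge for each scanline
        // in [y0, y1), using only integer adds/shifts in the inner loop.
        static void crossings(float x0, float dxdy, int y0, int y1, int[] out) {
            int fixX = (int) (x0 * FIX_ONE);              // current crossing, 16.16
            final int fixSlope = (int) (dxdy * FIX_ONE);  // dx per scanline, 16.16
            for (int y = y0; y < y1; y++) {
                out[y - y0] = fixX >> FIX_SHIFT;          // integer part only
                fixX += fixSlope;                         // integer DDA step
            }
        }
    }

The float-to-fixed conversion happens once per edge, so the per-scanline
work becomes a single integer add plus a shift.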
Moreover, could anybody help me work on pisces (code, tests, benchmarks),
or at least review the last patch?
Remaining items:
- clipping issues (the dasher emits many segments even out of bounds)
- rounding issues: float to int conversions (biased or not; see the sketch
below)
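As a minimal illustration of the bias question (nothing Pisces-specific;
SUBPIXEL_POSITIONS_X is assumed to be the per-pixel subpixel count used
elsewhere in this thread):

    final class Rounding {
        static final int SUBPIXEL_POSITIONS_X = 8;  // assumed value

        static void demo() {
            int a = (int)  1.7f;  // ==  1 (a cast truncates toward zero)
            int b = (int) -1.7f;  // == -1, whereas floor(-1.7f) == -2
            float x = -0.3f;
            // floor semantics keep the conversion consistent across zero:
            int px = (int) Math.floor(x * SUBPIXEL_POSITIONS_X);
            System.out.println(a + " " + b + " " + px);
        }
    }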
Finally I propose to focus on optimizing these 2 algorithms (methods) as
they represent 55% of the cpu time:
- optimize / modify the algorithms: integer instead of float ops, reduce
the number of conditional statements (if) to improve branch prediction ...
- and / or port these methods to C code (JNI)
Here are my comments:
>> I think that the ScanLineIterator class is no longer useful and could be
>> merged into Renderer directly: I am trying to optimize these 2 code paths
>> (crossing / crossing -> alpha) but it seems quite difficult as I must
>> understand the hotspot optimizations (assembler code)...
>>
>
> I was originally skeptical about breaking that part of the process into an
> iterator class as well.
>
I tried to merge these methods into Renderer.endRendering() but the
benchmark results seem worse: probably hotspot generates less optimized
code ...
> For now I want to keep pisces in Java code as hotspot is efficient
>> enough and probably the algorithm can be reworked a bit;
>> a few questions:
>> - should edges be sorted by Ymax ONCE to avoid a complete edge traversal
>> to count crossings for each Y value:
>>
>> 156 if ((bucketcount & 0x1) != 0) {
>> 157 int newCount = 0;
>> 158 for (int i = 0, ecur; i < count; i++) {
>> 159 ecur = ptrs[i];
>> * 160 if (_edgesInt[ecur + YMAX] > cury) {
>> * 161 ptrs[newCount++] = ecur;
>>
>> 162 }
>> 163 }
>> 164 count = newCount;
>> 165 }
>>
>
> This does not traverse all edges, just the edges currently "in play" and
> it only does it for scanlines that had a recorded ymax on them (count is
> multiplied by 2 and then optionally the last bit is set if some edge ends
> on that scanline so we know whether or not to do the "remove expired edges"
> processing).
>
Agreed, this algorithm is the well-known edge eviction from the active
edge list => move it into a dedicated method (a sketch follows)?
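A minimal sketch of such a dedicated method, hoisting the quoted
compaction loop out of endRendering(); the names (_edgesInt, YMAX, ptrs)
mirror the quoted snippet and the method would live inside Renderer:

    // remove edges whose YMAX has been passed, compacting ptrs in place
    private int removeExpiredEdges(final int[] ptrs, final int count,
                                   final int cury) {
        int newCount = 0;
        for (int i = 0; i < count; i++) {
            final int ecur = ptrs[i];
            if (_edgesInt[ecur + YMAX] > cury) {  // still active below cury
                ptrs[newCount++] = ecur;          // keep it
            }
        }
        return newCount;
    }

The caller would keep the (bucketcount & 0x1) != 0 test Jim describes, so
the pass only runs on scanlines where some edge actually ends.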
> - why multiply the crossings by 2 and divide them by 2 (+ rounding issues) ?
>>
>> 202 for (int i = 0, ecur, j; i < count; i++) {
>> 203 ecur = ptrs[i];
>> 204 curx = _edges[ecur /* + CURX */];
>> 205 _edges[ecur /* + CURX */] = curx + _edges[ecur +
>> SLOPE];
>> 206
>> * 207 cross = ((int) curx) << 1;
>> * 208 if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
>> 209 cross |= 1;
>>
>
> The line above sets the bottom bit if the crossing is one orientation vs.
> the other so we know whether to add one or subtract one to the winding
> count. The crossings can then be sorted and the orientation flag is
> carried along with the values as they are sorted. The cost of this trick
> is having to shift the actual crossing coordinates by 1 to vacate the LSBit.
Two comments:
- float ops are used to compute the x crossing => use a DDA (integer ops ?)
- store the orientation flag in a separate byte array to avoid the shift
operations << 1 and >> 1 (later in the code), but the insertion sort would
then have to be smart enough to provide a sorted index mapping ...
(a sketch of the current LSB packing follows)
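For reference, a minimal sketch of the LSB trick as Jim describes it
(names are illustrative):

    final class CrossingPacking {
        // vacate the LSBit and carry the orientation flag in it
        static int pack(float curx, boolean downward) {
            final int cross = ((int) curx) << 1;
            return downward ? (cross | 1) : cross;
        }
        // after sorting the packed ints, recover coordinate and winding:
        static int x(int cross)   { return cross >> 1; }
        static int dir(int cross) { return ((cross & 1) != 0) ? 1 : -1; }
    }

A separate byte array would drop the two shifts, but as noted above the
insertion sort would then have to produce an index mapping so the flags
stay aligned with the sorted crossings.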
>
> - last x pixel processing: could you explain it to me ?
>> 712 int pix_xmax = x1 >>
>> SUBPIXEL_LG_POSITIONS_X;
>> 713 int tmp = (x0 & SUBPIXEL_MASK_X);
>> 714 alpha[pix_x] +=
>> SUBPIXEL_POSITIONS_X - tmp;
>> 715 alpha[pix_x + 1] += tmp;
>> 716 tmp = (x1 & SUBPIXEL_MASK_X);
>> 717 alpha[pix_xmax] -=
>> SUBPIXEL_POSITIONS_X - tmp;
>> 718 alpha[pix_xmax + 1] -= tmp;
>>
>
> Are you referring to the 2 += and the 2 -= for each end of the span? If
> an edge crosses in a given pixel at 5 subpixel positions after the start of
> that pixel, then it contributes a coverage of "SUBPIXEL_POS_X minus 5" in
> that pixel. But, starting with the following pixel, the total coverage it
> adds for those pixels until it reaches the right edge of the span is
> "SUBPIXEL_POSITIONS_X". However, we are recording deltas and the previous
> pixels only bumped our total coverage by "S_P_X - 5". So, we now need to
> bump the accumulated coverage by 5 in the following pixel so that the total
> added coverage is "S_P_X".
>
> Basically the pair of += lines adds a total S_P_X to the coverage, but it
> splits that sum over two pixels - the one where the left edge first
> appeared and its following pixel. Similarly, the two -= statements
> subtract a total of S_P_X from the coverage total, and do so spread across
> 2 pixels. If the crossing happened right at the left edge of the pixel
> then tmp would be 0 and the second += or -= would be wasted, but that only
> happens 1 out of S_P_X times and the cost of testing is probably less than
> just adding tmp to the second pixel even if it is 0.
>
> Also note that we need to have an array entry for alpha[max_x + 1] so that
> the second += and -= don't store off the end of the array. We don't need
> to use that value since we will stop our alpha accumulations at the entry
> for max_x, but testing to see if the "second pixel delta value" is needed
> is more expensive than just accumulating it into an unused array entry.
Thanks for the explanations: I now understand why I should clear more of
the alpha array than I thought: xmax + 2 entries! (see the sketch below)
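A minimal sketch of the delta scheme Jim describes, showing why the alpha
array needs pix_xmax + 2 entries; the constants follow the quoted code,
the helper class itself is illustrative:

    final class CoverageDeltas {
        static final int SUBPIXEL_LG_POSITIONS_X = 3;  // assumed (8 subpixels)
        static final int SUBPIXEL_POSITIONS_X = 1 << SUBPIXEL_LG_POSITIONS_X;
        static final int SUBPIXEL_MASK_X = SUBPIXEL_POSITIONS_X - 1;

        // record one span [x0, x1) (subpixel coords) as per-pixel deltas
        static void addSpan(int[] alpha, int x0, int x1) {
            final int pix_x    = x0 >> SUBPIXEL_LG_POSITIONS_X;
            final int pix_xmax = x1 >> SUBPIXEL_LG_POSITIONS_X;
            int tmp = x0 & SUBPIXEL_MASK_X;
            alpha[pix_x]        += SUBPIXEL_POSITIONS_X - tmp; // partial pixel
            alpha[pix_x + 1]    += tmp;                        // completes S_P_X
            tmp = x1 & SUBPIXEL_MASK_X;
            alpha[pix_xmax]     -= SUBPIXEL_POSITIONS_X - tmp; // close the span
            alpha[pix_xmax + 1] -= tmp;  // may land one slot past pix_xmax
        }

        // recover absolute coverage with a running sum; alpha needs
        // max_x + 2 entries so the last delta has somewhere to land
        static void resolve(int[] alpha, int max_x) {
            int sum = 0;
            for (int x = 0; x <= max_x; x++) {
                sum += alpha[x];
                alpha[x] = sum;  // absolute subpixel coverage per pixel
            }
        }
    }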
>
> Finally, it seems that hotspot settings (-XX:CompileThreshold=1000 and
>> -XX:+AggressiveOpts) are able to compile these hotspots better ...
>>
>
> What if we used the default settings, as most non-server apps would ?
As my linux is 64-bit, I can only run the benchmark with the server
compiler (hotspot c2) ... I could also test with / without tiered
compilation (between the client and server compilers) ...
>
> Thanks; probably the edgeBucket / edgeBucketCount arrays could be merged
>> into a single one to improve cache affinity.
>>
>
> Interesting.
It could be interesting to evaluate the cpu cache affinity of all these
arrays: maybe 2D arrays could be packed into 1D arrays to improve cache
efficiency; the same goes for the edge "struct":
is it better to keep int / float arrays instead of having an Edge object
(with an instance pool)? A sketch of the interleaving idea follows.
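A minimal sketch of the interleaving idea, assuming the two parallel
arrays hold a bucket head pointer and an edge count (names and layout are
hypothetical):

    final class Buckets {
        // [2*i] = head edge pointer, [2*i + 1] = count for bucket i
        private final int[] edgeBuckets;

        Buckets(final int numBuckets) {
            edgeBuckets = new int[2 * numBuckets];
        }

        int head(final int bucket)  { return edgeBuckets[bucket << 1]; }
        int count(final int bucket) { return edgeBuckets[(bucket << 1) + 1]; }
    }

With this layout a bucket's head and count are adjacent in memory, so the
second load very likely hits the same cache line as the first.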
> FYI, I can write C/C++ code but I have never practised JNI.
>> Could somebody help us port only these 2 hotspot methods ?
>>
>
> Port 2 Hotspot methods? I'm not sure what you are referring to here?
I mean implement the 'hotspot' methods, i.e. implement the rendering
algorithms in C code, using JNI to access the Renderer fields (a sketch
of the Java side follows).
Such C code could be compiled with optimizations (gcc -O2) but I doubt it
would be faster than the optimized code produced by the hotspot c2
compiler ...
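To make the idea concrete, a minimal sketch of the Java side only; the
library name and method signatures are hypothetical:

    public final class Renderer {
        static {
            System.loadLibrary("pisces_native");  // assumed library name
        }

        // the two hotspots implemented in C behind JNI; the C side
        // would access the arrays via GetPrimitiveArrayCritical:
        private native void endRenderingNative(float[] edges, int[] edgesInt,
                                               int count, int[] alpha);
        private native int scanlineNextNative(int[] ptrs, int count, int cury);
    }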
Maybe somebody could look at the assembly code generated by hotspot to
ensure these methods are 'well' optimized: I looked at it but I am not
able to evaluate whether it is good (efficient) code.
PS: Andrea, did you run some benchmarks using the last patch ?
Laurent