[OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements

Jim Graham james.graham at oracle.com
Tue Jul 2 02:56:27 UTC 2013


Hi Laurent,
On 5/29/13 2:01 AM, Laurent Bourgès wrote:
> Two comments:
> - HotSpot takes some time to optimize the very large
> Renderer.endRendering() and ScanLineIterator.next() methods: probably
> because these methods represent too much bytecode (many loops ...)
> - I tried merging them but it seems that it becomes slower (too big a
> method? or HotSpot produces less optimized code according to its
> heuristics ... warmup issue?)
> - Do you think it would help HotSpot if such methods were split into
> smaller ones?

Perhaps it is another nudge to consider a native port where we could 
statically compile these critical methods with a high degree of 
optimization?

> - Many float operations are used to compute crossings that could be
> replaced by DDA (integer ops) to improve performance ...

Are you referring to fixed point implementations?  Or something more 
precise?  I'm worried about error accumulation with fixed point 
differencing across a large number of pixels.
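To make the accumulation concern concrete, here is a minimal sketch (not Pisces code; the 16.16 format, the 1/3 slope, and the 4096-scanline span are all made-up assumptions) comparing a fixed-point DDA walk of an edge's x crossing against float differencing:

```java
// Minimal sketch (not Pisces code): stepping an edge's x crossing one
// scanline at a time with 16.16 fixed-point DDA versus float differencing.
// The slope 1/3 and the 4096-scanline span are arbitrary choices.
public class DdaDrift {
    static final int FRAC_BITS = 16;

    public static void main(String[] args) {
        double slope = 1.0 / 3.0;   // exact slope of the edge
        int fixSlope = (int) Math.round(slope * (1 << FRAC_BITS));

        int fixX = 0;     // 16.16 fixed-point accumulator
        float fltX = 0f;  // float accumulator
        final int scanlines = 4096;
        for (int y = 0; y < scanlines; y++) {
            fixX += fixSlope;        // one integer add per scanline
            fltX += (float) slope;   // one float add per scanline
        }
        double exact = slope * scanlines;
        double fixErr = Math.abs(exact - fixX / (double) (1 << FRAC_BITS));
        double fltErr = Math.abs(exact - (double) fltX);
        // The per-step quantization of the slope accumulates linearly in
        // the fixed-point walk: here it reaches 1/48 of a pixel by row 4096.
        System.out.println("fixed-point drift: " + fixErr);
        System.out.println("float drift:       " + fltErr);
    }
}
```

Widening the fraction (e.g. a 32.32 DDA in a long) shrinks the per-step quantization accordingly, at the cost of 64-bit arithmetic.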

> Remaining items:
> - clipping issues (the dasher emits many segments even out of bounds)
> - rounding issues: float to int conversions (bias or not)
>
> Finally I propose to focus on optimizing these 2 algorithms (methods) as
> they represent 55% of CPU time:
> - optimize / modify the algorithms: integer instead of float ops, reduce
> the number of conditional statements (if) to improve branch prediction ...
> - and / or port these methods to C code (JNI)

Porting to C might help with a lot of the integer conversions that 
happen when we try to stuff "all data into floats".

>         I think that the ScanLineIterator class is no longer useful and
>         could be merged into Renderer directly: I tried to optimize these
>         2 code paths (crossing / crossing -> alpha) but it seems quite
>         difficult as I must understand the HotSpot optimizations
>         (assembler code)...
>
>
>     I was originally skeptical about breaking that part of the process
>     into an iterator class as well.
>
>
> I tried to merge these methods into Renderer.endRendering() but the
> benchmark results seem worse: probably HotSpot generates less optimized
> code ...

D'oh!

>         For now I want to keep Pisces in Java code as HotSpot is efficient
>         enough and the algorithm can probably be reworked a bit;
>         a few questions:
>         - should edges be sorted by ymax ONCE to avoid a complete edge
>         traversal to count crossings for each Y value:
>
>            156             if ((bucketcount & 0x1) != 0) {
>            157                 int newCount = 0;
>            158                 for (int i = 0, ecur; i < count; i++) {
>            159                     ecur = ptrs[i];
>         *  160                     if (_edgesInt[ecur + YMAX] > cury) {
>         *  161                         ptrs[newCount++] = ecur;
>
>            162                     }
>            163                 }
>            164                 count = newCount;
>            165             }
>
>
>     This does not traverse all edges, just the edges currently "in play"
>     and it only does it for scanlines that had a recorded ymax on them
>     (count is multiplied by 2 and then optionally the last bit is set if
>     some edge ends on that scanline so we know whether or not to do the
>     "remove expired edges" processing).
>
>
> Agreed, this algorithm is the well-known edge eviction from the active
> edge list => move it into a dedicated method ?

That would depend on the "hotspot thinks this method is too big" issue 
that we don't really fully understand, but it sounds like a good point 
to investigate.
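For reference, the compaction in the quoted snippet factored into a stand-alone method might look like the sketch below; the names (ptrs, edgesInt, the ymax offset) mirror the snippet, but the signature and record layout are assumptions, not the actual Renderer code.

```java
// Sketch of the "remove expired edges" pass as a dedicated method, so a
// small hot loop can be compiled/inlined on its own. Names mirror the
// quoted snippet; the signature and layout are hypothetical.
public class ActiveEdgeList {
    /**
     * Compacts ptrs[0..count) in place, dropping every edge whose ymax
     * does not extend below the current scanline; returns the new count.
     */
    static int evictExpiredEdges(int[] ptrs, int count, int[] edgesInt,
                                 int ymaxOffset, int cury) {
        int newCount = 0;
        for (int i = 0; i < count; i++) {
            int ecur = ptrs[i];
            if (edgesInt[ecur + ymaxOffset] > cury) {
                ptrs[newCount++] = ecur;  // still active on later scanlines
            }
        }
        return newCount;
    }

    public static void main(String[] args) {
        int[] edgesInt = new int[12];  // 3 records of 4 ints, ymax at offset 1
        edgesInt[0 + 1] = 5;           // edge 0: ymax = 5
        edgesInt[4 + 1] = 10;          // edge 1: ymax = 10
        edgesInt[8 + 1] = 3;           // edge 2: ymax = 3
        int[] ptrs = { 0, 4, 8 };
        int count = evictExpiredEdges(ptrs, 3, edgesInt, 1, 4);
        System.out.println("active edges left: " + count);  // edge 2 evicted
    }
}
```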

>         - why multiply the crossings by 2 and then divide by 2 (+ rounding issues)?
>
>            202             for (int i = 0, ecur, j; i < count; i++) {
>            203                 ecur = ptrs[i];
>            204                 curx = _edges[ecur /* + CURX */];
>            205                 _edges[ecur /* + CURX */] = curx +
>         _edges[ecur + SLOPE];
>            206
>         *  207                 cross = ((int) curx) << 1;
>         *  208                 if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
>            209                     cross |= 1;
>
>
>     The line above sets the bottom bit if the crossing is one
>     orientation vs. the other so we know whether to add one or subtract
>     one to the winding count.  The crossings can then be sorted and the
>     orientation flag is carried along with the values as they are
>     sorted.  The cost of this trick is having to shift the actual
>     crossing coordinates by 1 to vacate the LSBit.
>
>
> Two comments:
> - float ops are used to compute the x crossing => use DDA (integer ops)?

Provided you can keep precision high enough to avoid errors.  Also, 
sometimes offloading some of your calculations to a float unit keeps 
added parallelism in the code and moving the instructions back to int 
can actually slow it down.

> - store the orientation flag in a separate byte array to avoid the shift
> operations << 1 and >> 1 (later in the code), but the insertion sort
> would then have to maintain a sorted index mapping ...

You then have added memory accesses and remember that we sort these 
values so you'd have to move the flags along with the crossings.  A 
C-struct that had an int and a flag would avoid the shifts while making 
it reasonable to sort just one array at the cost of storage complexity 
and wasted memory for the flag to keep the structs aligned.
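The packing trick can be illustrated in isolation; the coordinates below are made-up sample values, but the shift / OR / sort / unshift steps are the ones described above.

```java
import java.util.Arrays;

// Stand-alone illustration of the LSB trick: each crossing is shifted left
// one bit and its orientation (+1 or -1 winding delta) is stored in the
// vacated bit, so a single int[] sort keeps each flag with its coordinate.
// The coordinates are made-up sample values.
public class CrossingPack {
    static int pack(int x, boolean upward) {
        return (x << 1) | (upward ? 1 : 0);
    }

    public static void main(String[] args) {
        int[] crossings = {
            pack(30, false), pack(10, true), pack(40, false), pack(20, true)
        };
        Arrays.sort(crossings);  // orientation bit rides along with each value

        int winding = 0;
        for (int c : crossings) {
            int x = c >> 1;                        // recover the coordinate
            winding += ((c & 0x1) != 0) ? 1 : -1;  // apply the winding delta
            System.out.println("x=" + x + " winding=" + winding);
        }
    }
}
```

With these sample values the winding count goes 1, 2, 1, 0 across the four crossings, so the nonzero-winding interior spans x = 10 to x = 40.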

>         Finally, it seems that HotSpot settings (-XX:CompileThreshold=1000
>         and -XX:+AggressiveOpts) are able to compile these hotspots better ...
>
>
>     What if we used the default settings, as most non-server apps would?
>
>
> As my Linux is 64-bit, I can only run the benchmark with the server
> compiler (HotSpot C2) ... I could also test with / without tiered
> compilation (between the client and server compilers) ...

We have to understand the performance on a variety of compilers.

>         Thanks; probably the edgeBucket / edgeBucketCount arrays could
>         be merged
>         into a single one to improve cache affinity.
>
>
>     Interesting.
>
>
> It could be interesting to evaluate the CPU cache affinity of all these
> arrays: maybe 2D arrays could be packed into 1D arrays to improve cache
> efficiency; the same goes for the edge "struct":
> is it better to keep int / float arrays instead of having an Edge object
> (with an instance pool)?

I'm a bit lost with that, but I think we could only know by prototyping 
it.  Also, perhaps we could store floats into int arrays using 
Float.floatToIntBits()/intBitsToFloat() faster than we could cast to 
integers?  Note that those bit-conversion methods are likely handled 
internally in compiled code as simple "casts" (where the compiled code 
doesn't really do anything to "cast" a value).
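A sketch of that idea, assuming a hypothetical packed edge record where int and float fields share one int[]:

```java
// Sketch: keeping float fields in the same int[] record as the int fields
// via Float.floatToRawIntBits / intBitsToFloat, which reinterpret the bits
// (and typically compile to plain moves) instead of converting the value.
// The record layout (YMAX, CURX offsets) is hypothetical.
public class PackedEdge {
    static final int YMAX = 0;  // int field
    static final int CURX = 1;  // float field stored as raw bits

    public static void main(String[] args) {
        int[] edge = new int[2];
        edge[YMAX] = 42;
        edge[CURX] = Float.floatToRawIntBits(3.25f);  // no value conversion

        float curx = Float.intBitsToFloat(edge[CURX]);
        System.out.println("ymax=" + edge[YMAX] + " curx=" + curx);
    }
}
```

Note the contrast with a numeric cast: (int) 3.25f converts the value to 3, while the raw-bits round trip preserves 3.25f exactly.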

>         FYI, I can write C/C++ code but I have never practised JNI code.
>         Could somebody help us port just these 2 hotspot methods?
>
>
>     Port 2 Hotspot methods?  I'm not sure what you are referring to here?
>
>
> I mean implement the 'hotspot' methods, i.e. implement the rendering
> algorithms in C code using JNI to access the Renderer fields.
> Such C code could be compiled with optimizations (gcc -O2) but I have
> doubts that it would be faster than the optimized code produced by the
> HotSpot C2 compiler ...

I tried for JavaFX and it was a grueling process and I ended up with a 
bastardized half-baked port.  I would recommend doing it right by simply 
implementing C code using the Java code as a guide.

			...jim


