[OpenJDK 2D-Dev] Pisces AA renderer performance improvement.

Sat Aug 6 00:30:37 UTC 2011

Hi Denis,

I've finally started wrapping my head around this.  It's a nice new take 
on reorganizing the work and I'm intrigued by the fact that you said 
that this is getting close to beating Ductus.

Some comments:

- Side note - I could repurpose the previous incarnation of the file for 
FX rendering with just a few minor modifications - it looks like those 
days are over now, though, and I'll have to massage it quite a bit more 
to share the code... (Frowny face for me)

- I'd like to see the Renderer cache implemented in the RenderingEngine 
code since it is outside the scope of the work that Renderer does. 
Also, your table could fill up with partially used renderers if there 
are ever any bugs that cause it to throw an exception before it calls 
giveBack().  I'd like to see some code that tries to make sure they are 
always given back even if errors happen (or, at least, they are removed 
from the data structures if we feel that they aren't appropriate to 
reuse after the error).

- Renderer, line 158 - "while (--toadd >= 0)" allows the compiler to use 
the condition codes from the decrement to satisfy the test.

- Why not just have 1 tracker that manages TILE_SIZE<<SUB_POS values 
instead of TILE_SIZE trackers each of which manages SUB_POS values? 
Unfortunately that means that EvenOdd would have to keep an array 
(because 32*8 is more bits than any single primitive type we have), but 
since we already need to have an array of the EvenOdd trackers, we 
aren't any worse off and in fact we are ahead on storage because 
currently you have 32*32 bits per set of trackers (plus overhead for 32 
tracker objects).

Some suggestions for future work:
(Don't need to deal with these now, but I'll throw the ideas on the 
table so they don't get lost)

- or and line don't really need a full 32-bits.  They could share 
storage, or they could be stored in byte[] arrays, or they could even 
share a byte[] array.

- if you store last_subpixy instead of numys then you wouldn't have to 
keep storing the new number back in the arrays - it would become a 
read-only value.

- I think it would be worth it to amortize the removal of dead pointers. 
  All it would take would be to store a NULL in the spot and set a 
boolean, then they could all be collapsed out in one pass at the end. 
To get even more creative, if you sort them the other way around and 
then scan from count=>0 then you could just copy the value "ptrs[i] = 
ptrs[--count]" and let the insertion sort at the start of the next pixel 
row deal with it.

- Another way to manage the merge sort would be to have a temp array of 
SUB_PIX_Y values, quickly store all of the crossings into it either 
forwards or backwards in a pair of tight loops (one for slope >0 and one 
for slope <0) and then do a regular merge sort in a second tight loop. 
If you use the values at [ci,ci+toadd) as the temp array then one test 
could skip the merge sort step entirely because the values would already 
be sorted (and this is likely to happen).  My thoughts were that 2 
tighter loops might be faster than one more complicated loop.

I'm going to send this now before I get too bogged down in figuring out 
the new design of the TileGenerator...

			...jim

On 7/19/2011 2:01 PM, Denis Lila wrote:
> Hi Jim.
>
> We spoke about this a while ago and I started working
> on it. This is what I have so far:
> http://icedtea.classpath.org/~dlila/webrevs/RendererPerf/
>
> My main intention was to remove some stages
> from the renderer (like you suggested) because what
> we were doing was:
> 1. Transform curves to lines and put the lines in a buffer.
> 2. Iterate through scanlines, get crossings with the lines,
>     compute alphas, and store them into an array.
> 3. Compress the alpha array using RLE.
> 4. When asked, use the RLE encoding to generate alpha tiles.
>
> and I was hoping to be able to go directly from curves to
> crossings, and then generate the tiles from the crossing
> data. This would reduce the intermediate storage and would
> move around much less data.
>
> At first, I implemented this with an int[][] where each inner
> array represented the crossings in one pixel row, and the
> crossings themselves packed 3 chunks of data into one int:
> the x coordinate of the crossing, the orientation, the scanline
> within the pixel row where the crossing was located (this was
> a number in [0:1<<SUBPIXEL_LG_POSITIONS_Y) ).
> The crossings needed to be sorted, and I did this with Arrays.sort.
> This was fast in small tests, but for any shape with more than 4 edges
> per pixel row it started having huge scalability problems
> (for larger shapes we were spending 90+% of the time in
> quicksort).
>
> The version in the webrev is the fastest yet, and it uses
> a class similar to the ScanlineIterator for the sorting
> (but now it iterates through pixel rows, not scanlines).
> Unfortunately this means that there still are two levels
> of intermediate storage (edges and crossings, analogous
> to edges and RLE elements in the old version), but the
> only way I can think of how to get rid of one of them
> and still take use an average linear time sort is if we
> start outputting alphas pixelrow by pixelrow, rather than
> in 32x32 tiles.
>
> Anyway, it's much faster than it used to be, so we should
> probably push it (after I implement a cache eviction
> policy).
>
> Any suggestions/criticisms are always welcome.
>
> Thank you,
> Denis.