[OpenJDK Rasterizer] RFR: Marlin renderer #2

Laurent Bourgès bourges.laurent at gmail.com
Fri Jun 5 21:18:47 UTC 2015


Dear Jim,

Thanks for your detailed review; it took me some time to make the
appropriate fixes and publish a new webrev, including whitespace changes
(-b):

http://cr.openjdk.java.net/~lbourges/marlin/marlin-s2.1/

Please read my comments below:

> Let me see if I understand up front what the caching needs are.
>
> You have various places that use (often growable) int[], float[] or
byte[] and you want to share arrays among them of various sizes.  Some of
the arrays are used by code that can tolerate dirty data in the array it
gets, others require the array to be clean before they can use it.
Currently you avoid sharing arrays between code that can use dirty arrays
and code that needs clean arrays?  Did I miss anything?

My main concern is not really caching but reusing arrays across thousands
of RenderingEngine invocations; that's why I sized the initial arrays in
Marlin to handle 99% of cases = 1 RendererContext (~512 KB) referenced by a
Soft or Weak Reference (TL / CLQ).
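
Concretely, the thread-local mode looks roughly like this (a simplified
sketch with hypothetical names; the real code also supports the
ConcurrentLinkedQueue mode and weak references):

    import java.lang.ref.SoftReference;

    // One RendererContext (~512 KB of pre-sized arrays) per rendering thread,
    // held through a SoftReference so the GC can still reclaim it under pressure.
    private static final ThreadLocal<SoftReference<RendererContext>> CTX_TL =
            new ThreadLocal<SoftReference<RendererContext>>();

    static RendererContext getRendererContext() {
        final SoftReference<RendererContext> ref = CTX_TL.get();
        RendererContext ctx = (ref != null) ? ref.get() : null;
        if (ctx == null) {
            ctx = new RendererContext();  // allocates the initial arrays once
            CTX_TL.set(new SoftReference<RendererContext>(ctx));
        }
        return ctx;
    }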

It is more smart array recycling than real caching, and it is very close
to what the GC is made for.
However, doing my own memory management is faster for several reasons:
1/ no GC overhead: many large arrays exceed the TLAB => many young-generation
collections avoided + no stop-the-world pauses
2/ no array cleaning passes (zero-fill of the whole array)
3/ a better growing algorithm, based on buckets of large arrays, that avoids
lots of array allocations / copies

To achieve the best possible performance, I carefully analyzed how Marlin
uses arrays: 1/ growable vs fixed-size arrays, 2/ clean vs dirty arrays. For
the few "clean" arrays (edge buckets + counts, alphaLine, touchedTile), I
only zero-fill the dirty part [min; max] to avoid a complete zero-fill.
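
For example, the partial reset of one of those clean arrays reads roughly
like this (a minimal sketch with hypothetical names, not the exact Marlin
code):

    // Zero-fill only the range that was actually written during rendering;
    // touchedMin / touchedMax are assumed to track the dirty part of alphaLine.
    static void resetAlphaLine(final int[] alphaLine,
                               final int touchedMin, final int touchedMax) {
        if (touchedMax > touchedMin) {
            java.util.Arrays.fill(alphaLine, touchedMin, touchedMax, 0);
        }
    }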

Of course, it happens that the initial arrays are too small (large image,
huge coverage or very complex shape).
I implemented Byte/Int/Float ArrayCache classes to handle such larger arrays
in the same manner: get() and put().
These caches represent a larger memory footprint, so they are gathered in
the ArrayCachesHolder referenced by a Weak Reference.

To maximize the reuse ratio and reduce the memory footprint, I adopted a
bucket approach and a simple growing factor = 2 or 4 (byte[] only):
INFO: arraySize[0]: 4096
INFO: arraySize[1]: 16384
INFO: arraySize[2]: 65536
INFO: arraySize[3]: 262144
INFO: dirty arraySize[0]: 65536
INFO: dirty arraySize[1]: 131072
INFO: dirty arraySize[2]: 262144
INFO: dirty arraySize[3]: 524288

Note: larger arrays are never cached; we let the GC handle them.
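
As an illustration only (hypothetical helper, not the exact Marlin code),
the bucket lookup for the clean caches listed above could look like:

    // Maps a requested length to a cache bucket, assuming the clean bucket
    // sizes 4096, 16384, 65536, 262144 (growing factor 4); -1 means the array
    // is too large to be cached and is allocated directly (GC reclaims it).
    static int getCacheBucket(final int length) {
        int size = 4096;
        for (int i = 0; i < 4; i++) {
            if (length <= size) {
                return i;
            }
            size <<= 2; // next bucket = previous size * 4
        }
        return -1;
    }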

Of course, I checked Marlin's code (in insane tests) to validate
consistency: clean arrays are never mixed with dirty arrays when they are
obtained from / returned to their corresponding caches.

I agree the RendererContext API is very repetitive, but it is needed to
support all these primitive array types efficiently (no generics over
primitive types).
I just fixed the naming convention:
- get[Dirty] Byte/Int/Float ArrayCache()
- get[Dirty] Byte/Int/Float Array()
- widen[Dirty] Byte/Int/Float Array()
- put[Dirty] Byte/Int/Float Array()
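
For illustration, the pattern applied to the int case gives the following
shape (a sketch of the convention only; IntArrayPool is a hypothetical
grouping and the actual parameters may differ):

    interface IntArrayPool {
        // clean variant: arrays handed out are guaranteed to be zero-filled
        int[] getIntArray(int length);
        int[] widenIntArray(int[] array, int usedSize, int neededSize);
        void putIntArray(int[] array, int fromIndex, int toIndex);

        // dirty variant: arrays handed out may contain stale data
        int[] getDirtyIntArray(int length);
        int[] widenDirtyIntArray(int[] array, int usedSize, int neededSize);
        void putDirtyIntArray(int[] array);
    }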

If you have more questions, please tell me!

> ArrayCache @104: you don't check if oversize or resizeDirtyInt are
non-zero?

Fixed.

> ArrayCache.get*Bucket() - you decided not to do the shift method?

I gave up: do you think it is worth it, as there are only 4 buckets (i.e.
only a few if tests)? Do you have an implementation to propose?

> ArrayCache.get*Bucket() - a for loop 1=>NUMBER requires the runtime to do
bounds checking on the array access.  If you use 1=>array.length then it
can see/prove that you cannot generate OOB indices and it can elide the
checks.

Fixed.
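
For reference, the loop shape that lets HotSpot prove the indices are
always in range (illustrative, with placeholder names):

    static int findBucket(final int[] bucketSizes, final int length) {
        // comparing i against bucketSizes.length (instead of a separate NUMBER
        // constant) lets the JIT prove the index is in range and skip the check
        for (int i = 0; i < bucketSizes.length; i++) {
            if (length <= bucketSizes[i]) {
                return i;
            }
        }
        return -1;
    }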

>
> ArrayCache.getNewSize() will return 0 if fed curSize=1.  That may not
happen in practice, but it's a dangerous failure mode to leave around.

It is a bit tricky, but 0 is not dangerous here as 0 < MIN_ARRAY_SIZE (=
4096): it means that the first bucket will be used to get the new array =>
int[4096].

    final int[] res = getIntArray(getNewSize(usedSize));
    // => getIntArrayCache(0).getArray();  // first bucket used

>
> ArrayCache.BUCKET_GROW_N - I'm finding that these constants obscure what
they are doing possibly because of their names.  I had to look and see how
they are used to figure out what they "meant" and then when I was looking
at how they are used I was thinking - you'd have to use a 1 here or else it
won't work which defeated the purpose of having a constant since there is
only one value that logically works in some of the places.  I'd find the
code easier to read with hard-coded numbers in most of the places these
constants are used.  There is no bug here, but it made the code harder to
understand.

OK, I understand: I removed these constants and replaced them by 1 or 2 to
be more readable.

> ByteArrayCache.check() - (I realize that you didn't modify this code, but
it was in the context diffs and I noticed it) - the only way for empty to
be set to false is in the test in the first loop.  Why not just embed the
"if (!empty)" code into that conditional test in the loop and eliminate the
empty variable?

Fixed: shorter code looks better.

> CollinearSimplifier - why the switch from an enum to numeric constants?
Also, why not use a switch statement instead of a cascade of if tests?

Fixed: I now use an integer-based switch.


> CollinearSimplifier.closePath() - if you remember the last move
coordinates then you can first emit lineTo(movex, movey) to give it a
chance to make that closing segment colinear, then do the
emit/delegate.close.

> closePath - technically you could leave this as PREV_POINT with pxy1
being the movexy.

Postponed, as it seems to be a minor corner case.


> quadTo/curveTo - you can leave the state as PREV_POINT with the final
point of the curve as px1,py1.  Otherwise any line following a curve is
never simplified.

Fixed for both methods.


> getSlope() - comparing y2 == y1 and x2 > x1 both require subtracting one
of them from the other.  It might make a slight improvement to start with
"float ydiff;  if ((ydiff = y2 - y1) == 0)" since a compare to 0 following
a subtract should just be a test of the condition codes.  On the other
hand, this might make the code harder to read and quite possibly the
compiler would notice that you need the difference later and do this
optimization for you.

Fixed.
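
The new shape is roughly the following (a sketch matching the description
above, not a verbatim copy of the patch):

    static float getSlope(final float x1, final float y1,
                          final float x2, final float y2) {
        // compute the difference once; comparing it to 0 then only tests the
        // condition codes set by the subtraction
        final float dy = y2 - y1;
        if (dy == 0.0f) {
            // horizontal segment: the signed infinity keeps the direction
            return (x2 > x1) ? Float.POSITIVE_INFINITY : Float.NEGATIVE_INFINITY;
        }
        return (x2 - x1) / dy;
    }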

> Curve line 251 - indentation so that the "final float[]"s match up?

Fixed.

> FloatArrayCache - see ByteArrayCache above

Fixed too.

> FloatMath.java, line 144 - does "((~intpart) >>> 31)"  (unsigned shift)
work as well?

Fixed.
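
For context, the expression in question is just a branch-free sign test
(illustrative helper, hypothetical name):

    // ((~intpart) >>> 31) evaluates to 1 when intpart >= 0 and to 0 when
    // intpart < 0, i.e. the branch-free form of (intpart >= 0 ? 1 : 0)
    static int isNonNegative(final int intpart) {
        return (~intpart) >>> 31;
    }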

> FloatMath - for the comments, "0" is considered neither negative nor
positive, so you are technically not identifying what happens with 0 since
you only describe what happens for positive and negative numbers.
"Non-negative" means "0 and all positive numbers".

Fixed comments.

> Helpers - there are a few "/2", "/3", and "/27" that didn't get converted
to double constants?

Fixed.

> Helpers - there are a number of /2.0f that you changed to /2f, but how
about *0.5f instead since multiplication is often faster than division.

I tried to replace such division cases by multiplication but I did not get
any performance gain: certainly hotspot is doing a great job in optimizing
maths => I replaced all "tricky" multiplications by power of 2 divisions to
be consistent in Marlin, more mathematically correct and explicit.

Maybe / 2f or / 4f can be compiled down to a shift of the exponent part in
the generated assembly code (see scalb)?
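
For what it is worth, Java already exposes such exponent scaling directly
(just an observation, not something I changed):

    // Math.scalb adjusts the exponent directly: for finite values, x / 2f
    // gives the same result as Math.scalb(x, -1), and x / 4f as Math.scalb(x, -2).
    static float halve(final float x) {
        return Math.scalb(x, -1);
    }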

> IntArrayCache - see ByteArrayCache above

Fixed too.

> MarlinRenderingengine, line 545,546 - here you changed a "*.5" into a
"/2", but the multiply is usually faster than the divide.

Not faster in my benchmarks.

> MergeSort - has no changes in the Sdiffs?

There was an indentation issue at line 120, but the webrev diff ignores it!

> Renderer, line 219 (and a few others later in the file) - *.25 may be
faster than /4

Not faster in my benchmarks.

> Renderer, line 401 et al - assuming that the initial size of the array is
larger than SIZEOF_EDGE_BYTES otherwise edgePtr<<1 may not be large enough
to add SIZEOF_EDGE_BYTES.  Probably OK in practice for now, but could be a
bug waiting to happen later if someone gets fancy with how the _edges
array is initialized.

I agree it looks tricky, as it assumes that _edges.length is always >
SIZEOF_EDGE_BYTES. I could add a comment?

> Renderer.dispose (and I've seen this in other places) - you clean the
arrays after you've freed and replaced them with their initial values - it
seems kind of odd.  Also, in most of these places the comment says "keep
... dirty", but you are about to clean them so it is the opposite of
"keeping them dirty", so the comment seems to be backwards (I forget what
other file I first noticed this, but it is a comment that has been
cut/pasted in other places too).

Agreed: I moved the cleanup before returning the arrays and fixed the
comment: force zero-fill of dirty arrays (if doCleanDirty); doCleanDirty is
only used for debugging purposes!
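
The new ordering in dispose() then looks roughly like this (a simplified
sketch with hypothetical field names and sizes):

    private int[] crossings = new int[1024];       // hypothetical initial size
    private final int[] crossings_initial = crossings;

    void disposeCrossings(final RendererContext rdrCtx,
                          final boolean doCleanDirty, final int used) {
        // 1/ zero-fill the used part first, only when the debug flag is set
        if (doCleanDirty) {
            java.util.Arrays.fill(crossings, 0, used, 0);
        }
        // 2/ then return any grown array to its cache and restore the initial one
        if (crossings != crossings_initial) {
            rdrCtx.putDirtyIntArray(crossings);
            crossings = crossings_initial;
        }
    }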

> Renderer.java, line 1260ish - the new way you set spMaxY has it
potentially be 1 larger than it used to be...?

No, I checked again: it is written differently but is equivalent. I had
probably misnamed the local variable _boundsMaxY = boundsMaxY - 1; I renamed
it to _boundsMaxYm1 to be clearer.

> RendererContext - too many methods with the words "dirty"
"byte/int/float", "array", and "cache" in them in different orders. They
all blend together name-wise and I can't determine what each is supposed to
do in order to verify that this all makes sense.  We need better
nomenclature and/or packaging for these facilities, but I don't have any
suggestions off the top of my head other than, for now, please review if
the names of all of them follow some pattern, fix any that don't follow the
pattern and then document the pattern in the code so that hopefully it will
explain itself so I can review it in an informed manner.

I renamed a few methods to get a consistent naming convention: see the
introduction above.

> Stroker - more *.5 converted to /2 (and higher powers of 2 as well)

Not faster in my benchmarks.

> Stroker - the commented-out somethingTo still uses hardcoded
emit(...true/false) methods.

I carried my changes over so that it compiles, but left it commented out /
untested.

> Stroker.curveTo() - (just observing) we go to all the trouble of storing
known vars into the middle array, but then we access those known vars from
the array in the subsequent code.  But, we know what values we stored there
so wouldn't it be faster just to use the original values, rather than fetch
them back out of an array?  The array in this case serves only to obscure
what data we are computing with.  Although, arguably, we eventually pass
the mid/middle array to other functions so we do still need those values
stored in the array (unless those functions can be converted from array
storage to direct values as well?).  This is just an observation, but not a
problem.

I agree that this part of the code is obscure to me too: I was wondering
what the objective is... But the mid/middle array is used by
findSubdivPoints(mid), breakPtsAtTs(mid) and computeOffsetCubic(mid), so I
did not change the working code much!

If you think we could gain some performance, I may try rewriting that
part... but I think eliminating collinear segments first is more effective.

> Stroker line 1079,1155 - why move the declaration of curCurveOff outside
the loop?

Reverted (even though I prefer declarations outside loops).

PS: if you prefer, I can replace all divisions by multiplications, but it
should be consistent across the code, even though it has no impact in my
benchmarks!

Thanks again for your review and your time,

Regards,

Laurent