[OpenJDK Rasterizer] RFR: Marlin renderer #2

Laurent Bourgès bourges.laurent at gmail.com
Thu Jun 18 21:46:17 UTC 2015


Hi Jim,

Here is my detailed answer:

2015-06-17 22:43 GMT+02:00 Jim Graham <james.graham at oracle.com>:

> Hi Laurent,
>
> There were 1 or 2 very minor fixes I suggested in a couple of files that
> you indicated you were going to do, but I didn't see a new webrev.  Is
> there one?  I think the existing array stuff is probably OK and I'll look
> at it again now that I understand the tradeoffs of the current design of
> it, but I wanted to know if there was a newer webrev for me to review based
> on those other minor fixes?
>

Sorry, I did not make a new webrev: I took the time to review the array
cache / reset state code again and tried some simplifications (postponed to #3).

On 6/12/15 5:45 AM, Laurent Bourgès wrote:
>
>>     A general thought - should we have foo_initial arrays if we are
>>     getting everything from a cache?
>>
>> I adopted this approach to maximize performance for the 99% cases: I
>> sized properly the initial arrays and they takes ~ 512Kb per
>> RendererContext: the ArrayCache (Weak) is only used for extreme cases (a
>> bit slower). In march 2013, I started by using the cache for all array
>> accesses and it was 30% slower !
>>
>
> Was the 30% back when we were allocating or clearing all of the arrays?
> Now that we only clear what we use, the initial sizes of the arrays
> shouldn't matter so much.  It would be worth a little extra size overhead
> to not have so many special cases and special handling for the "_initial"
> arrays if the performance is now essentially identical.
>

I agree that the latest webrev added dirty caches for float / int arrays, so
fewer arrays must be zero-filled before being put back in the cache.

It reduced the gap to ~3% on average:
==> marlin_REF.log <==
Threads    4    1    2    4
Pct95    237.848    233.887    238.430    241.226

==> marlin_NO_INITIALS.log <==
Threads    4    1    2    4
Pct95    244.261    241.116    244.028    247.639

However, the performance gap is larger for maps with many shapes: ~6% (4
threads):

REF:
Test                                       Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      TotalOps  [ms/op]
dc_shp_alllayers_2013-00-30-07-00-47.ser   1        25   762.882  764.781  763.136  0.715   762.219  765.110  25
dc_shp_alllayers_2013-00-30-07-00-47.ser   2        50   768.975  774.134  769.167  4.105   764.517  774.750  50
dc_shp_alllayers_2013-00-30-07-00-47.ser   4        100  770.511  775.448  770.448  4.668   765.125  787.473  100

NO_INITIALS:
Test                                       Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      TotalOps  [ms/op]
dc_shp_alllayers_2013-00-30-07-00-47.ser   1        25   808.793  810.227  808.959  0.752   807.831  810.871  25
dc_shp_alllayers_2013-00-30-07-00-47.ser   2        50   813.435  815.646  813.357  1.361   811.076  817.711  50
dc_shp_alllayers_2013-00-30-07-00-47.ser   4        100  815.775  823.593  817.352  6.752   813.031  872.658  100

A notable difference between the initial arrays and the cache concerns GC:
the initial arrays are well sized and stored in the RendererContext directly
(a soft reference in my benchmarks), whereas the cache is held through a weak
reference => more GC overhead.

I think it is possible to encapsulate the initial array in a CacheReference
that handles both the cache (the appropriate Dirty / Clean ArrayCache) and the
initial array (tuned size), to simplify the renderer's code.
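To make the idea concrete, here is a minimal sketch of such a CacheReference for int[] (all names are illustrative, not the actual Marlin API; the real class would delegate the fallback paths to the existing ArrayCache):

```java
// Hypothetical sketch: one object owns both the well-sized initial array
// (kept in the RendererContext) and the fallback path to the shared cache,
// so the renderer never special-cases "_initial" arrays itself.
final class IntCacheReference {
    private final int[] initial; // tuned size, never recycled into the cache

    IntCacheReference(final int initialSize) {
        this.initial = new int[initialSize];
    }

    /** Return the initial array if large enough, else fall back to the cache. */
    int[] getArray(final int length) {
        if (length <= initial.length) {
            return initial;
        }
        return new int[length]; // stand-in for ArrayCache.getArray(length)
    }

    /** Recycle: only non-initial arrays go back into the cache. */
    void putArray(final int[] array) {
        if (array != initial) {
            // stand-in for ArrayCache.putArray(array)
        }
    }
}
```

With this shape, widenArray() no longer needs the odd-length trick to detect the initial array: the identity check (array != initial) handles it.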


>
>  I will look again at these cases: I agree "leaking refs" should be set
>> to null to be sure. However, Marlin code is always executing the pattern
>> init() / dispose() so leaking pointers are not critical but OK let's set
>> all of them to null.
>>
>> For initial arrays, it seems redundant to set to null in dispose() and
>> set to initial in init() = useless and counter-productive !
>>
>
> It gets easier to review and maintain code if there is a clear reset to
> some initial conditions is all.  I would like to try to keep the code as
> simple and straightforward as possible while evaluating algorithmic
> improvements without introducing potential bugs with a lot of special case
> code and then we can worry about the 1%'s later.
>

Agreed: I will work on this in step #3: improve / simplify the init /
dispose sequence and reset state in a uniform manner.

Notably, the Stroker / Dasher dispose() must be called as early as possible
from the Renderer directly (in endRendering and / or dispose) in case of
exception.
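The intent can be illustrated with a try/finally guard (a sketch only; the class and method names mirror the discussion, not the exact Marlin code):

```java
// Illustrative pattern: dispose() always runs, even if the rendering
// body throws, so the init() / dispose() pairing can never be broken.
final class RendererSketch {
    boolean disposed = false;

    void endRendering() {
        try {
            // ... flush edges, emit alpha rows; may throw ...
        } finally {
            dispose(); // uniform reset, exception or not
        }
    }

    void dispose() {
        // return arrays to their caches, reset fields to initial state
        disposed = true;
    }
}
```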


> Also, we may find moving the code to native to be more of a performance
> improvement than trying to fight with Hotspot's load/store policies and so
> any code that tries to make "slightly more optimal byte codes" would end up
> not being needed and would add some noise to any such native conversion.
>

So far I prefer coding in Java, as HotSpot does all that optimization for
free and it saves me from coding in C++, which is error prone and much harder
to keep clear.
Moreover, I installed and tried JITWatch + HotSpot's assembler plugin to
look at the generated asm (+ hints): if I have some free time, I could
take a look and see...


> But, I do agree that technically you are correct about some of these
> suggestions leading to some provably redundant code.  The issue is more
> that constantly having to reprove the redundancies gets in the way of some
> other progress we could be making.
>

Ok, I plan to rewrite the dispose code in #3 step.

> And, on the other hand, right now you are the engineer driving this from
> the workload point of view (and I apologize if I am inadvertently in a role
> that seems to be described as "the engineer getting in the way"), so I'm
> willing to rely on your commitment to be diligent about these
> optimizations.  I'm just stating a case for whether they should be the
> focus yet.
>

Thanks; if needed, I can help you run MapBench on your side.

For me, Marlin is fast enough now (like Ductus) to focus on quality: I
mentioned proper gamma correction (but it raises text rendering issues),
resampling filters (moiré...) and adaptive supersampling.



>  Sorry it is not a flaw but an advantage: the initial arrays that can
>> grow in widenXXXArray() have an odd size as indicated by the comment:
>>      // +1 to avoid recycling in Helpers.widenArray()
>>      private final int[] edgePtrs_initial  = new int[INITIAL_SMALL_ARRAY
>> + 1]; // 4K
>>
>
> I saw that after the fact, but my main point remains that this is a
> fragile inter-module optimization.  I was unaware that it could save us 30%
> to have special-cased "initial" arrays, and that fact makes this
> relationship that much more important but I wonder if that is still true.


You're right: it is now ~5%. However, it also generates more garbage (weak
references) and maybe a bigger memory footprint, so for web servers it may
be a lot worse!

>
>   If we ditch the initial arrays because the new "dirty array" management
> removes that performance penalty then this all goes away. And, even if we
> keep the initial array concept it would be nice to find a more reliable way
> to manage that than a side effect of its length.
>

I propose to keep it but use a better object-oriented design: a
CacheReference (holding both the cache & initial array references), very
close to what you proposed, so it handles all the initial-array & caching
complexity consistently per array type.


>  I wanted to have a simple but robust API that can be easily used in
>> Marlin.
>>
>
> I'd like to think that I'm working/promoting towards simple and robust
> APIs as well.
>

Of course, you are.

>
>  I agree the documentation can be improved a lot and I could introduce a
>> new helper method in ArrayCache to create the initial arrays in a safer
>> and explicit way :
>>
>> int[] createNonCachedIntArray(int length) {
>>      // return only odd lengths to detect wrong cache usage
>>      if ((length & 1) == 0) {
>>          length += 1;
>>      }
>>      return new int[length];
>> }
>>
>>
>
> In a future effort, if we find that we must keep some small initial arrays
> for performance, we could make this a more explicit and tunable parameter
> on the caches.  Let's table that until we consider if they buy us any
> performance after these dirty array changes.
>

Ok, let's see...


> Anyway I agree the cache handling can be improved a lot: try using
>> generics for array types with a lambda factory to allocate them.
>>
>
> I just wish that generics could handle primitive arrays, but they can't.
> I did a similar generification thing for the raster format conversion code
> in FX, but I had to play games with the generic types being ByteBuffer and
> IntBuffer and having the variants of those accept the proper type of
> primitive arrays.  It wasn't pretty, but it did work out in the end.
>

I know generics do not work for primitives, but it seems possible to use
abstract methods:

ArrayCache<K>
- abstract K create(int length)
- abstract void copy(K array, ...)
- abstract void fill(K array, ...)

implemented by:

IntArrayCache extends ArrayCache<int[]>
- int[] create(int length) { return new int[length]; }
- void copy(int[] array, ...) { System.arraycopy(array, ...); }

However, it may give ugly code... wrapping all the array utility methods
(length, fill, copy, ...) behind abstract methods.
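A minimal compilable sketch of this shape (class and method names are illustrative, not Marlin's actual API):

```java
// Sketch: write the cache logic once against abstract per-type operations;
// each primitive-array subclass supplies the concrete implementations.
abstract class ArrayCacheSketch<A> {
    abstract A create(int length);
    abstract int lengthOf(A array);
    abstract A widen(A array, int newLength);
}

final class IntArrayCacheSketch extends ArrayCacheSketch<int[]> {
    @Override int[] create(final int length) {
        return new int[length];
    }

    @Override int lengthOf(final int[] array) {
        return array.length;
    }

    @Override int[] widen(final int[] array, final int newLength) {
        final int[] wider = create(newLength);
        System.arraycopy(array, 0, wider, 0, array.length);
        return wider;
    }
}
```

The boxing cost of the generic parameter is avoided because only array references (not elements) cross the abstract boundary; the per-element work stays inside the concrete subclass.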

If it does not work or is too slow, I could use a code generator to produce
all the needed variants from a template. Do you have such a code generator
in OpenJDK?


>
>      The caching code is an example of "code with no obvious flaws
>>     because it is hard to figure out if it has flaws".  We should strive
>>     for "code with obviously no flaws because it is easy to verify that
>>     it is doing everything correctly".
>>
>>
>> I disagree the term "flaw" but I agree it is hard to figure out what is
>> done.
>>
>
> I took that from a talk on producing more secure code.  I think it
> originally came from an overall system design that could easily be proven
> secure.  The first release of Java required all sensitive Java code to be
> constantly asking "am I in the context that this operation is allowed?" and
> the detection was fragile.  It was replaced by code where all sensitive
> operations were rejected unless the surrounding code proved its protected
> context which was much more secure.  "Disallowed unless requested" was
> harder to make a mistake than "allowed unless it looks questionable".
>
> But, that same philosophy serves a lot of code well.  If you start having
> a bunch of "agreements" between modules then they can quickly get out of
> synch through regular maintenance and sometimes spiral out of control and
> you spend more time verifying that you're still satisfying all of these
> conventions rather than letting the code make sure things are kosher
> because only valid operations are expressable.
>

Thanks for the explanations; I totally agree, and sometimes I write overly
complicated (or fancy) code.


>
>  OK, I can work later on a new framework:
>> - ArrayCache<K> abstract class where <K> is byte[], int[], float[]
>> - CleanArrayCache<K> extends ArrayCache<K> = clean arrays
>> - DirtyArrayCache<K> extends ArrayCache<K> = dirty arrays
>>
>
> I think I can trust the existing architecture for at least a putback of
> the last webrev that you sent out while we brainstorm a better caching
> mechanism - let me make one last pass through it and I'll send an OK later
> this afternoon.
>

Perfect.


>
>  What do you mean by "making further real improvements in the algorithms"?
>>
>
> Things like investigating DDA in the inner loops, alternate ways to manage
> crossings that require less sorting, enhancing the stroker to do
> non-uniform scaling (which would eliminate the double transform in some
> cases), potentially passing the crossings to the output loop rather than
> having to convert them into an alpha array, etc.  A lot of that is easier
> to review if there aren't a lot of extra "this line added 1% performance"
> code adding visual noise.
>
>  Who could help ? I am the only one working concretely on this code, help
>> wanted.
>>
>
> I'd love to get more people involved, but history tells us that this is a
> fairly lonely field.  :(
>

Too bad: it has cost me a lot of evenings in past months, so I wonder how
long I can afford such an intense effort (family, holidays)!

> I may have some time now that FX is in tightened bug-fix mode to do some
> tinkering, but mainly I will probably be much quicker on getting reviews
> done which may help as much as having another pair of hands in practice...
>

Great news: you will definitely help me a lot by doing reviews, giving
advice, and maybe contributing some code.

Laurent

