[OpenJDK Rasterizer] Fwd: Re: Fwd: RFR: Marlin renderer #3

Mon Aug 24 22:08:04 UTC 2015

Hi Laurent,

I'm feeling much better this week (and can even start typing with my bad 
hand in short spurts now), so I'll hopefully be more on the ball now.

Short version - looks good to go, consider the following input at your 
own discretion and go ahead and push.

Reviewing s3.4:

On 8/17/15 2:34 PM, Laurent Bourgès wrote:
> Changes:
> - Renderer: remove unused D_ERR_STEP_MAX and 0x1 replaced by 1 for
> readability

Renderer, line 319 - it makes sense to me that the scale changes would 
be based off of the same metric.  Basically the constraint is keeping 
the speed (dxy) between INC and DEC bounds.  I'll admit, though, that I 
haven't been through the actual derivations of the math here, but I 
thought I'd give my 2 cents on your comment there.

Renderer, line 479 - I often put extra parens in for cases like this 
because it isn't common knowledge whether cast or add is higher in 
precedence.  I generally assume no better knowledge than the basic 
PEMDAS rules that we learn in grade school. But that's just my style (I 
also don't have the OOO table memorized, but maybe that's just my 
senility ;).

0x1 vs 1 - I usually use 0x1 when I'm using numbers for their bit 
patterns, so | and & typically use 0x constants in my style philosophy. 
  In the case of "1" it's a degenerate case that doesn't seem to matter, 
but the added 0x tells me to think in terms of bits, not cardinal values.

Renderer, note for future investigation - we store the direction flag in 
Y_MAX, but then we need to do manipulation to process Y_MAX and we also 
need more manipulation to transfer that bit to crossing values.  What if 
instead we stored it in the lowest bit of curx?  We'd then have 31.31 
fixed point, we'd have to process slope so that its whole part was 
doubled (to skip affecting the LSbit of curx) and we'd have to do the 
carry using "((error >> 31) << 1)", but we'd get the transfer of the 
direction bit to the crossing for free and all in all, the total 
operations may be reduced (at the cost of 1 bit of X coordinate range).

> - MarlinRenderingEngine: optimized loop in pathTo()

Looks good

> - Dasher: TODO remains related to dash length issue

OK.

> Hope it is good now, I plan to work soon on the Array Cache & dispose code.

Looks great!  Consider the style issues I mentioned above if you wish 
and otherwise it is good to go.  I don't need to see any further webrevs 
on those issues.

One more suggestion for optimization below can be considered in 
subsequent work:

>     For CHECK_NAN, how fast is Math.isNaN?  Is it worth first comparing
>     the intpart to the value that NaN converts to before calling isNaN?
>     The majority case, by a wide margin, is that the value is not NaN.
>
>
> That's the aim: I test first the most probable case (a <= intpart) for
> ceil() and then check possible overflow and finally check for NaN values
> to avoid adding / substracting 1 !
> I made benchmarks and this implementation optimized for the integer
> domain works very well: 2x faster.

I guess I wasn't clear.  I understand why the 3 parts of the test are in 
the order that they are in, but I was referring only to the "CHECK_NAN 
&& Float.isNaN(a)" clause.  Is that worth adding one more test for 
"CHECK_NAN && intpart == NAN_TO_INT_VAL && Float.isNaN(a)" in that one 
clause?

For the ceil_int case, a very common case is that a is non-negative and 
non-integer.  In that case, "a <= intpart" will fail and you will do the 
CHECK_NAN test before you can return "intpart+1".

For the floor_int case, you would only get to the CHECK_NAN case on 
negative-non-integers which we don't expect a lot of, but in general, 
the added test might make it slightly faster.

>     I was under the understanding that if a constant is declared final
>     and static, then it is compiled directly into the code and there is
>     no constant lookup.  Check the bytecodes and you should see a
>     hard-coded literal 0x7fffffff everywhere it is used...
>
>
> I agree but it seems slightly faster: probably loading the literal is a
> bit slower than loading from the stack ?
> Maybe some hotspot experts could give their opinion.
> As this is the renderer hot loop, I prefer making it as fast as possible.

In the case of 0x7fffffff it may be because of the size of the constant. 
  A smaller literal like "1" or "2" might be faster as an in-lined 
source literal.

>     I think that at least for now they should either be
>     sun.java2d.renderer.marlin.* or
>     just sun.java2d.marlin.*
>
>
> Ok, I can rename them in the next webrev as it requires me to upgrade
> all my testing & benchmarking scripts...

Sounds fine to me.  This is more "something we need to finalize before 
integrating into the master workspace" than an issue for any given webrev.

>     But the more important question may be around how much benefit these
>     bring compared to the additional testing overhead of so many
>     combinations ?
>
>
> I advocate it is not easy to test all combinations but it seems not a
> reasonable testing objective.
>
> As I said, it is mainly tuning or quality settings but only few are
> important (subpixels X / Y, tile size, initial pixel size).

Another thing to consider is that customers would have to do lots of 
testing of performance of tweaking these values and so they would 
discover the problems and report them to us before they become reliant 
on them.  However, down the road they may accept version+1 of Marlin 
into their existing environment without fully testing and we may 
introduce a bug that might take them by surprise.

(Again, something to discuss and decide in the long term - not for this 
particular patch)...

			...jim