[OpenJDK Rasterizer] RFR: Marlin renderer #2
Laurent Bourgès
bourges.laurent at gmail.com
Tue Jun 9 21:25:54 UTC 2015
Hi Jim,
Here are my comments:
>>> ArrayCache.get*Bucket() - you decided not to do the shift method?
>>
>> I gave up: do you think it is worthwhile as there are only 4 buckets
>> (i.e. few ifs)? Do you have an implementation to propose?
>>
>
> Once I realized it was exponential growth I realized why you gave up. No
> easy suggestions there.
>
Exactly: I did not find any way to do it better!
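For what it's worth, a shift-based lookup is possible if the bucket sizes follow a fixed geometric progression. This is only a hypothetical sketch assuming x4 growth from a 4096 minimum (4096, 16384, 65536, 262144) -- the real Marlin bucket sizes may differ, in which case the constants below do not apply:

```java
// Hypothetical shift-based bucket lookup for 4 buckets sized
// 4096 * 4^k (k = 0..3). Assumption: the actual Marlin growth
// factor may not be 4, so treat the constants as illustrative.
final class BucketIndex {
    static final int MIN_ARRAY_SIZE = 4096; // 2^12

    static int getBucket(final int size) {
        if (size <= MIN_ARRAY_SIZE) {
            return 0;
        }
        // bit length of (size - 1), i.e. ceil(log2(size)) for size > 1
        final int bits = 32 - Integer.numberOfLeadingZeros(size - 1);
        // subtract the 12 bits of the minimum, rounding up to a log4 step
        return (bits - 11) >> 1;
    }
}
```

A branch-free variant like this would of course need to be validated against the growth factor actually used by the caches.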
>>> ArrayCache.getNewSize() will return 0 if fed curSize=1. That may not
>>> happen in practice, but it's a dangerous failure mode to leave around.
>>
>> It is a bit tricky but 0 is not dangerous here as 0 < MIN_ARRAY_SIZE (=
>> 4096): it means that the first bucket will be used to get the new array
>> => int[4096]
>>
>> final int[] res = getIntArray(getNewSize(usedSize)); => return
>> getIntArrayCache(0).getArray(); // first bucket used
>>
>
> Yeah, the input of 0 would be difficult to see in practice, but the fact
> that one needs to examine who calls this method and how bothers me in
> general. I prefer defensive programming for things like this that are
> executed once per array growth. But, that's just a style point.
>
I agree with your point of view; let's postpone this until we agree on
refactoring the array caches.
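For the record, a defensive variant could clamp instead of relying on callers mapping 0 to the first bucket. This is a sketch only: MIN_ARRAY_SIZE mirrors the 4096 value quoted above, and the growth policy here is plain doubling for illustration (the actual getNewSize() formula clearly differs, since it can return 0 for curSize=1):

```java
// Hypothetical defensive growth: never return a size below the minimum.
final class DefensiveGrowth {
    static final int MIN_ARRAY_SIZE = 4096; // mirrors the value quoted above

    static int getNewSize(final int curSize) {
        final int grown = curSize << 1; // illustrative doubling policy
        // clamp so degenerate inputs (e.g. curSize = 1) can never yield 0
        return (grown < MIN_ARRAY_SIZE) ? MIN_ARRAY_SIZE : grown;
    }
}
```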
>
>>> CollinearSimplifier - why the switch from an enum to numeric constants?
>>> Also, why not use a switch statement instead of a cascade of if tests?
>>
>> Fixed: I now use an integer-based switch.
>>
>
> Looks good. I'm still curious why you didn't use an enum - is the code
> more efficient just dealing with basic numbers rather than enum instances?
>
I did not compare their performance, but I prefer switch(int) over
switch(enum) as I would expect it to be faster. I could try to evaluate the
performance impact.
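For reference, the two forms compile differently: javac lowers switch(enum) to a lookup through a synthetic int array indexed by ordinal() followed by a tableswitch, while switch(int) is a tableswitch directly on the value. A minimal illustration (the state names here are made up, not CollinearSimplifier's actual constants):

```java
// Illustrative only: hypothetical states, not CollinearSimplifier's real ones.
final class SwitchDemo {
    enum State { INIT, LINE, DONE }

    static final int ST_INIT = 0, ST_LINE = 1, ST_DONE = 2;

    static int viaEnum(final State s) {
        // javac: synthetic $SwitchMap int[] lookup + tableswitch
        switch (s) {
            case INIT: return 1;
            case LINE: return 2;
            default:   return 3;
        }
    }

    static int viaInt(final int s) {
        // javac: tableswitch directly on the int value
        switch (s) {
            case ST_INIT: return 1;
            case ST_LINE: return 2;
            default:      return 3;
        }
    }
}
```

After JIT compilation the difference is usually just the extra array load, so any measured gap would likely be small.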
>>> FloatMath.java, line 144 - does "((~intpart) >>> 31)" (unsigned shift)
>>> work as well?
>>
>> Fixed.
>>
>
> It doesn't look like my suggestion. Note that I was doing the complement
> before the unsigned shift. That leaves you with 1 for positive&0 and 0 for
> negative. You don't need the "& 1" after it if you do the complement
> before the shift.
>
Sorry, you're right: it is better in my microbenchmarks: 75 ns/op vs 80
ns/op!
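For readers following along, the point of the complement-before-shift form is that it folds the "& 1" away entirely:

```java
// ~x >>> 31 is 1 when the sign bit of x is clear and 0 when it is set,
// so no trailing "& 1" is needed (unlike (x >>> 31) & 1 variants).
final class SignAdjust {
    static int ceilAdjust(final int floatBits) {
        return ~floatBits >>> 31; // 1 for non-negative bit patterns, 0 for negative
    }
}
```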
>>> FloatMath - for the comments, "0" is considered neither negative nor
>>> positive, so you are technically not identifying what happens with 0 since
>>> you only describe what happens for positive and negative numbers.
>>> "Non-negative" means "0 and all positive numbers".
>>
>> Fixed comments.
>>
>
> I don't see it fixed. You added the word "numbers", but you still don't
> include 0 in any of your cases since 0 is neither positive nor negative.
> I'm referring to lines 142,143 & 197,198. Replace the word "positive" with
> "non-negative" in those comments and you will cover the 0 case. To be
> clear:
>
> negative number = any number that is less than 0 (and not 0)
> positive number = any number that is greater than 0 (and not 0)
> non-negative number = any number that is greater than or equal to 0
>
> The sign bit is 0 for all positive numbers and also for 0 - i.e. all
> non-negative numbers
>
I did understand, but there is a subtle difference: 0 is an integral value,
so it is handled by the previous case:

    if (intpart == doppel) {
        return (int) a; // integral value (including 0)
    }
    // sign: 1 for negative, 0 for positive numbers
    // add : 0 for negative and 1 for positive numbers
    return (int) Float.intBitsToFloat(intpart) + (~intpart >>> 31);

So my comments were correct: 0 is excluded from these cases (it is already
handled)!
During last weekend, I implemented alternative ceil / floor implementations
that are definitely faster and exact.
Note that NaN must still be handled as a particular case in Marlin:
(int) NaN = 0 => may lead to rendering artifacts.
    // Cast alternative (exact)
    static final boolean CHECK_OVERFLOW = false; // disabled for performance

    private static int ceil_f_cast(final float a) {
        if (CHECK_OVERFLOW && Float.isNaN(a)) {
            return 0; // consistent with (int) NaN
        }
        final int intpart = (int) a;
        if (a <= intpart || (CHECK_OVERFLOW && intpart == Integer.MAX_VALUE)) {
            return intpart;
        }
        return intpart + 1;
    }

    private static float floor_f_cast(final float a) {
        if (CHECK_OVERFLOW && Float.isNaN(a)) {
            return a;
        }
        final int intpart = (int) a;
        if (a >= intpart || (CHECK_OVERFLOW && intpart == Integer.MIN_VALUE)) {
            return intpart;
        }
        return intpart - 1;
    }
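A minimal self-contained sanity check of the cast variants against Math.ceil / Math.floor. The methods are duplicated here so the snippet compiles alone, with the guards written un-gated (always on) and floor returning int for the comparison:

```java
// Standalone check: cast-based ceil/floor vs Math.ceil/Math.floor.
final class CastFloorCeilCheck {

    static int ceil_f_cast(final float a) {
        if (Float.isNaN(a)) {
            return 0; // matches (int) NaN
        }
        final int intpart = (int) a; // truncation toward zero
        if (a <= intpart || intpart == Integer.MAX_VALUE) {
            return intpart;
        }
        return intpart + 1;
    }

    static int floor_f_cast(final float a) {
        if (Float.isNaN(a)) {
            return 0;
        }
        final int intpart = (int) a;
        if (a >= intpart || intpart == Integer.MIN_VALUE) {
            return intpart;
        }
        return intpart - 1;
    }

    public static void main(final String[] args) {
        for (float a = -8f; a <= 8f; a += 0.25f) { // 0.25f steps are exact
            if (ceil_f_cast(a) != (int) Math.ceil(a)) {
                throw new AssertionError("ceil mismatch at " + a);
            }
            if (floor_f_cast(a) != (int) Math.floor(a)) {
                throw new AssertionError("floor mismatch at " + a);
            }
        }
        System.out.println("cast variants match Math.ceil/floor on [-8, 8]");
    }
}
```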
I performed a microbenchmark (Peter Levart's TestRunner):
#-------------------------------------------------------------
# MathCeil: run duration: 10 000 ms
#
# Warm up (2):
# 4 threads, Tavg = 115,15 ns/op (σ = 0,96 ns/op), Total ops = 347368812 [114,58 (87272588), 114,36 (87440553), 114,89 (87032202), 116,79 (85623469)]
# 4 threads, Tavg = 115,20 ns/op (σ = 0,71 ns/op), Total ops = 347568040 [114,59 (87354151), 116,41 (85987505), 114,99 (87050535), 114,82 (87175849)]
# Measure:
1 threads, Tavg = 113,13 ns/op (σ = 0,00 ns/op), Total ops = 88392563 [113,13 (88392563)]
2 threads, Tavg = 112,83 ns/op (σ = 0,26 ns/op), Total ops = 177252698 [113,09 (88425761), 112,58 (88826937)]
3 threads, Tavg = 112,82 ns/op (σ = 0,21 ns/op), Total ops = 265907557 [112,68 (88748014), 112,67 (88752494), 113,11 (88407049)]
4 threads, Tavg = 113,50 ns/op (σ = 0,80 ns/op), Total ops = 352695086 [112,81 (88714796), 113,23 (88380249), 114,87 (87114689), 113,10 (88485352)]
#
#-------------------------------------------------------------
# FloatMathCeil: run duration: 10 000 ms
#
# Warm up (2):
# 4 threads, Tavg = 80,79 ns/op (σ = 0,88 ns/op), Total ops = 495071198 [82,32 (121484769), 80,25 (124601110), 80,18 (124655607), 80,43 (124329712)]
# 4 threads, Tavg = 82,42 ns/op (σ = 0,31 ns/op), Total ops = 485601732 [82,21 (121713792), 82,25 (121653986), 82,26 (121630856), 82,96 (120603098)]
# Measure:
1 threads, Tavg = 80,03 ns/op (σ = 0,00 ns/op), Total ops = 124950139 [80,03 (124950139)]
2 threads, Tavg = 79,95 ns/op (σ = 0,10 ns/op), Total ops = 250169814 [80,04 (124933671), 79,85 (125236143)]
3 threads, Tavg = 79,94 ns/op (σ = 0,11 ns/op), Total ops = 375279835 [80,10 (124848679), 79,84 (125252827), 79,89 (125178329)]
4 threads, Tavg = 80,35 ns/op (σ = 0,66 ns/op), Total ops = 498263217 [79,93 (125241810), 80,09 (124978578), 81,49 (122771419), 79,91 (125271410)]
#
#-------------------------------------------------------------
# FloatMathCeil_ALT: run duration: 10 000 ms
#
# Warm up (2):
# 4 threads, Tavg = 56,10 ns/op (σ = 0,52 ns/op), Total ops = 713412531 [55,84 (179166311), 55,76 (179434616), 55,80 (179290427), 57,00 (175521177)]
# 4 threads, Tavg = 55,03 ns/op (σ = 0,41 ns/op), Total ops = 727581546 [55,73 (179619584), 54,74 (182874247), 54,75 (182833726), 54,92 (182253989)]
# Measure:
1 threads, Tavg = 54,85 ns/op (σ = 0,00 ns/op), Total ops = 182330728 [54,85 (182330728)]
2 threads, Tavg = 54,78 ns/op (σ = 0,05 ns/op), Total ops = 364993290 [54,73 (182625283), 54,83 (182368007)]
3 threads, Tavg = 62,86 ns/op (σ = 6,24 ns/op), Total ops = 477248587 [54,73 (182701555), 67,83 (147428096), 67,97 (147118936)]
4 threads, Tavg = 55,06 ns/op (σ = 0,40 ns/op), Total ops = 727177009 [54,85 (182491516), 54,75 (182822254), 55,75 (179519147), 54,89 (182344092)]
#
#-------------------------------------------------------------
MathCeil           = 112 ns/op
FloatMathCeil      =  80 ns/op
FloatMathCeil_cast =  55 ns/op (CHECK_OVERFLOW = false)
FloatMathCeil_cast =  65 ns/op (CHECK_OVERFLOW = true)
With these new implementations, Marlin performance is as fast as the initial
Marlin with the approximate ceil / floor implementations:
    // Hack alternative (inexact but precision good enough)
    private static final int BIG_ENOUGH_INT = 1024 * 1024 * 1024; // ~1e9 max
    private static final double BIG_ENOUGH_DBL = BIG_ENOUGH_INT;

    private static int ceil_f_hack(final float a) {
        return BIG_ENOUGH_INT - (int) (BIG_ENOUGH_DBL - a);
    }

    private static float floor_f_hack(final float a) {
        return (int) (a + BIG_ENOUGH_DBL) - BIG_ENOUGH_INT;
    }
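Why the hack is only "precision good enough": near 2^30 a double ulp is 2^-23, so fractional parts smaller than roughly 6e-8 are rounded away before the cast. A small demonstration of the trick in isolation (not Marlin code):

```java
// Shows the precision limit of the BIG_ENOUGH trick.
final class HackPrecisionDemo {
    private static final int BIG_ENOUGH_INT = 1024 * 1024 * 1024; // 2^30
    private static final double BIG_ENOUGH_DBL = BIG_ENOUGH_INT;

    static int ceil_f_hack(final float a) {
        // (int) truncation of the positive value BIG - a acts as floor,
        // and BIG - floor(BIG - a) == ceil(a) ... up to double rounding
        return BIG_ENOUGH_INT - (int) (BIG_ENOUGH_DBL - a);
    }

    public static void main(final String[] args) {
        System.out.println(ceil_f_hack(1.5f));  // 2 (correct)
        System.out.println(ceil_f_hack(1e-8f)); // 0, but the exact ceil is 1
    }
}
```

For subpixel coordinates well above that magnitude the rounding is harmless, which is presumably why the hack was "good enough" in practice.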
To conclude: do you accept the new floor / ceil alternatives?
>>> Helpers - there are a number of /2.0f that you changed to /2f, but how
>>> about *0.5f instead since multiplication is often faster than division.
>>
>> I tried to replace such division cases by multiplication but I did not
>> get any performance gain: certainly HotSpot is doing a great job in
>> optimizing maths => I replaced all "tricky" multiplications by power-of-2
>> divisions to be consistent in Marlin, more mathematically correct and
>> explicit.
>>
>> Maybe / 2f or / 4f can be done by shifting the exponent part in the
>> assembler code (see scalb)?
>>
>
> OK, this may be an old performance issue that is no longer reproducible on
> modern chips and compilers.
>
> Out of curiosity, which platforms did you test this on?
>
It is an HP ZBook 15 laptop running Ubuntu 14.04 (x64): i7 4800M, but I
disabled HyperThreading in the BIOS (4 true CPU cores):
Cache L1d : 32K
Cache L1i : 32K
Cache L2 : 256K
Cache L3 : 6144K
NUMA node0 CPU(s): 0-3
>>> Renderer, line 401 et al - assuming that the initial size of the array is
>>> larger than SIZEOF_EDGE_BYTES otherwise edgePtr<<1 may not be large enough
>>> to add SIZEOF_EDGE_BYTES. Probably OK in practice for now, but could be a
>>> bug waiting to happen later if someone gets fancy with how that _edges
>>> array is initialized.
>>
>> I agree it looks tricky as it assumes _edges.length is always >
>> SIZEOF_EDGE_BYTES. I could add a comment?
>>
>
> Or an assert? Either or.
>
Will do.
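A possible shape for that assert -- the names mirror the discussion, but the constant value and the helper are illustrative, not Marlin's actual code:

```java
// Hypothetical sketch of the suggested assert: doubling the array only
// guarantees room for one more edge record if the array already holds at
// least one record's worth of bytes.
final class EdgeArrayGrowth {
    static final int SIZEOF_EDGE_BYTES = 32; // illustrative, not Marlin's value

    static byte[] growForOneEdge(final byte[] edges) {
        // guards the assumption the review points out (requires -ea to fire)
        assert edges.length >= SIZEOF_EDGE_BYTES
                : "_edges must start larger than SIZEOF_EDGE_BYTES";
        final byte[] larger = new byte[edges.length << 1];
        System.arraycopy(edges, 0, larger, 0, edges.length);
        return larger;
    }
}
```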
>
>>> Renderer.java, line 1260ish - the new way you set spMaxY has it
>>> potentially be 1 larger than it used to be...?
>>
>> No, I checked again: it is written differently but equivalent. I probably
>> misnamed the local variable _boundsMaxY = boundsMaxY - 1; I renamed it
>> to _boundsMaxYm1 to be clearer.
>>
>
> Ah, I see it now. On the other hand, now you end up with the clunky
> "Yminus1 + 1" later on. Since this code is executed once per rendering
> cycle, I don't see any performance concerns here to require loading the
> fields into local variables. Just using the raw fields for the 2 simple
> calculations may be clearer on the other hand.
>
I wanted to factor out the subtraction and avoid repeating it twice. I
agree it is tricky.
>
>>> RendererContext - too many methods with the words "dirty",
>>> "byte/int/float", "array", and "cache" in them in different orders. They
>>> all blend together name-wise and I can't determine what each is supposed to
>>> do in order to verify that this all makes sense. We need better
>>> nomenclature and/or packaging for these facilities, but I don't have any
>>> suggestions off the top of my head other than, for now, please review if
>>> the names of all of them follow some pattern, fix any that don't follow the
>>> pattern and then document the pattern in the code so that hopefully it will
>>> explain itself so I can review it in an informed manner.
>>
>> I renamed a few methods to have a consistent naming convention: see the intro.
>>
>
> OK, let me make another pass through then now that we have something
> consistent. I wanted to get these comments out to you on what I've seen so
> far sooner rather than later, though.
>
Ok, tell me if it is clearer and enough for this second step.
>
>>> Stroker.curveTo() - (just observing) we go to all the trouble of storing
>>> known vars into the middle array, but then we access those known vars from
>>> the array in the subsequent code. But, we know what values we stored there
>>> so wouldn't it be faster just to use the original values, rather than fetch
>>> them back out of an array? The array in this case serves only to obscure
>>> what data we are computing with. Although, arguably, we eventually pass
>>> the mid/middle array to other functions so we do still need those values
>>> stored in the array (unless those functions can be converted from array
>>> storage to direct values as well?). This is just an observation, but not a
>>> problem.
>>
>> I agree that this part of the code is obscure to me: I was wondering
>> what its objective is ... But the mid/middle array is used by
>> findSubdivPoints(mid), breakPtsAtTs(mid) and computeOffsetCubic(mid) ...
>> so I did not change the working code much!
>>
>> If you think we could have some performance gain, I may try rewriting
>> that part ... but I think eliminating collinear segments first is more
>> efficient.
>>
>
> This is probably something to look at later after these changes go in. For
> a few minutes of changing a couple of method signatures it might result in
> some cleaner code, but not something to rework just now.
>
Ok.
>
>>> Stroker line 1079,1155 - why move the declaration of curCurveOff outside
>>> the loop?
>>
>> Reverted (even if I prefer declarations outside loops).
>>
>
> I guess I prefer to limit the scope of variables whenever possible.
>
Ok.
>> PS: If you prefer, I can replace all divisions by multiplications, but it
>> should be consistent across the code even if it has no impact in my
>> benchmarks!
>>
>
> No, that was an optimization I've seen in the past, but if you have
> evidence to the contrary we should be fine.
>
I did not observe any performance gain, and it may lead to issues with
floating-point precision ...
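One nuance on the precision point: for power-of-two divisors the reciprocal is exact in binary floating point, so x / 2f and x * 0.5f produce bit-identical results and switching between them is always safe; only non-power-of-two divisors (e.g. / 3f vs * (1f/3f)) can differ. A quick check:

```java
// Power-of-two division equals multiplication by the exact reciprocal.
final class DivVsMulDemo {
    public static void main(final String[] args) {
        for (float x = -10f; x <= 10f; x += 0.125f) { // exact 1/8 steps
            if (Float.floatToIntBits(x / 2f) != Float.floatToIntBits(x * 0.5f)
                    || Float.floatToIntBits(x / 4f) != Float.floatToIntBits(x * 0.25f)) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println("x / 2f == x * 0.5f and x / 4f == x * 0.25f bit-for-bit");
    }
}
```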
Thanks for your comments,
I could prepare a new webrev including my new FloatMath alternatives.
Laurent