[OpenJDK Rasterizer] RFR: Marlin renderer #2
Laurent Bourgès
bourges.laurent at gmail.com
Tue Jun 9 21:25:54 UTC 2015
Hi Jim,
Here are my comments:
>>> ArrayCache.get*Bucket() - you decided not to do the shift method?
>>
>> I gave up: do you think it is worthwhile as there are only 4 buckets
>> (i.e. few ifs)? Do you have an implementation to propose?
>>
>
> Once I realized it was exponential growth I realized why you gave up. No
> easy suggestions there.
>
Exactly: I did not find any way to do it better!
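For what it's worth, a shift-based lookup is possible if the bucket sizes follow a fixed geometric progression. This is only a hypothetical sketch assuming x4 growth from a 4096 minimum (4096, 16384, 65536, 262144) -- the real Marlin bucket sizes may differ, in which case the constants below do not apply:

```java
// Hypothetical shift-based bucket lookup for 4 buckets sized
// 4096 * 4^k (k = 0..3). Assumption: the actual Marlin growth
// factor may not be 4, so treat the constants as illustrative.
final class BucketIndex {
    static final int MIN_ARRAY_SIZE = 4096; // 2^12

    static int getBucket(final int size) {
        if (size <= MIN_ARRAY_SIZE) {
            return 0;
        }
        // bit length of (size - 1), i.e. ceil(log2(size)) for size > 1
        final int bits = 32 - Integer.numberOfLeadingZeros(size - 1);
        // subtract the 12 bits of the minimum, rounding up to a log4 step
        return (bits - 11) >> 1;
    }
}
```

A branch-free variant like this would of course need to be validated against the growth factor actually used by the caches.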
>>> ArrayCache.getNewSize() will return 0 if fed curSize=1. That may not
>>> happen in practice, but it's a dangerous failure mode to leave around.
>>
>> It is a bit tricky but 0 is not dangerous here as 0 < MIN_ARRAY_SIZE (=
>> 4096): it means that the first bucket will be used to get the new array
>> => int[4096]
>>
>> final int[] res = getIntArray(getNewSize(usedSize)); => return
>> getIntArrayCache(0).getArray(); // first bucket used
>>
>
> Yeah, the input of 0 would be difficult to see in practice, but the fact
> that one needs to examine who calls this method and how bothers me in
> general. I prefer defensive programming for things like this that are
> executed once per array growth. But, that's just a style point.
>
I agree with your point of view; let's postpone this until we agree on
refactoring the array caches.
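For the record, a defensive variant could clamp instead of relying on callers mapping 0 to the first bucket. This is a sketch only: MIN_ARRAY_SIZE mirrors the 4096 value quoted above, and the growth policy here is plain doubling for illustration (the actual getNewSize() formula clearly differs, since it can return 0 for curSize=1):

```java
// Hypothetical defensive growth: never return a size below the minimum.
final class DefensiveGrowth {
    static final int MIN_ARRAY_SIZE = 4096; // mirrors the value quoted above

    static int getNewSize(final int curSize) {
        final int grown = curSize << 1; // illustrative doubling policy
        // clamp so degenerate inputs (e.g. curSize = 1) can never yield 0
        return (grown < MIN_ARRAY_SIZE) ? MIN_ARRAY_SIZE : grown;
    }
}
```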
>
>>> CollinearSimplifier - why the switch from an enum to numeric constants?
>>> Also, why not use a switch statement instead of a cascade of if tests?
>>
>> Fixed: I now use an integer-based switch.
>>
>
> Looks good. I'm still curious why you didn't use an enum - is the code
> more efficient just dealing with basic numbers rather than enum instances?
>
I did not compare their performance, but I prefer switch(int) over
switch(enum) as I would expect it to be faster. I could try to evaluate the
performance impact.
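For reference, the two forms compile differently: javac lowers switch(enum) to a lookup through a synthetic int array indexed by ordinal() followed by a tableswitch, while switch(int) is a tableswitch directly on the value. A minimal illustration (the state names here are made up, not CollinearSimplifier's actual constants):

```java
// Illustrative only: hypothetical states, not CollinearSimplifier's real ones.
final class SwitchDemo {
    enum State { INIT, LINE, DONE }

    static final int ST_INIT = 0, ST_LINE = 1, ST_DONE = 2;

    static int viaEnum(final State s) {
        // javac: synthetic $SwitchMap int[] lookup + tableswitch
        switch (s) {
            case INIT: return 1;
            case LINE: return 2;
            default:   return 3;
        }
    }

    static int viaInt(final int s) {
        // javac: tableswitch directly on the int value
        switch (s) {
            case ST_INIT: return 1;
            case ST_LINE: return 2;
            default:      return 3;
        }
    }
}
```

After JIT compilation the difference is usually just the extra array load, so any measured gap would likely be small.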
>>> FloatMath.java, line 144 - does "((~intpart) >>> 31)" (unsigned shift)
>>> work as well?
>>
>> Fixed.
>>
>
> It doesn't look like my suggestion. Note that I was doing the complement
> before the unsigned shift. That leaves you with 1 for positive&0 and 0 for
> negative. You don't need the "& 1" after it if you do the complement
> before the shift.
>
Sorry, you're right: it is better in my microbenchmarks: 75 ns/op vs 80
ns/op!
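For readers following along, the point of the complement-before-shift form is that it folds the "& 1" away entirely:

```java
// ~x >>> 31 is 1 when the sign bit of x is clear and 0 when it is set,
// so no trailing "& 1" is needed (unlike (x >>> 31) & 1 variants).
final class SignAdjust {
    static int ceilAdjust(final int floatBits) {
        return ~floatBits >>> 31; // 1 for non-negative bit patterns, 0 for negative
    }
}
```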
>>> FloatMath - for the comments, "0" is considered neither negative nor
>>> positive, so you are technically not identifying what happens with 0 since
>>> you only describe what happens for positive and negative numbers.
>>> "Non-negative" means "0 and all positive numbers".
>>
>> Fixed comments.
>>
>
> I don't see it fixed. You added the word "numbers", but you still don't
> include 0 in any of your cases since 0 is neither positive nor negative.
> I'm referring to lines 142,143 & 197,198. Replace the word "positive" with
> "non-negative" in those comments and you will cover the 0 case. To be
> clear:
>
> negative number = any number that is less than 0 (and not 0)
> positive number = any number that is greater than 0 (and not 0)
> non-negative number = any number that is greater than or equal to 0
>
> The sign bit is 0 for all positive numbers and also for 0 - i.e. all
> non-negative numbers
>
I did understand, but there is a subtle difference: 0 is an integral value,
so it is handled by the previous case:

    if (intpart == doppel) {
        return (int) a; // integral value (including 0)
    }
    // sign: 1 for negative, 0 for positive numbers
    // add : 0 for negative and 1 for positive numbers
    return (int) Float.intBitsToFloat(intpart) + (~intpart >>> 31);

So my comments were correct: 0 is excluded from these cases (it is already
handled)!
During last weekend, I implemented alternative ceil / floor implementations
that are definitely faster and exact.
Note that NaN must still be handled as a particular case in Marlin:
(int) NaN = 0 => may lead to rendering artifacts.
    // Cast alternative (exact)
    static final boolean CHECK_OVERFLOW = false; // disabled for performance

    private static int ceil_f_cast(final float a) {
        if (CHECK_OVERFLOW && Float.isNaN(a)) {
            return 0; // consistent with (int) NaN
        }
        final int intpart = (int) a;
        if (a <= intpart || (CHECK_OVERFLOW && intpart == Integer.MAX_VALUE)) {
            return intpart;
        }
        return intpart + 1;
    }

    private static float floor_f_cast(final float a) {
        if (CHECK_OVERFLOW && Float.isNaN(a)) {
            return a;
        }
        final int intpart = (int) a;
        if (a >= intpart || (CHECK_OVERFLOW && intpart == Integer.MIN_VALUE)) {
            return intpart;
        }
        return intpart - 1;
    }
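A minimal self-contained sanity check of the cast variants against Math.ceil / Math.floor. The methods are duplicated here so the snippet compiles alone, with the guards written un-gated (always on) and floor returning int for the comparison:

```java
// Standalone check: cast-based ceil/floor vs Math.ceil/Math.floor.
final class CastFloorCeilCheck {

    static int ceil_f_cast(final float a) {
        if (Float.isNaN(a)) {
            return 0; // matches (int) NaN
        }
        final int intpart = (int) a; // truncation toward zero
        if (a <= intpart || intpart == Integer.MAX_VALUE) {
            return intpart;
        }
        return intpart + 1;
    }

    static int floor_f_cast(final float a) {
        if (Float.isNaN(a)) {
            return 0;
        }
        final int intpart = (int) a;
        if (a >= intpart || intpart == Integer.MIN_VALUE) {
            return intpart;
        }
        return intpart - 1;
    }

    public static void main(final String[] args) {
        for (float a = -8f; a <= 8f; a += 0.25f) { // 0.25f steps are exact
            if (ceil_f_cast(a) != (int) Math.ceil(a)) {
                throw new AssertionError("ceil mismatch at " + a);
            }
            if (floor_f_cast(a) != (int) Math.floor(a)) {
                throw new AssertionError("floor mismatch at " + a);
            }
        }
        System.out.println("cast variants match Math.ceil/floor on [-8, 8]");
    }
}
```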
I performed a microbenchmark (Peter Levart's TestRunner):
#-------------------------------------------------------------
# MathCeil: run duration: 10 000 ms
#
# Warm up (2):
# 4 threads, Tavg = 115,15 ns/op (σ = 0,96 ns/op), Total ops = 347368812 [114,58 (87272588), 114,36 (87440553), 114,89 (87032202), 116,79 (85623469)]
# 4 threads, Tavg = 115,20 ns/op (σ = 0,71 ns/op), Total ops = 347568040 [114,59 (87354151), 116,41 (85987505), 114,99 (87050535), 114,82 (87175849)]
# Measure:
1 threads, Tavg = 113,13 ns/op (σ = 0,00 ns/op), Total ops = 88392563 [113,13 (88392563)]
2 threads, Tavg = 112,83 ns/op (σ = 0,26 ns/op), Total ops = 177252698 [113,09 (88425761), 112,58 (88826937)]
3 threads, Tavg = 112,82 ns/op (σ = 0,21 ns/op), Total ops = 265907557 [112,68 (88748014), 112,67 (88752494), 113,11 (88407049)]
4 threads, Tavg = 113,50 ns/op (σ = 0,80 ns/op), Total ops = 352695086 [112,81 (88714796), 113,23 (88380249), 114,87 (87114689), 113,10 (88485352)]
#
#-------------------------------------------------------------
# FloatMathCeil: run duration: 10 000 ms
#
# Warm up (2):
# 4 threads, Tavg = 80,79 ns/op (σ = 0,88 ns/op), Total ops = 495071198 [82,32 (121484769), 80,25 (124601110), 80,18 (124655607), 80,43 (124329712)]
# 4 threads, Tavg = 82,42 ns/op (σ = 0,31 ns/op), Total ops = 485601732 [82,21 (121713792), 82,25 (121653986), 82,26 (121630856), 82,96 (120603098)]
# Measure:
1 threads, Tavg = 80,03 ns/op (σ = 0,00 ns/op), Total ops = 124950139 [80,03 (124950139)]
2 threads, Tavg = 79,95 ns/op (σ = 0,10 ns/op), Total ops = 250169814 [80,04 (124933671), 79,85 (125236143)]
3 threads, Tavg = 79,94 ns/op (σ = 0,11 ns/op), Total ops = 375279835 [80,10 (124848679), 79,84 (125252827), 79,89 (125178329)]
4 threads, Tavg = 80,35 ns/op (σ = 0,66 ns/op), Total ops = 498263217 [79,93 (125241810), 80,09 (124978578), 81,49 (122771419), 79,91 (125271410)]
#
#-------------------------------------------------------------
# FloatMathCeil_ALT: run duration: 10 000 ms
#
# Warm up (2):
# 4 threads, Tavg = 56,10 ns/op (σ = 0,52 ns/op), Total ops = 713412531 [55,84 (179166311), 55,76 (179434616), 55,80 (179290427), 57,00 (175521177)]
# 4 threads, Tavg = 55,03 ns/op (σ = 0,41 ns/op), Total ops = 727581546 [55,73 (179619584), 54,74 (182874247), 54,75 (182833726), 54,92 (182253989)]
# Measure:
1 threads, Tavg = 54,85 ns/op (σ = 0,00 ns/op), Total ops = 182330728 [54,85 (182330728)]
2 threads, Tavg = 54,78 ns/op (σ = 0,05 ns/op), Total ops = 364993290 [54,73 (182625283), 54,83 (182368007)]
3 threads, Tavg = 62,86 ns/op (σ = 6,24 ns/op), Total ops = 477248587 [54,73 (182701555), 67,83 (147428096), 67,97 (147118936)]
4 threads, Tavg = 55,06 ns/op (σ = 0,40 ns/op), Total ops = 727177009 [54,85 (182491516), 54,75 (182822254), 55,75 (179519147), 54,89 (182344092)]
#
#-------------------------------------------------------------
MathCeil           = 112 ns/op
FloatMathCeil      =  80 ns/op
FloatMathCeil_cast =  55 ns/op (CHECK_OVERFLOW = false)
FloatMathCeil_cast =  65 ns/op (CHECK_OVERFLOW = true)
With these new implementations, Marlin performance is as fast as the initial
Marlin with the approximate ceil / floor implementations:
    // Hack alternative (inexact but precision good enough)
    private static final int BIG_ENOUGH_INT = 1024 * 1024 * 1024; // ~1e9 max
    private static final double BIG_ENOUGH_DBL = BIG_ENOUGH_INT;

    private static int ceil_f_hack(final float a) {
        return BIG_ENOUGH_INT - (int) (BIG_ENOUGH_DBL - a);
    }

    private static float floor_f_hack(final float a) {
        return (int) (a + BIG_ENOUGH_DBL) - BIG_ENOUGH_INT;
    }
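Why the hack is only "precision good enough": near 2^30 a double ulp is 2^-23, so fractional parts smaller than roughly 6e-8 are rounded away before the cast. A small demonstration of the trick in isolation (not Marlin code):

```java
// Shows the precision limit of the BIG_ENOUGH trick.
final class HackPrecisionDemo {
    private static final int BIG_ENOUGH_INT = 1024 * 1024 * 1024; // 2^30
    private static final double BIG_ENOUGH_DBL = BIG_ENOUGH_INT;

    static int ceil_f_hack(final float a) {
        // (int) truncation of the positive value BIG - a acts as floor,
        // and BIG - floor(BIG - a) == ceil(a) ... up to double rounding
        return BIG_ENOUGH_INT - (int) (BIG_ENOUGH_DBL - a);
    }

    public static void main(final String[] args) {
        System.out.println(ceil_f_hack(1.5f));  // 2 (correct)
        System.out.println(ceil_f_hack(1e-8f)); // 0, but the exact ceil is 1
    }
}
```

For subpixel coordinates well above that magnitude the rounding is harmless, which is presumably why the hack was "good enough" in practice.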
To conclude: do you accept the new floor / ceil alternatives?
>>> Helpers - there are a number of /2.0f that you changed to /2f, but how
>>> about *0.5f instead since multiplication is often faster than division.
>>
>> I tried to replace such division cases by multiplication but I did not
>> get any performance gain: certainly HotSpot is doing a great job in
>> optimizing maths => I replaced all "tricky" multiplications by power-of-2
>> divisions to be consistent in Marlin, more mathematically correct and
>> explicit.
>>
>> Maybe / 2f or / 4f can be done by shifting the exponent part in the
>> assembler code (see scalb)?
>>
>
> OK, this may be an old performance issue that is no longer reproducible on
> modern chips and compilers.
>
> Out of curiosity, which platforms did you test this on?
>
It is an HP ZBook 15 laptop running Ubuntu 14.04 (x64): i7 4800M, but I
disabled HyperThreading in the BIOS (4 true CPU cores):
Cache L1d : 32K
Cache L1i : 32K
Cache L2 : 256K
Cache L3 : 6144K
NUMA node0 CPU(s): 0-3
>>> Renderer, line 401 et al - assuming that the initial size of the array is
>>> larger than SIZEOF_EDGE_BYTES otherwise edgePtr<<1 may not be large enough
>>> to add SIZEOF_EDGE_BYTES. Probably OK in practice for now, but could be a
>>> bug waiting to happen later if someone gets fancy with how that _edges
>>> array is initialized.
>>
>> I agree it looks tricky as it assumes _edges.length is always >
>> SIZEOF_EDGE_BYTES. I could add a comment?
>>
>
> Or an assert? Either or.
>
Will do.
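A possible shape for that assert -- the names mirror the discussion, but the constant value and the helper are illustrative, not Marlin's actual code:

```java
// Hypothetical sketch of the suggested assert: doubling the array only
// guarantees room for one more edge record if the array already holds at
// least one record's worth of bytes.
final class EdgeArrayGrowth {
    static final int SIZEOF_EDGE_BYTES = 32; // illustrative, not Marlin's value

    static byte[] growForOneEdge(final byte[] edges) {
        // guards the assumption the review points out (requires -ea to fire)
        assert edges.length >= SIZEOF_EDGE_BYTES
                : "_edges must start larger than SIZEOF_EDGE_BYTES";
        final byte[] larger = new byte[edges.length << 1];
        System.arraycopy(edges, 0, larger, 0, edges.length);
        return larger;
    }
}
```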
>
>>> Renderer.java, line 1260ish - the new way you set spMaxY has it
>>> potentially be 1 larger than it used to be...?
>>
>> No, I checked again: it is written differently but equivalent. I probably
>> misnamed the local variable _boundsMaxY = boundsMaxY - 1; I renamed it
>> to _boundsMaxYm1 to be clearer.
>>
>
> Ah, I see it now. On the other hand, now you end up with the clunky
> "Yminus1 + 1" later on. Since this code is executed once per rendering
> cycle, I don't see any performance concerns here to require loading the
> fields into local variables. Just using the raw fields for the 2 simple
> calculations may be clearer on the other hand.
>
I wanted to factor out the subtraction and avoid repeating it twice. I
agree it is tricky.
>
>>> RendererContext - too many methods with the words "dirty",
>>> "byte/int/float", "array", and "cache" in them in different orders. They
>>> all blend together name-wise and I can't determine what each is supposed to
>>> do in order to verify that this all makes sense. We need better
>>> nomenclature and/or packaging for these facilities, but I don't have any
>>> suggestions off the top of my head other than, for now, please review if
>>> the names of all of them follow some pattern, fix any that don't follow the
>>> pattern and then document the pattern in the code so that hopefully it will
>>> explain itself so I can review it in an informed manner.
>>
>> I renamed a few methods to have a consistent naming convention: see the intro.
>>
>
> OK, let me make another pass through then now that we have something
> consistent. I wanted to get these comments out to you on what I've seen so
> far sooner rather than later, though.
>
Ok, tell me if it is clearer and enough for this second step.
>
>>> Stroker.curveTo() - (just observing) we go to all the trouble of storing
>>> known vars into the middle array, but then we access those known vars from
>>> the array in the subsequent code. But, we know what values we stored there
>>> so wouldn't it be faster just to use the original values, rather than fetch
>>> them back out of an array? The array in this case serves only to obscure
>>> what data we are computing with. Although, arguably, we eventually pass
>>> the mid/middle array to other functions so we do still need those values
>>> stored in the array (unless those functions can be converted from array
>>> storage to direct values as well?). This is just an observation, but not a
>>> problem.
>>
>> I agree that this part of the code is obscure to me: I was wondering
>> what its objective is ... But the mid/middle array is used by
>> findSubdivPoints(mid), breakPtsAtTs(mid) and computeOffsetCubic(mid) ...
>> so I did not change the working code much!
>>
>> If you think we could have some performance gain, I may try rewriting
>> that part ... but I think eliminating collinear segments first is more
>> efficient.
>>
>
> This is probably something to look at later after these changes go in. For
> a few minutes of changing a couple of method signatures it might result in
> some cleaner code, but not something to rework just now.
>
Ok.
>
>>> Stroker line 1079,1155 - why move the declaration of curCurveOff outside
>>> the loop?
>>
>> Reverted (even if I prefer declarations outside loops).
>>
>
> I guess I prefer to limit the scope of variables whenever possible.
>
Ok.
>> PS: If you prefer, I can replace all divisions by multiplications, but it
>> should be consistent across the code even if it has no impact in my
>> benchmarks!
>>
>
> No, that was an optimization I've seen in the past, but if you have
> evidence to the contrary we should be fine.
>
I did not observe any performance gain, and it may lead to issues with
floating-point precision ...
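One nuance on the precision point: for power-of-two divisors the reciprocal is exact in binary floating point, so x / 2f and x * 0.5f produce bit-identical results and switching between them is always safe; only non-power-of-two divisors (e.g. / 3f vs * (1f/3f)) can differ. A quick check:

```java
// Power-of-two division equals multiplication by the exact reciprocal.
final class DivVsMulDemo {
    public static void main(final String[] args) {
        for (float x = -10f; x <= 10f; x += 0.125f) { // exact 1/8 steps
            if (Float.floatToIntBits(x / 2f) != Float.floatToIntBits(x * 0.5f)
                    || Float.floatToIntBits(x / 4f) != Float.floatToIntBits(x * 0.25f)) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println("x / 2f == x * 0.5f and x / 4f == x * 0.25f bit-for-bit");
    }
}
```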
Thanks for your comments,
I could prepare a new webrev including my new FloatMath alternatives.
Laurent