From bourges.laurent at gmail.com  Fri Jan 15 17:14:46 2016
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Fri, 15 Jan 2016 18:14:46 +0100
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To:
References: <5627E1B0.4060206@oracle.com> <5658396E.2090605@oracle.com> <56683DF8.1020607@oracle.com>
Message-ID:

Sergey,

Did you make any progress?

I finally looked at the preprocessed C code and also enabled the
ftree-vectorizer-verbose output:

CFLAGS := -save-temps -ftree-vectorize -ftree-vectorizer-verbose=2 $(CFLAGS_JDKLIB) $(LIBAWT_CFLAGS), \

I looked at the IntArgbPreSrcMaskFill hotspot (in my EllipseFillTest)
according to oprofile:

samples  %        image name  symbol name
469141   30.0043  libawt.so   IntArgbPreSrcMaskFill

Here is the preprocessed C code:
- It is still complex to read as there are many do { } while (0) blocks
due to macro expansion...

void IntArgbSrcMaskFill (void *rasBase, jubyte *pMask, jint maskOff,
    jint maskScan, jint width, jint height, jint fgColor,
    SurfaceDataRasInfo *pRasInfo, NativePrimitive *pPrim,
    CompositeInfo *pCompInfo)
{
  jint srcA;
  jint srcR, srcG, srcB;
  jint rasScan = pRasInfo->scanStride;
  IntArgbDataType *pRas = (IntArgbDataType *) (rasBase);
  jint DstPix;
  do
  {
    (srcB) = (fgColor) & 0xff;
    (srcG) = ((fgColor) >> 8) & 0xff;
    (srcR) = ((fgColor) >> 16) & 0xff;
    (srcA) = ((fgColor) >> 24) & 0xff;
  }
  while (0);
  if (srcA == 0)
  {
    srcR = srcG = srcB = 0;
    fgColor = 0;
  }
  else
  {
    if (!(0))
    {
      fgColor = (srcA << 24) | (fgColor & 0x00ffffff);
      ;
    }
    if (srcA != 0xff)
    {
      do
      {
        srcR = mul8table[srcA][srcR];
        srcG = mul8table[srcA][srcG];
        srcB = mul8table[srcA][srcB];
      }
      while (0);
    }
    if (0)
    {
      ;
    }
  }
  DstPix = 0;
  ;
  rasScan -= width * 4;
  if (pMask)
  {
    pMask += maskOff;
    maskScan -= width;
    do
    {
      jint w = width;
      ;
      do
      {
        jint resA;
        jint resR, resG, resB;
        jint dstF;
        jint pathA = *pMask++;
        if (pathA > 0)
        {
          if (pathA == 0xff)
          {
            (pRas)[0] = (fgColor);
          }
          else
          {
            ;
            dstF = 0xff - pathA;
            do
            {
              DstPix = (pRas)[0];
              resA = ((juint) DstPix) >> 24;
            }
            while (0);
            resA = mul8table[dstF][resA];
            if (!(0))
            {
              dstF = resA;
            }
            resA += mul8table[pathA][srcA];
            do
            {
              resR = (DstPix >> 16) & 0xff;
              resG = (DstPix >> 8) & 0xff;
              resB = (DstPix >> 0) & 0xff;
            }
            while (0);
            do
            {
              resR = mul8table[dstF][resR] + mul8table[pathA][srcR];
              resG = mul8table[dstF][resG] + mul8table[pathA][srcG];
              resB = mul8table[dstF][resB] + mul8table[pathA][srcB];
            }
            while (0);
            if (!(0) && resA && resA < 0xff)
            {
              do
              {
                resR = div8table[resA][resR];
                resG = div8table[resA][resG];
                resB = div8table[resA][resB];
              }
              while (0);
            }
            (pRas)[0] = (((((((resA) << 8) | (resR)) << 8) | (resG)) << 8) | (resB));
          }
        }
        pRas = ((void *) (((intptr_t) (pRas)) + (4)));
        ;
      }
      while (--w > 0);
      pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
      ;
      pMask = ((void *) (((intptr_t) (pMask)) + (maskScan)));
    }
    while (--height > 0);
  }
  else
  {
    do
    {
      jint w = width;
      ;
      do
      {
        (pRas)[0] = (fgColor);
        pRas = ((void *) (((intptr_t) (pRas)) + (4)));
        ;
      }
      while (--w > 0);
      pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
      ;
    }
    while (--height > 0);
  }
}

It seems that the alpha blending macros are quite complex and cannot be
vectorized:

Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad inner-loop form.
IntArgb.c:109: note: not vectorized: Bad inner loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: failed: evolution of base is not affine.
IntArgb.c:109: note: bad data references.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: Unknown misalignment, is_packed = 0
IntArgb.c:109: note: virtual phi. skip.
IntArgb.c:109: note: not vectorized: value used after loop.
IntArgb.c:109: note: bad operation or unsupported loop bound.
IntArgb.c:109: note: vectorized 0 loops in function.
IntArgb.c:109: note: not consecutive access rasScan_26 = pRasInfo_25(D)->scanStride;
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _40 = mul8table[srcA_36][srcB_33];
IntArgb.c:109: note: not consecutive access _42 = mul8table[srcA_36][srcB_31];
IntArgb.c:109: note: not consecutive access _44 = mul8table[srcA_36][srcB_29];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *pMask_1
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _65 = mul8table[dstF_60][resA_64];
IntArgb.c:109: note: not consecutive access _67 = mul8table[pathA_58][srcA_36];
IntArgb.c:109: note: not consecutive access _75 = mul8table[dstF_66][resR_71];
IntArgb.c:109: note: not consecutive access _77 = mul8table[pathA_58][srcB_6];
IntArgb.c:109: note: not consecutive access _80 = mul8table[dstF_66][resG_73];
IntArgb.c:109: note: not consecutive access _82 = mul8table[pathA_58][srcB_7];
IntArgb.c:109: note: not consecutive access _85 = mul8table[dstF_66][resB_74];
IntArgb.c:109: note: not consecutive access _87 = mul8table[pathA_58][srcB_8];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: Unknown alignment for access: div8table
IntArgb.c:109: note: not consecutive access _93 = div8table[resA_69][resR_79];
IntArgb.c:109: note: not consecutive access _95 = div8table[resA_69][resG_84];
IntArgb.c:109: note: not consecutive access _97 = div8table[resA_69][resB_89];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.

Any idea how to make such code faster, or to make it work with
vectorization?

Finally, I noticed that the macros with the LCD suffix seem to perform
proper gamma correction:

void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef *glyphs,
    jint totalGlyphs, jint fgpixel, jint argbcolor, jint clipLeft,
    jint clipTop, jint clipRight, jint clipBottom, jint rgbOrder,
    unsigned char *gammaLut, unsigned char * invGammaLut,
    NativePrimitive *pPrim, CompositeInfo *pCompInfo)
...
  srcR = invGammaLut[srcR];
  srcG = invGammaLut[srcG];
  srcB = invGammaLut[srcB];
...
  alpha blending
...
  dstR = gammaLut[dstR];
  dstG = gammaLut[dstG];
  dstB = gammaLut[dstB];

That's exactly what I want in order to implement correct gamma correction
in mask fill operations (shape draw / fill) for software loops (buffered
image rendering).

I will now try to figure out how that C code is generated by the nested
macros!

Laurent

From Sergey.Bylokhov at oracle.com  Fri Jan 15 21:49:22 2016
From: Sergey.Bylokhov at oracle.com (Sergey Bylokhov)
Date: Sat, 16 Jan 2016 00:49:22 +0300
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To:
References: <5627E1B0.4060206@oracle.com> <5658396E.2090605@oracle.com> <56683DF8.1020607@oracle.com>
Message-ID: <56996962.2090304@oracle.com>

Hi,

I found that, when it comes to vectorization, one of the main hotspots is
our table lookup pattern (mul8table/div8table), which cannot be vectorized.
Another hotspot is the many conditions inside the main loops.

--
Best regards, Sergey.

From james.graham at oracle.com  Fri Jan 15 22:34:19 2016
From: james.graham at oracle.com (Jim Graham)
Date: Fri, 15 Jan 2016 14:34:19 -0800
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To: <56996962.2090304@oracle.com>
References: <5627E1B0.4060206@oracle.com> <5658396E.2090605@oracle.com> <56683DF8.1020607@oracle.com> <56996962.2090304@oracle.com>
Message-ID: <569973EB.3050509@oracle.com>

The lookups were written in 1997-ish when processors had different
vectorization/computation tradeoffs. It might be interesting to
investigate a non-table version of the macros and see how the performance
differs...

...jim

On 1/15/16 1:49 PM, Sergey Bylokhov wrote:
> Hi,
>
> I found that, when it comes to vectorization, one of the main hotspots is
> our table lookup pattern (mul8table/div8table), which cannot be vectorized.
> Another hotspot is the many conditions inside the main loops.

From bourges.laurent at gmail.com  Thu Jan 21 08:36:52 2016
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Thu, 21 Jan 2016 09:36:52 +0100
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To: <569CE936.6010208@oracle.com>
References: <5627E1B0.4060206@oracle.com> <5658396E.2090605@oracle.com> <56683DF8.1020607@oracle.com> <56996962.2090304@oracle.com> <569CE936.6010208@oracle.com>
Message-ID:

Sergey,

>> So it looks like scalar operations on a vector (4), i.e. vectorization
>> should be applicable.
>
> yes, I think so.

I googled a bit and it seems tricky to implement alpha blending with SSE2,
but many projects succeeded by writing SSE2 primitives directly!
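As a rough illustration of the non-table idea raised earlier in the thread, here is a minimal sketch in plain C. It assumes that mul8table[a][b] holds round(a * b / 255), which is not confirmed here, and it only covers a simplified premultiplied blend (no div8table step), so it is not a drop-in replacement for the macros. The names mul8, mul8_ref, count_mul8_mismatches and blend_pre are hypothetical.

```c
#include <stdint.h>

/* Table-free, branch-free round(a * b / 255) for a, b in [0, 255].
 * Uses one multiply, two shifts and two adds instead of a lookup. */
static inline uint32_t mul8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Reference: round-to-nearest using a real division.
 * Ties cannot occur because 255 is odd. */
static inline uint32_t mul8_ref(uint32_t a, uint32_t b)
{
    return (2 * a * b + 255) / 510;
}

/* Hypothetical self-check: number of (a, b) pairs where both disagree. */
static int count_mul8_mismatches(void)
{
    int mismatches = 0;
    for (uint32_t a = 0; a < 256; a++) {
        for (uint32_t b = 0; b < 256; b++) {
            if (mul8(a, b) != mul8_ref(a, b)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}

/* Simplified blend of two premultiplied ARGB pixels by coverage pathA:
 * res = dst * (255 - pathA) / 255 + src * pathA / 255, per channel.
 * No branches on pathA: 0 yields dst and 255 yields src by arithmetic. */
static inline uint32_t blend_pre(uint32_t dst, uint32_t src, uint32_t pathA)
{
    uint32_t dstF = 0xff - pathA;
    uint32_t res = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t d = (dst >> shift) & 0xff;
        uint32_t s = (src >> shift) & 0xff;
        res |= (mul8(dstF, d) + mul8(pathA, s)) << shift;
    }
    return res;
}
```

Whether a given compiler vectorizes a row loop built on blend_pre would still need to be checked with the vectorizer report; the sketch only removes the lookups and the pathA branches.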
>> Maybe the conditions (pathA > 0) && (pathA < 0xff) are a bigger penalty
>> as they can not be easily predicted (but may happen often).
>> Sometimes it is faster to perform useless math operations without
>> branching (gpu approach).
>>
>> Do you have other ideas to make it faster ? as it represents 30% of the
>> ellipse fill test (huge ellipses).
>> I noticed that larger tiles (64x64) are a bit faster (larger tile width
>> / height, less jni calls)
>
> I just commented out some of the code inside this method and checked the
> performance. It seems that simple code like
> inloop->readBytes->decodeRGB->encodeBytes->saveBytes
> is quite fast. But if some branches/multiplications are added after
> decodeRGB, then the code becomes really slow (10x slower on my system).
> This is expected because we perform a huge number of multiplications, but
> if I try to do the same math standalone (without byte decoding), then the
> result is also fast. So it seems that we are slow because of the mixing
> of byte reading/branches/multiplication.

For RGBA, it seems possible to:
- compute A+G and R+B together (2x16 bits) to double the throughput
- use bit shifts instead of mul / div

Could you try implementing such variants?

>> Should I try (as I did in the past) to implement the MaskFill in Java to
>> benefit from hotspot optimizations (like Marlin) ?
>
> It will be interesting. I remember that someone already tried to do the
> same, but I do not remember the result. Probably Jim can suggest
> something.

I implemented alpha blending in Java last year (using a custom composite
operator hack):
http://mail.openjdk.java.net/pipermail/2d-dev/2014-August/004751.html

I could try optimizing my Java implementation soon...

Cheers,
Laurent
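The two variants proposed above (A+G and R+B computed together in 2x16-bit lanes, shifts instead of mul / div tables) can be sketched in plain C as follows. This is only an illustration under stated assumptions: pixels are 32-bit ARGB, the blend is the simplified premultiplied form (no div8table step), and all names (RB_MASK, RB_HALF, mul8_x2, mul8_argb, blend_pre_packed, count_packed_mismatches) are hypothetical, not taken from the JDK sources.

```c
#include <stdint.h>

#define RB_MASK 0x00ff00ffu  /* two 8-bit channels, each in a 16-bit lane */
#define RB_HALF 0x00800080u  /* rounding term (128) for both lanes */

/* Multiply two channels held in bits 0-7 and 16-23 by a in [0, 255],
 * each rounded to round(c * a / 255). One 32-bit multiply serves both
 * lanes; neither lane overflows since 255 * 255 + 128 + 254 < 65536. */
static inline uint32_t mul8_x2(uint32_t rb, uint32_t a)
{
    uint32_t t = (rb & RB_MASK) * a + RB_HALF;
    return ((t + ((t >> 8) & RB_MASK)) >> 8) & RB_MASK;
}

/* Scale all four channels of an ARGB pixel by a: R+B in one pass,
 * A+G (shifted down by 8) in the other. */
static inline uint32_t mul8_argb(uint32_t pixel, uint32_t a)
{
    uint32_t rb = mul8_x2(pixel, a);
    uint32_t ag = mul8_x2(pixel >> 8, a);
    return rb | (ag << 8);
}

/* Simplified premultiplied blend by coverage pathA, without branches.
 * Per-channel sums stay <= 255, so no carry crosses channel borders. */
static inline uint32_t blend_pre_packed(uint32_t dst, uint32_t src,
                                        uint32_t pathA)
{
    return mul8_argb(dst, 0xff - pathA) + mul8_argb(src, pathA);
}

/* Hypothetical self-check against a per-channel rounded division. */
static int count_packed_mismatches(void)
{
    int mismatches = 0;
    for (uint32_t a = 0; a < 256; a++) {
        for (uint32_t c = 0; c < 256; c++) {
            uint32_t pixel = (c << 24) | ((255 - c) << 16)
                           | ((c ^ 0x55) << 8) | ((c * 7) & 0xff);
            uint32_t expected = 0;
            for (int shift = 0; shift < 32; shift += 8) {
                uint32_t ch = (pixel >> shift) & 0xff;
                expected |= ((2 * ch * a + 255) / 510) << shift;
            }
            if (mul8_argb(pixel, a) != expected) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

The packed form halves the number of multiplies per pixel compared with four per-channel operations; whether that actually beats the table lookups on a given CPU would have to be measured.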