[OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
Sergey Bylokhov
Sergey.Bylokhov at oracle.com
Fri Jan 15 21:49:22 UTC 2016
Hi,
I found that, as far as vectorization is concerned, one of the main hotspots
is our table lookup pattern: the mul8table/div8table lookups cannot be
vectorized. Another hotspot is the many conditions inside the main loops.
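
A minimal sketch of one way around the lookup, assuming mul8table[a][b] holds
round(a * b / 255) (an assumption, not checked here against the table
initialization code): the same value can be computed with one multiply, two
shifts and two adds, which a vectorizer can handle. The names mul8_arith,
mul8_reference and mul8_self_check are hypothetical. div8table is not covered;
it would need a real division or a reciprocal.

```c
#include <assert.h>
#include <stdint.h>

/* Table-free round(a * b / 255) for a, b in 0..255. */
static inline uint32_t mul8_arith(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Value the table would hold under the assumption stated above:
 * a * b / 255 rounded to the nearest integer. */
static inline uint32_t mul8_reference(uint32_t a, uint32_t b)
{
    return (2 * a * b + 255) / 510;
}

/* Compare both versions over all 65536 input pairs. */
void mul8_self_check(void)
{
    for (uint32_t a = 0; a < 256; a++) {
        for (uint32_t b = 0; b < 256; b++) {
            assert(mul8_arith(a, b) == mul8_reference(a, b));
        }
    }
}
```

Whether this is actually faster than the 64 KB table depends on the target and
on whether the surrounding loop gets vectorized; it has not been measured.
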
On 15/01/16 20:14, Laurent Bourgès wrote:
> Sergey,
>
> Did you make any progress?
>
> I finally looked at the preprocessed C code and also enabled
> ftree-vectorizer-verbose output:
> CFLAGS := -save-temps -ftree-vectorize -ftree-vectorizer-verbose=2
> $(CFLAGS_JDKLIB) $(LIBAWT_CFLAGS), \
>
>
> I looked at the IntArgbPreSrcMaskFill hotspot (in my EllipseFillTest)
> according to oprofile:
> samples   %        image name   symbol name
> 469141    30.0043  libawt.so    IntArgbPreSrcMaskFill
>
>
> Here is the preprocessed C code:
> - It is still complex to read as there are many do { } while (0) blocks
> due to macro expansion...
>
> void IntArgbSrcMaskFill (void *rasBase, jubyte *pMask, jint maskOff,
> jint maskScan, jint width, jint height, jint fgColor, SurfaceDataRasInfo
> *pRasInfo, NativePrimitive *pPrim, CompositeInfo *pCompInfo)
> {
> jint srcA;
> jint srcR, srcG, srcB;
> jint rasScan = pRasInfo->scanStride;
> IntArgbDataType *pRas = (IntArgbDataType *) (rasBase);
> jint DstPix;
> do
> {
> (srcB) = (fgColor) & 0xff;
> (srcG) = ((fgColor) >> 8) & 0xff;
> (srcR) = ((fgColor) >> 16) & 0xff;
> (srcA) = ((fgColor) >> 24) & 0xff;
> }
> while (0);
> if (srcA == 0)
> {
> srcR = srcG = srcB = 0;
> fgColor = 0;
> }
> else
> {
> if (!(0))
> {
> fgColor = (srcA << 24) | (fgColor & 0x00ffffff);
> ;
> }
> if (srcA != 0xff)
> {
> do
> {
> srcR = mul8table[srcA][srcR];
> srcG = mul8table[srcA][srcG];
> srcB = mul8table[srcA][srcB];
> }
> while (0);
> }
> if (0)
> {
> ;
> }
> }
> DstPix = 0;
> ;
> rasScan -= width * 4;
> if (pMask)
> {
> pMask += maskOff;
> maskScan -= width;
> do
> {
> jint w = width;
> ;
> do
> {
> jint resA;
> jint resR, resG, resB;
> jint dstF;
> jint pathA = *pMask++;
> if (pathA > 0)
> {
> if (pathA == 0xff)
> {
> (pRas)[0] = (fgColor);
> }
> else
> {
> ;
> dstF = 0xff - pathA;
> do
> {
> DstPix = (pRas)[0];
> resA = ((juint) DstPix) >> 24;
> }
> while (0);
> resA = mul8table[dstF][resA];
> if (!(0))
> {
> dstF = resA;
> }
> resA += mul8table[pathA][srcA];
> do
> {
> resR = (DstPix >> 16) & 0xff;
> resG = (DstPix >> 8) & 0xff;
> resB = (DstPix >> 0) & 0xff;
> }
> while (0);
> do
> {
> resR = mul8table[dstF][resR] +
> mul8table[pathA][srcR];
> resG = mul8table[dstF][resG] +
> mul8table[pathA][srcG];
> resB = mul8table[dstF][resB] +
> mul8table[pathA][srcB];
> }
> while (0);
> if (!(0) && resA && resA < 0xff)
> {
> do
> {
> resR = div8table[resA][resR];
> resG = div8table[resA][resG];
> resB = div8table[resA][resB];
> }
> while (0);
> }
> (pRas)[0] = (((((((resA) << 8) | (resR)) << 8)
> | (resG)) << 8) | (resB));
> }
> }
> pRas = ((void *) (((intptr_t) (pRas)) + (4)));
> ;
> }
> while (--w > 0);
> pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
> ;
> pMask = ((void *) (((intptr_t) (pMask)) + (maskScan)));
> }
> while (--height > 0);
> }
> else
> {
> do
> {
> jint w = width;
> ;
> do
> {
> (pRas)[0] = (fgColor);
> pRas = ((void *) (((intptr_t) (pRas)) + (4)));
> ;
> }
> while (--w > 0);
> pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
> ;
> }
> while (--height > 0);
> }
> }
>
> It seems that the alpha blending macros are quite complex and cannot be
> vectorized:
>
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: not vectorized: control flow in loop.
> IntArgb.c:109: note: bad inner-loop form.
> IntArgb.c:109: note: not vectorized: Bad inner loop.
> IntArgb.c:109: note: bad loop form.
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: not vectorized: control flow in loop.
> IntArgb.c:109: note: bad loop form.
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: failed: evolution of base is not affine.
> IntArgb.c:109: note: bad data references.
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: Unknown misalignment, is_packed = 0
> IntArgb.c:109: note: virtual phi. skip.
> IntArgb.c:109: note: not vectorized: value used after loop.
> IntArgb.c:109: note: bad operation or unsupported loop bound.
> IntArgb.c:109: note: vectorized 0 loops in function.
> IntArgb.c:109: note: not consecutive access rasScan_26 =
> pRasInfo_25(D)->scanStride;
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: Unknown alignment for access: mul8table
> IntArgb.c:109: note: not consecutive access _40 =
> mul8table[srcA_36][srcB_33];
> IntArgb.c:109: note: not consecutive access _42 =
> mul8table[srcA_36][srcB_31];
> IntArgb.c:109: note: not consecutive access _44 =
> mul8table[srcA_36][srcB_29];
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *pMask_1
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
> IntArgb.c:109: note: Unknown alignment for access: mul8table
> IntArgb.c:109: note: not consecutive access _65 =
> mul8table[dstF_60][resA_64];
> IntArgb.c:109: note: not consecutive access _67 =
> mul8table[pathA_58][srcA_36];
> IntArgb.c:109: note: not consecutive access _75 =
> mul8table[dstF_66][resR_71];
> IntArgb.c:109: note: not consecutive access _77 =
> mul8table[pathA_58][srcB_6];
> IntArgb.c:109: note: not consecutive access _80 =
> mul8table[dstF_66][resG_73];
> IntArgb.c:109: note: not consecutive access _82 =
> mul8table[pathA_58][srcB_7];
> IntArgb.c:109: note: not consecutive access _85 =
> mul8table[dstF_66][resB_74];
> IntArgb.c:109: note: not consecutive access _87 =
> mul8table[pathA_58][srcB_8];
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: Unknown alignment for access: div8table
> IntArgb.c:109: note: not consecutive access _93 =
> div8table[resA_69][resR_79];
> IntArgb.c:109: note: not consecutive access _95 =
> div8table[resA_69][resG_84];
> IntArgb.c:109: note: not consecutive access _97 =
> div8table[resA_69][resB_89];
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
>
>
> Any ideas on how to make such code faster, or how to make it vectorizable?
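
One thing that might be worth trying, as a sketch only and not measured: for a
premultiplied destination the two tests on pathA look like shortcuts rather
than requirements. With pathA == 0 the blend returns the destination and with
pathA == 0xff it returns the source, assuming mul8table[a][b] is
round(a * b / 255). Dropping the branches and computing the product
arithmetically leaves a straight-line loop body with consecutive accesses. The
helper name src_mask_fill_row_pre and its signature are hypothetical, and the
per-channel formula is an assumption about the premultiplied loop, not the
actual macro expansion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* round(a * b / 255) for a, b in 0..255, computed without a table. */
static inline uint32_t mul8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Hypothetical helper: blend one row of premultiplied ARGB pixels with a
 * premultiplied foreground color, weighted by an 8-bit coverage mask.
 * Every pixel takes the same path, so the loop has no control flow. */
void src_mask_fill_row_pre(uint32_t *restrict ras,
                           const uint8_t *restrict mask,
                           size_t width, uint32_t fg_pre)
{
    assert(ras != NULL && mask != NULL);

    const uint32_t srcA = fg_pre >> 24;
    const uint32_t srcR = (fg_pre >> 16) & 0xff;
    const uint32_t srcG = (fg_pre >> 8) & 0xff;
    const uint32_t srcB = fg_pre & 0xff;

    for (size_t i = 0; i < width; i++) {
        uint32_t pathA = mask[i];
        uint32_t dstF = 0xff - pathA;
        uint32_t d = ras[i];

        /* Each sum stays within 0..255 because dstF + pathA == 255. */
        uint32_t a = mul8(dstF, d >> 24)          + mul8(pathA, srcA);
        uint32_t r = mul8(dstF, (d >> 16) & 0xff) + mul8(pathA, srcR);
        uint32_t g = mul8(dstF, (d >> 8) & 0xff)  + mul8(pathA, srcG);
        uint32_t b = mul8(dstF, d & 0xff)         + mul8(pathA, srcB);

        ras[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

The non-premultiplied variant quoted above also divides by the result alpha
(div8table), so it would not reduce to this form without further work.
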
>
>
> Finally, I noticed that the macros with the Lcd suffix seem to perform
> proper gamma correction:
>
> void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef
> *glyphs, jint totalGlyphs, jint fgpixel, jint argbcolor, jint clipLeft,
> jint clipTop, jint clipRight, jint clipBottom, jint rgbOrder, unsigned
> char *gammaLut, unsigned char * invGammaLut, NativePrimitive *pPrim,
> CompositeInfo *pCompInfo)
> ...
> srcR = invGammaLut[srcR];
> srcG = invGammaLut[srcG];
> srcB = invGammaLut[srcB];
> ...
> alpha blending
> ...
> dstR = gammaLut[dstR];
> dstG = gammaLut[dstG];
> dstB = gammaLut[dstB];
>
> That's exactly what I need in order to implement correct gamma correction
> in mask fill operations (shape draw / fill) for the software loops
> (buffered image rendering).
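
To make the idea concrete, here is a rough sketch of a gamma-aware blend of a
single channel using the same two-table pattern. Everything in it is an
assumption for illustration: the helper names build_gamma2_luts and
blend_channel_gamma are hypothetical, the gamma of exactly 2.0 is chosen only
so the tables can be built with integer arithmetic, and converting the
destination through invGammaLut before blending is a guess at the elided part
of the LCD loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* round(a * b / 255) for a, b in 0..255. */
static inline uint32_t mul8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Hypothetical helper: fill both tables for a gamma of exactly 2.0. */
void build_gamma2_luts(uint8_t gammaLut[256], uint8_t invGammaLut[256])
{
    assert(gammaLut != NULL && invGammaLut != NULL);

    for (uint32_t v = 0; v < 256; v++) {
        /* encoded -> linear: 255 * (v / 255)^2 */
        invGammaLut[v] = (uint8_t) mul8(v, v);
    }
    for (uint32_t lin = 0; lin < 256; lin++) {
        /* linear -> encoded: nearest integer to sqrt(lin * 255) */
        uint32_t target = lin * 255;
        uint32_t r = 0;
        while ((r + 1) * (r + 1) <= target) {
            r++;                      /* floor of the square root */
        }
        if (target - r * r > r) {
            r++;                      /* round to nearest */
        }
        gammaLut[lin] = (uint8_t) r;
    }
}

/* Blend one channel in linear space, then re-encode the result. */
uint8_t blend_channel_gamma(uint8_t src, uint8_t dst, uint8_t pathA,
                            const uint8_t *gammaLut,
                            const uint8_t *invGammaLut)
{
    uint32_t s = invGammaLut[src];
    uint32_t d = invGammaLut[dst];
    uint32_t res = mul8(0xffu - pathA, d) + mul8(pathA, s);
    return gammaLut[res];
}
```

With these tables, white over black at a coverage of 128 comes out as 181
instead of the 128 a plain blend would give. Note that 8-bit linear values lose
precision in the dark range, so a real implementation might want wider
intermediate values.
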
>
> I will now try to figure out how that C code is generated by the nested
> macros!
>
> Laurent
--
Best regards, Sergey.