[OpenJDK Rasterizer] AWT & gcc 4.8 optimization options

Sergey Bylokhov Sergey.Bylokhov at oracle.com
Fri Jan 15 21:49:22 UTC 2016


Hi,

I found that, as far as vectorization is concerned, one of the main 
hotspots is our table lookup pattern: mul8table/div8table, which cannot 
be vectorized. Another hotspot is the many conditions inside the main loops.
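
To illustrate the lookup problem, here is a minimal sketch of replacing 
the multiply table with plain arithmetic that a vectorizer can in 
principle handle. It assumes that a mul8table entry is meant to hold 
round(a * b / 255); I have not compared it entry by entry against the 
real table, so treat that as an assumption. The names mul8_ref, 
mul8_arith and mul8_self_check are made up for this example. div8table 
is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Reference value: round(a * b / 255) for 8-bit a and b.
 * A tie can never occur (2*a*b is even, 255 is odd), so the
 * rounding direction does not matter. */
uint8_t mul8_ref(uint32_t a, uint32_t b)
{
    return (uint8_t)((2 * a * b + 255) / 510);
}

/* Same value using one multiply, two adds and two shifts:
 * no memory access, so nothing for the vectorizer to gather. */
uint8_t mul8_arith(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Exhaustive comparison over all 256 x 256 input pairs. */
void mul8_self_check(void)
{
    uint32_t a, b;
    for (a = 0; a < 256; a++) {
        for (b = 0; b < 256; b++) {
            assert(mul8_arith(a, b) == mul8_ref(a, b));
        }
    }
}
```

Calling mul8_self_check() once at startup (or from a unit test) is 
enough to confirm that the two functions agree for all 8-bit inputs.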

On 15/01/16 20:14, Laurent Bourgès wrote:
> Sergey,
>
> Did you make any progress?
>
> I finally looked at the preprocessed C code and also enabled
> ftree-vectorizer-verbose output:
>      CFLAGS := -save-temps -ftree-vectorize -ftree-vectorizer-verbose=2
> $(CFLAGS_JDKLIB) $(LIBAWT_CFLAGS), \
>
>
> I looked at the IntArgbPreSrcMaskFill hotspot (in my EllipseFillTest)
> according to oprofile:
> samples  %        image name               symbol name
> 469141   30.0043  libawt.so                IntArgbPreSrcMaskFill
>
>
> Here is the preprocessed C code:
> - It is still hard to read, as macro expansion leaves many
> do { } while (0) blocks behind...
>
> void IntArgbSrcMaskFill (void *rasBase, jubyte *pMask, jint maskOff,
> jint maskScan, jint width, jint height, jint fgColor, SurfaceDataRasInfo
> *pRasInfo, NativePrimitive *pPrim, CompositeInfo *pCompInfo)
> {
>      jint srcA;
>      jint srcR, srcG, srcB;
>      jint rasScan = pRasInfo->scanStride;
>      IntArgbDataType *pRas = (IntArgbDataType *) (rasBase);
>      jint DstPix;
>      do
>      {
>          (srcB) = (fgColor) & 0xff;
>          (srcG) = ((fgColor) >> 8) & 0xff;
>          (srcR) = ((fgColor) >> 16) & 0xff;
>          (srcA) = ((fgColor) >> 24) & 0xff;
>      }
>      while (0);
>      if (srcA == 0)
>      {
>          srcR = srcG = srcB = 0;
>          fgColor = 0;
>      }
>      else
>      {
>          if (!(0))
>          {
>              fgColor = (srcA << 24) | (fgColor & 0x00ffffff);
>              ;
>          }
>          if (srcA != 0xff)
>          {
>              do
>              {
>                  srcR = mul8table[srcA][srcR];
>                  srcG = mul8table[srcA][srcG];
>                  srcB = mul8table[srcA][srcB];
>              }
>              while (0);
>          }
>          if (0)
>          {
>              ;
>          }
>      }
>      DstPix = 0;
>      ;
>      rasScan -= width * 4;
>      if (pMask)
>      {
>          pMask += maskOff;
>          maskScan -= width;
>          do
>          {
>              jint w = width;
>              ;
>              do
>              {
>                  jint resA;
>                  jint resR, resG, resB;
>                  jint dstF;
>                  jint pathA = *pMask++;
>                  if (pathA > 0)
>                  {
>                      if (pathA == 0xff)
>                      {
>                          (pRas)[0] = (fgColor);
>                      }
>                      else
>                      {
>                          ;
>                          dstF = 0xff - pathA;
>                          do
>                          {
>                              DstPix = (pRas)[0];
>                              resA = ((juint) DstPix) >> 24;
>                          }
>                          while (0);
>                          resA = mul8table[dstF][resA];
>                          if (!(0))
>                          {
>                              dstF = resA;
>                          }
>                          resA += mul8table[pathA][srcA];
>                          do
>                          {
>                              resR = (DstPix >> 16) & 0xff;
>                              resG = (DstPix >> 8) & 0xff;
>                              resB = (DstPix >> 0) & 0xff;
>                          }
>                          while (0);
>                          do
>                          {
>                              resR = mul8table[dstF][resR] +
> mul8table[pathA][srcR];
>                              resG = mul8table[dstF][resG] +
> mul8table[pathA][srcG];
>                              resB = mul8table[dstF][resB] +
> mul8table[pathA][srcB];
>                          }
>                          while (0);
>                          if (!(0) && resA && resA < 0xff)
>                          {
>                              do
>                              {
>                                  resR = div8table[resA][resR];
>                                  resG = div8table[resA][resG];
>                                  resB = div8table[resA][resB];
>                              }
>                              while (0);
>                          }
>                          (pRas)[0] = (((((((resA) << 8) | (resR)) << 8)
> | (resG)) << 8) | (resB));
>                      }
>                  }
>                  pRas = ((void *) (((intptr_t) (pRas)) + (4)));
>                  ;
>              }
>              while (--w > 0);
>              pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
>              ;
>              pMask = ((void *) (((intptr_t) (pMask)) + (maskScan)));
>          }
>          while (--height > 0);
>      }
>      else
>      {
>          do
>          {
>              jint w = width;
>              ;
>              do
>              {
>                  (pRas)[0] = (fgColor);
>                  pRas = ((void *) (((intptr_t) (pRas)) + (4)));
>                  ;
>              }
>              while (--w > 0);
>              pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
>              ;
>          }
>          while (--height > 0);
>      }
> }
>
> It seems that the alpha blending macros are quite complex and cannot be
> vectorized:
>
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: not vectorized: control flow in loop.
> IntArgb.c:109: note: bad inner-loop form.
> IntArgb.c:109: note: not vectorized: Bad inner loop.
> IntArgb.c:109: note: bad loop form.
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: not vectorized: control flow in loop.
> IntArgb.c:109: note: bad loop form.
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: failed: evolution of base is not affine.
> IntArgb.c:109: note: bad data references.
> Analyzing loop at IntArgb.c:109
> IntArgb.c:109: note: Unknown misalignment, is_packed = 0
> IntArgb.c:109: note: virtual phi. skip.
> IntArgb.c:109: note: not vectorized: value used after loop.
> IntArgb.c:109: note: bad operation or unsupported loop bound.
> IntArgb.c:109: note: vectorized 0 loops in function.
> IntArgb.c:109: note: not consecutive access rasScan_26 =
> pRasInfo_25(D)->scanStride;
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: Unknown alignment for access: mul8table
> IntArgb.c:109: note: not consecutive access _40 =
> mul8table[srcA_36][srcB_33];
> IntArgb.c:109: note: not consecutive access _42 =
> mul8table[srcA_36][srcB_31];
> IntArgb.c:109: note: not consecutive access _44 =
> mul8table[srcA_36][srcB_29];
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *pMask_1
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
> IntArgb.c:109: note: Unknown alignment for access: mul8table
> IntArgb.c:109: note: not consecutive access _65 =
> mul8table[dstF_60][resA_64];
> IntArgb.c:109: note: not consecutive access _67 =
> mul8table[pathA_58][srcA_36];
> IntArgb.c:109: note: not consecutive access _75 =
> mul8table[dstF_66][resR_71];
> IntArgb.c:109: note: not consecutive access _77 =
> mul8table[pathA_58][srcB_6];
> IntArgb.c:109: note: not consecutive access _80 =
> mul8table[dstF_66][resG_73];
> IntArgb.c:109: note: not consecutive access _82 =
> mul8table[pathA_58][srcB_7];
> IntArgb.c:109: note: not consecutive access _85 =
> mul8table[dstF_66][resB_74];
> IntArgb.c:109: note: not consecutive access _87 =
> mul8table[pathA_58][srcB_8];
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: Unknown alignment for access: div8table
> IntArgb.c:109: note: not consecutive access _93 =
> div8table[resA_69][resR_79];
> IntArgb.c:109: note: not consecutive access _95 =
> div8table[resA_69][resG_84];
> IntArgb.c:109: note: not consecutive access _97 =
> div8table[resA_69][resB_89];
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
> IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
> IntArgb.c:109: note: Failed to SLP the basic block.
> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
> basic block.
> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
>
>
> Any ideas on how to make such code faster, or to make it vectorizable?
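
One direction that might help with the "control flow in loop" notes is 
to drop the pathA == 0 and pathA == 0xff special cases and compute every 
pixel the same way. The sketch below is only an illustration under 
stated assumptions: the destination is premultiplied ARGB (so no 
div8table step is needed), fg_pre is already premultiplied, and the 
multiply is done arithmetically as round(a * b / 255). The function name 
src_mask_fill_row_pre is hypothetical, and I have not checked whether 
gcc 4.8 actually vectorizes this loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* round(a * b / 255) for 8-bit inputs, without a table lookup. */
static inline uint32_t mul8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Src-rule mask fill of one row: res = pathA * src + (255 - pathA) * dst.
 * pathA == 0 leaves dst unchanged and pathA == 0xff stores the colour,
 * both as a consequence of the arithmetic, so the loop body has no branches. */
void src_mask_fill_row_pre(uint32_t *dst, const uint8_t *mask,
                           size_t width, uint32_t fg_pre)
{
    size_t i;
    const uint32_t srcA = fg_pre >> 24;
    const uint32_t srcR = (fg_pre >> 16) & 0xff;
    const uint32_t srcG = (fg_pre >> 8) & 0xff;
    const uint32_t srcB = fg_pre & 0xff;

    assert(dst != NULL && mask != NULL);

    for (i = 0; i < width; i++) {
        uint32_t pathA = mask[i];
        uint32_t dstF = 0xff - pathA;
        uint32_t d = dst[i];
        uint32_t a = mul8(dstF, d >> 24) + mul8(pathA, srcA);
        uint32_t r = mul8(dstF, (d >> 16) & 0xff) + mul8(pathA, srcR);
        uint32_t g = mul8(dstF, (d >> 8) & 0xff) + mul8(pathA, srcG);
        uint32_t b = mul8(dstF, d & 0xff) + mul8(pathA, srcB);
        dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

The trade-off is that fully transparent and fully opaque mask values now 
pay for the whole computation, so this only wins if the loop really does 
get vectorized or if the mask is mostly partial coverage.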
>
>
> Finally, I noticed that the macros with the LCD suffix seem to perform
> proper gamma correction:
>
> void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef
> *glyphs, jint totalGlyphs, jint fgpixel, jint argbcolor, jint clipLeft,
> jint clipTop, jint clipRight, jint clipBottom, jint rgbOrder, unsigned
> char *gammaLut, unsigned char * invGammaLut, NativePrimitive *pPrim,
> CompositeInfo *pCompInfo)
> ...
>      srcR = invGammaLut[srcR];
>      srcG = invGammaLut[srcG];
>      srcB = invGammaLut[srcB];
> ...
> alpha blending
> ...
>      dstR = gammaLut[dstR];
>      dstG = gammaLut[dstG];
>      dstB = gammaLut[dstB];
>
> That's exactly what I need in order to implement correct gamma
> correction in mask fill operations (shape draw / fill) for software
> loops (buffered image rendering).
>
> I will now try to figure out how that C code is generated by the nested
> macros!
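
The pattern quoted above (inverse gamma lookup, blend, gamma lookup) 
could be sketched for a single colour component as below. This is not 
the code generated by the LCD macros. Converting the destination through 
invGammaLut before blending is my assumption, since the quoted excerpt 
elides that part. The helper init_gamma2_luts is hypothetical and uses a 
fixed gamma of 2.0 only so that the tables can be built with integer 
arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* round(a * b / 255) for 8-bit inputs. */
static inline uint32_t mul8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* Blend one component in linear space, then convert back. */
uint8_t blend_component_gamma(uint8_t src, uint8_t dst, uint8_t pathA,
                              const uint8_t *invGammaLut,
                              const uint8_t *gammaLut)
{
    uint32_t s, d, res;

    assert(invGammaLut != 0 && gammaLut != 0);

    s = invGammaLut[src];
    d = invGammaLut[dst];   /* assumption: dst is linearized too */
    res = mul8(pathA, s) + mul8(0xffu - pathA, d);
    return gammaLut[res];
}

/* Hypothetical helper: tables for gamma = 2.0, integer math only.
 * invGammaLut[i] = round(i^2 / 255), gammaLut[i] = round(sqrt(i * 255)). */
void init_gamma2_luts(uint8_t *invGammaLut, uint8_t *gammaLut)
{
    uint32_t i;
    for (i = 0; i < 256; i++) {
        uint32_t n = i * 255;
        uint32_t r = 0;
        invGammaLut[i] = (uint8_t)((i * i + 127) / 255);
        while ((r + 1) * (r + 1) <= n) {
            r++;                      /* r = floor(sqrt(n)) */
        }
        if (n - r * r > r) {
            r++;                      /* round to nearest */
        }
        gammaLut[i] = (uint8_t)r;
    }
}
```

With these tables, black drawn over white at roughly 50% coverage comes 
out lighter than the plain 8-bit average, which is the visible effect of 
blending in linear space.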
>
> Laurent


-- 
Best regards, Sergey.

