From bourges.laurent at gmail.com  Fri Jan 15 17:14:46 2016
From: bourges.laurent at gmail.com (Laurent Bourgès)
Date: Fri, 15 Jan 2016 18:14:46 +0100
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To: <CAKjRUT4GVH0ND+pHPHgex7izZVehLjip=fV7=PrSG6C2yDXFzA@mail.gmail.com>
References: <CAKjRUT7qmxibs-YhHSyoHs4aCR8N=PvadvqhDKfMWeC0iEF-wQ@mail.gmail.com>
	<CAKjRUT7oYaq7saQd591oOOKP28fJPhrQFxst_X9N=_xGU72T+w@mail.gmail.com>
	<CAKjRUT5Bod6br0M4trREcD=uZkbLXxNio5u89yruWCjgG7Owew@mail.gmail.com>
	<CAKjRUT7BEMJu5HF36xDLv2LNs8xSUF420O92W0Y=RFAjPZy+kg@mail.gmail.com>
	<5627E1B0.4060206@oracle.com>
	<CAKjRUT7yFdMTCgRJV3Fj066CLSzn4_DBiHUC3kBbYtcmfufBYA@mail.gmail.com>
	<5658396E.2090605@oracle.com>
	<CAKjRUT6_Dk-yw0c87nfG1+3RNhtja-ihV+-OsuE7wcAn63Vhvw@mail.gmail.com>
	<56683DF8.1020607@oracle.com>
	<CAKjRUT4GVH0ND+pHPHgex7izZVehLjip=fV7=PrSG6C2yDXFzA@mail.gmail.com>
Message-ID: <CAKjRUT67V6pSEuyYTJ9wUOD7=S7WuqTjwazNbbimE+PYiaB=NA@mail.gmail.com>

Sergey,

Did you make any progress?

I finally looked at the preprocessed C code and also enabled the
-ftree-vectorizer-verbose output:
    CFLAGS := -save-temps -ftree-vectorize -ftree-vectorizer-verbose=2 $(CFLAGS_JDKLIB) $(LIBAWT_CFLAGS), \


I looked at the IntArgbPreSrcMaskFill hotspot (in my EllipseFillTest)
according to oprofile:
samples  %        image name               symbol name
469141   30.0043  libawt.so                IntArgbPreSrcMaskFill


Here is the preprocessed C code. It is still hard to read because macro
expansion leaves many do { } while (0) blocks:

void IntArgbSrcMaskFill (void *rasBase, jubyte *pMask, jint maskOff,
                         jint maskScan, jint width, jint height, jint fgColor,
                         SurfaceDataRasInfo *pRasInfo, NativePrimitive *pPrim,
                         CompositeInfo *pCompInfo)
{
    jint srcA;
    jint srcR, srcG, srcB;
    jint rasScan = pRasInfo->scanStride;
    IntArgbDataType *pRas = (IntArgbDataType *) (rasBase);
    jint DstPix;
    do
    {
        (srcB) = (fgColor) & 0xff;
        (srcG) = ((fgColor) >> 8) & 0xff;
        (srcR) = ((fgColor) >> 16) & 0xff;
        (srcA) = ((fgColor) >> 24) & 0xff;
    }
    while (0);
    if (srcA == 0)
    {
        srcR = srcG = srcB = 0;
        fgColor = 0;
    }
    else
    {
        if (!(0))
        {
            fgColor = (srcA << 24) | (fgColor & 0x00ffffff);
            ;
        }
        if (srcA != 0xff)
        {
            do
            {
                srcR = mul8table[srcA][srcR];
                srcG = mul8table[srcA][srcG];
                srcB = mul8table[srcA][srcB];
            }
            while (0);
        }
        if (0)
        {
            ;
        }
    }
    DstPix = 0;
    ;
    rasScan -= width * 4;
    if (pMask)
    {
        pMask += maskOff;
        maskScan -= width;
        do
        {
            jint w = width;
            ;
            do
            {
                jint resA;
                jint resR, resG, resB;
                jint dstF;
                jint pathA = *pMask++;
                if (pathA > 0)
                {
                    if (pathA == 0xff)
                    {
                        (pRas)[0] = (fgColor);
                    }
                    else
                    {
                        ;
                        dstF = 0xff - pathA;
                        do
                        {
                            DstPix = (pRas)[0];
                            resA = ((juint) DstPix) >> 24;
                        }
                        while (0);
                        resA = mul8table[dstF][resA];
                        if (!(0))
                        {
                            dstF = resA;
                        }
                        resA += mul8table[pathA][srcA];
                        do
                        {
                            resR = (DstPix >> 16) & 0xff;
                            resG = (DstPix >> 8) & 0xff;
                            resB = (DstPix >> 0) & 0xff;
                        }
                        while (0);
                        do
                        {
                            resR = mul8table[dstF][resR] + mul8table[pathA][srcR];
                            resG = mul8table[dstF][resG] + mul8table[pathA][srcG];
                            resB = mul8table[dstF][resB] + mul8table[pathA][srcB];
                        }
                        while (0);
                        if (!(0) && resA && resA < 0xff)
                        {
                            do
                            {
                                resR = div8table[resA][resR];
                                resG = div8table[resA][resG];
                                resB = div8table[resA][resB];
                            }
                            while (0);
                        }
                        (pRas)[0] = (((((((resA) << 8) | (resR)) << 8) | (resG)) << 8) | (resB));
                    }
                }
                pRas = ((void *) (((intptr_t) (pRas)) + (4)));
                ;
            }
            while (--w > 0);
            pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
            ;
            pMask = ((void *) (((intptr_t) (pMask)) + (maskScan)));
        }
        while (--height > 0);
    }
    else
    {
        do
        {
            jint w = width;
            ;
            do
            {
                (pRas)[0] = (fgColor);
                pRas = ((void *) (((intptr_t) (pRas)) + (4)));
                ;
            }
            while (--w > 0);
            pRas = ((void *) (((intptr_t) (pRas)) + (rasScan)));
            ;
        }
        while (--height > 0);
    }
}

It seems that the alpha blending macros are quite complex and cannot be
vectorized:

Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad inner-loop form.
IntArgb.c:109: note: not vectorized: Bad inner loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: not vectorized: control flow in loop.
IntArgb.c:109: note: bad loop form.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: failed: evolution of base is not affine.
IntArgb.c:109: note: bad data references.
Analyzing loop at IntArgb.c:109
IntArgb.c:109: note: Unknown misalignment, is_packed = 0
IntArgb.c:109: note: virtual phi. skip.
IntArgb.c:109: note: not vectorized: value used after loop.
IntArgb.c:109: note: bad operation or unsupported loop bound.
IntArgb.c:109: note: vectorized 0 loops in function.
IntArgb.c:109: note: not consecutive access rasScan_26 =
pRasInfo_25(D)->scanStride;
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _40 =
mul8table[srcA_36][srcB_33];
IntArgb.c:109: note: not consecutive access _42 =
mul8table[srcA_36][srcB_31];
IntArgb.c:109: note: not consecutive access _44 =
mul8table[srcA_36][srcB_29];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *pMask_1
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Unknown alignment for access: mul8table
IntArgb.c:109: note: not consecutive access _65 =
mul8table[dstF_60][resA_64];
IntArgb.c:109: note: not consecutive access _67 =
mul8table[pathA_58][srcA_36];
IntArgb.c:109: note: not consecutive access _75 =
mul8table[dstF_66][resR_71];
IntArgb.c:109: note: not consecutive access _77 =
mul8table[pathA_58][srcB_6];
IntArgb.c:109: note: not consecutive access _80 =
mul8table[dstF_66][resG_73];
IntArgb.c:109: note: not consecutive access _82 =
mul8table[pathA_58][srcB_7];
IntArgb.c:109: note: not consecutive access _85 =
mul8table[dstF_66][resB_74];
IntArgb.c:109: note: not consecutive access _87 =
mul8table[pathA_58][srcB_8];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: Unknown alignment for access: div8table
IntArgb.c:109: note: not consecutive access _93 =
div8table[resA_69][resR_79];
IntArgb.c:109: note: not consecutive access _95 =
div8table[resA_69][resG_84];
IntArgb.c:109: note: not consecutive access _97 =
div8table[resA_69][resB_89];
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_9
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
IntArgb.c:109: note: Failed to SLP the basic block.
IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
basic block.
IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.


Any idea how to make such code faster, or how to make it vectorizable?


Finally, I noticed that the macros with the Lcd suffix seem to perform proper
gamma correction:

void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef *glyphs,
                             jint totalGlyphs, jint fgpixel, jint argbcolor,
                             jint clipLeft, jint clipTop, jint clipRight,
                             jint clipBottom, jint rgbOrder,
                             unsigned char *gammaLut,
                             unsigned char *invGammaLut,
                             NativePrimitive *pPrim, CompositeInfo *pCompInfo)
...
    srcR = invGammaLut[srcR];
    srcG = invGammaLut[srcG];
    srcB = invGammaLut[srcB];
...
alpha blending
...
    dstR = gammaLut[dstR];
    dstG = gammaLut[dstG];
    dstB = gammaLut[dstB];

That's exactly what I need in order to implement correct gamma correction in
mask fill operations (shape draw / fill) for software loops (buffered image
rendering).
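
To make the idea concrete, here is a minimal sketch of what a gamma-aware
blend could look like for one mask-fill pixel. It is not the existing macro
code: blend_channel_gamma and blend_pixel_gamma are hypothetical names, the
two tables are simply assumed to play the same roles as gammaLut and
invGammaLut in the LCD loop above (256 entries each), and the destination is
assumed to be opaque to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blend one 8-bit colour channel in linear space.
 * invGammaLut: gamma-encoded -> linear, gammaLut: linear -> gamma-encoded
 * (assumed to have 256 entries each, as in the LCD loop).
 * pathA is the mask coverage, 0..255.
 */
static uint8_t blend_channel_gamma(const uint8_t *gammaLut,
                                   const uint8_t *invGammaLut,
                                   uint8_t src, uint8_t dst, uint8_t pathA)
{
    assert(gammaLut && invGammaLut);
    unsigned s = invGammaLut[src];
    unsigned d = invGammaLut[dst];
    /* Linear interpolation rounded to nearest: (s*a + d*(255-a)) / 255. */
    unsigned lin = (s * pathA + d * (255u - pathA) + 127u) / 255u;
    return gammaLut[lin];
}

/* Hypothetical per-pixel helper; the destination is assumed opaque. */
static uint32_t blend_pixel_gamma(const uint8_t *gammaLut,
                                  const uint8_t *invGammaLut,
                                  uint32_t srcRgb, uint32_t dstRgb,
                                  uint8_t pathA)
{
    uint32_t r = blend_channel_gamma(gammaLut, invGammaLut,
                                     (srcRgb >> 16) & 0xff,
                                     (dstRgb >> 16) & 0xff, pathA);
    uint32_t g = blend_channel_gamma(gammaLut, invGammaLut,
                                     (srcRgb >> 8) & 0xff,
                                     (dstRgb >> 8) & 0xff, pathA);
    uint32_t b = blend_channel_gamma(gammaLut, invGammaLut,
                                     srcRgb & 0xff, dstRgb & 0xff, pathA);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}
```

With identity tables this reduces to an ordinary rounded blend. Keeping the
linear values in 8 bits (as the LCD loop does) loses some precision in dark
tones, so a wider intermediate type may be worth trying.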

I will now try to figure out how that C code is generated by the nested
macros!

Laurent

From Sergey.Bylokhov at oracle.com  Fri Jan 15 21:49:22 2016
From: Sergey.Bylokhov at oracle.com (Sergey Bylokhov)
Date: Sat, 16 Jan 2016 00:49:22 +0300
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To: <CAKjRUT67V6pSEuyYTJ9wUOD7=S7WuqTjwazNbbimE+PYiaB=NA@mail.gmail.com>
References: <CAKjRUT7qmxibs-YhHSyoHs4aCR8N=PvadvqhDKfMWeC0iEF-wQ@mail.gmail.com>
	<CAKjRUT7oYaq7saQd591oOOKP28fJPhrQFxst_X9N=_xGU72T+w@mail.gmail.com>
	<CAKjRUT5Bod6br0M4trREcD=uZkbLXxNio5u89yruWCjgG7Owew@mail.gmail.com>
	<CAKjRUT7BEMJu5HF36xDLv2LNs8xSUF420O92W0Y=RFAjPZy+kg@mail.gmail.com>
	<5627E1B0.4060206@oracle.com>
	<CAKjRUT7yFdMTCgRJV3Fj066CLSzn4_DBiHUC3kBbYtcmfufBYA@mail.gmail.com>
	<5658396E.2090605@oracle.com>
	<CAKjRUT6_Dk-yw0c87nfG1+3RNhtja-ihV+-OsuE7wcAn63Vhvw@mail.gmail.com>
	<56683DF8.1020607@oracle.com>
	<CAKjRUT4GVH0ND+pHPHgex7izZVehLjip=fV7=PrSG6C2yDXFzA@mail.gmail.com>
	<CAKjRUT67V6pSEuyYTJ9wUOD7=S7WuqTjwazNbbimE+PYiaB=NA@mail.gmail.com>
Message-ID: <56996962.2090304@oracle.com>

Hi,

I found that, as far as vectorisation is concerned, one of the main hotspots
is our table lookup pattern (mul8table/div8table), which cannot be vectorized.
Another hotspot is the many conditions inside the main loops.
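
To illustrate both points, here is a rough sketch of an inner loop with no
table lookups and no per-pixel branches. It is a simplified stand-in, not the
real macro output: it assumes premultiplied pixels (so the div8table step is
not needed), it assumes mul8table[a][b] holds round(a * b / 255), and the
names mul8, src_mask_fill_row and fgPre are made up for the example. I have
not checked whether gcc 4.8 actually vectorizes it.

```c
#include <assert.h>
#include <stdint.h>

/* round(a * b / 255) for a, b in 0..255, with no table and no division. */
static inline uint32_t mul8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128u;
    return (t + (t >> 8)) >> 8;
}

/*
 * Hypothetical simplified SrcMaskFill for one row of premultiplied ARGB
 * pixels. Every pixel takes the same straight-line path: pathA == 0 leaves
 * the destination unchanged and pathA == 0xff yields the source colour
 * without any special-casing.
 */
static void src_mask_fill_row(uint32_t *restrict pRas,
                              const uint8_t *restrict pMask,
                              int width, uint32_t fgPre)
{
    const uint32_t srcA = fgPre >> 24;
    const uint32_t srcR = (fgPre >> 16) & 0xff;
    const uint32_t srcG = (fgPre >> 8) & 0xff;
    const uint32_t srcB = fgPre & 0xff;

    assert(width >= 0);
    for (int x = 0; x < width; x++) {
        const uint32_t pathA = pMask[x];
        const uint32_t dstF = 0xff - pathA;
        const uint32_t dst = pRas[x];
        const uint32_t resA = mul8(dstF, dst >> 24) + mul8(pathA, srcA);
        const uint32_t resR = mul8(dstF, (dst >> 16) & 0xff) + mul8(pathA, srcR);
        const uint32_t resG = mul8(dstF, (dst >> 8) & 0xff) + mul8(pathA, srcG);
        const uint32_t resB = mul8(dstF, dst & 0xff) + mul8(pathA, srcB);
        pRas[x] = (resA << 24) | (resR << 16) | (resG << 8) | resB;
    }
}
```

The trade-off is that fully covered and fully empty pixels now pay for the
whole computation, so this only wins if the straight-line loop runs wide
enough to make up for the lost shortcuts.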


-- 
Best regards, Sergey.

From james.graham at oracle.com  Fri Jan 15 22:34:19 2016
From: james.graham at oracle.com (Jim Graham)
Date: Fri, 15 Jan 2016 14:34:19 -0800
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To: <56996962.2090304@oracle.com>
References: <CAKjRUT7qmxibs-YhHSyoHs4aCR8N=PvadvqhDKfMWeC0iEF-wQ@mail.gmail.com>
	<CAKjRUT7oYaq7saQd591oOOKP28fJPhrQFxst_X9N=_xGU72T+w@mail.gmail.com>
	<CAKjRUT5Bod6br0M4trREcD=uZkbLXxNio5u89yruWCjgG7Owew@mail.gmail.com>
	<CAKjRUT7BEMJu5HF36xDLv2LNs8xSUF420O92W0Y=RFAjPZy+kg@mail.gmail.com>
	<5627E1B0.4060206@oracle.com>
	<CAKjRUT7yFdMTCgRJV3Fj066CLSzn4_DBiHUC3kBbYtcmfufBYA@mail.gmail.com>
	<5658396E.2090605@oracle.com>
	<CAKjRUT6_Dk-yw0c87nfG1+3RNhtja-ihV+-OsuE7wcAn63Vhvw@mail.gmail.com>
	<56683DF8.1020607@oracle.com>
	<CAKjRUT4GVH0ND+pHPHgex7izZVehLjip=fV7=PrSG6C2yDXFzA@mail.gmail.com>
	<CAKjRUT67V6pSEuyYTJ9wUOD7=S7WuqTjwazNbbimE+PYiaB=NA@mail.gmail.com>
	<56996962.2090304@oracle.com>
Message-ID: <569973EB.3050509@oracle.com>

The lookups were written in 1997-ish when processors had different 
vectorization/computation tradeoffs.  It might be interesting to 
investigate a non-table version of the macros and see how the 
performance differs...
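
As a possible starting point for that experiment, here is a small sketch of
arithmetic replacements for the two lookups. The exact table contents are an
assumption on my side: mul8table[a][b] is taken to be round(a * b / 255) and
div8table[a][v] to be round(v * 255 / a) clamped to 255, so any real
replacement should be compared against the actual tables first. The names
mul8_arith, div8_arith and count_mul8_mismatches are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mul8table[a][b]: round(a * b / 255) using shifts only. */
static inline uint32_t mul8_arith(uint32_t a, uint32_t b)
{
    assert(a <= 255u && b <= 255u);
    uint32_t t = a * b + 128u;
    return (t + (t >> 8)) >> 8;
}

/*
 * Stand-in for div8table[a][v]: round(v * 255 / a), clamped to 255.
 * a must be non-zero; the loop in question already tests resA first.
 */
static inline uint32_t div8_arith(uint32_t a, uint32_t v)
{
    assert(a >= 1u && a <= 255u && v <= 255u);
    uint32_t q = (v * 255u + a / 2u) / a;
    return q > 255u ? 255u : q;
}

/* Counts inputs where mul8_arith differs from a rounded division by 255. */
static int count_mul8_mismatches(void)
{
    int bad = 0;
    for (uint32_t a = 0; a < 256u; a++) {
        for (uint32_t b = 0; b < 256u; b++) {
            uint32_t ref = (2u * a * b + 255u) / 510u; /* floor(x/255 + 0.5) */
            if (mul8_arith(a, b) != ref) {
                bad++;
            }
        }
    }
    return bad;
}
```

The multiply maps onto plain integer operations, while the divide still
needs a real division per channel, so the div8table side is probably the
harder one to replace profitably.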

			...jim

>> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
>> basic block.
>> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
>> IntArgb.c:109: note: SLP: step doesn't divide the vector-size.
>> IntArgb.c:109: note: Unknown alignment for access: *rasBase_11
>> IntArgb.c:109: note: Failed to SLP the basic block.
>> IntArgb.c:109: note: not vectorized: failed to find SLP opportunities in
>> basic block.
>> IntArgb.c:109: note: not vectorized: not enough data-refs in basic block.
>>
>>
>> Any idea to make such code faster ? or to make it work with
>> vectorization ?
>>
>>
>> Finally, I noticed that the macros with the Lcd suffix seem to perform
>> proper gamma correction:
>>
>> void IntArgbDrawGlyphListLCD(SurfaceDataRasInfo *pRasInfo, ImageRef
>> *glyphs, jint totalGlyphs, jint fgpixel, jint argbcolor, jint clipLeft,
>> jint clipTop, jint clipRight, jint clipBottom, jint rgbOrder, unsigned
>> char *gammaLut, unsigned char * invGammaLut, NativePrimitive *pPrim,
>> CompositeInfo *pCompInfo)
>> ...
>>      srcR = invGammaLut[srcR];
>>      srcG = invGammaLut[srcG];
>>      srcB = invGammaLut[srcB];
>> ...
>> alpha blending
>> ...
>>      dstR = gammaLut[dstR];
>>      dstG = gammaLut[dstG];
>>      dstB = gammaLut[dstB];
>>
>> That's exactly what I need in order to implement correct gamma correction
>> in mask fill operations (shape draw / fill) for software loops (buffered
>> image rendering).
>>
>> I will try now to figure out how that C code is generated by the nested
>> macros !
>>
>> Laurent
>
>

From bourges.laurent at gmail.com  Thu Jan 21 08:36:52 2016
From: bourges.laurent at gmail.com (=?UTF-8?Q?Laurent_Bourg=C3=A8s?=)
Date: Thu, 21 Jan 2016 09:36:52 +0100
Subject: [OpenJDK Rasterizer] AWT & gcc 4.8 optimization options
In-Reply-To: <569CE936.6010208@oracle.com>
References: <CAKjRUT7qmxibs-YhHSyoHs4aCR8N=PvadvqhDKfMWeC0iEF-wQ@mail.gmail.com>
	<CAKjRUT7oYaq7saQd591oOOKP28fJPhrQFxst_X9N=_xGU72T+w@mail.gmail.com>
	<CAKjRUT5Bod6br0M4trREcD=uZkbLXxNio5u89yruWCjgG7Owew@mail.gmail.com>
	<CAKjRUT7BEMJu5HF36xDLv2LNs8xSUF420O92W0Y=RFAjPZy+kg@mail.gmail.com>
	<5627E1B0.4060206@oracle.com>
	<CAKjRUT7yFdMTCgRJV3Fj066CLSzn4_DBiHUC3kBbYtcmfufBYA@mail.gmail.com>
	<5658396E.2090605@oracle.com>
	<CAKjRUT6_Dk-yw0c87nfG1+3RNhtja-ihV+-OsuE7wcAn63Vhvw@mail.gmail.com>
	<56683DF8.1020607@oracle.com>
	<CAKjRUT4GVH0ND+pHPHgex7izZVehLjip=fV7=PrSG6C2yDXFzA@mail.gmail.com>
	<CAKjRUT67V6pSEuyYTJ9wUOD7=S7WuqTjwazNbbimE+PYiaB=NA@mail.gmail.com>
	<56996962.2090304@oracle.com>
	<CAKjRUT5pG3PVJxZZr6r5pa2wpsYgFmwCpcBXv9=4bmy1xoSHfA@mail.gmail.com>
	<569CE936.6010208@oracle.com>
Message-ID: <CAKjRUT7X8YxEU96AzSeDmRYMRqhVXy5tz=-cd5Vv=Hde4M-gXg@mail.gmail.com>

Sergey,

>> So it looks like scalar operations on a vector (4), i.e. vectorization
>> should be applicable.
>
>
> yes, I think so.

I googled a bit: alpha blending seems tricky to implement with SSE2,
but many projects have succeeded by using SSE2 intrinsics directly!

>> Maybe the conditions (pathA > 0) && (pathA < 0xff) are a bigger penalty
>> as they can not be easily predicted (but may happen often).
>> Sometimes it is faster to perform useless math operations without
>> branching (gpu approach).
>>
>> Do you have other ideas to make it faster ? as it represents 30% of the
>> ellipse fill test (huge ellipses).
>> I noticed that larger tiles (64x64) are a bit faster (larger tile width
>> / height, less jni calls)
>
>
> I just commented out some of the code inside this method and checked the
> performance. It seems that simple code like
> inloop->readBytes->decodeRGB->encodeBytes->saveBytes is quite fast. But
> if some branches/multiplications are added after decodeRGB, then the code
> becomes really slow (x10 slower on my system). This is expected because we
> perform a huge number of multiplications, but if I try to do the same math
> standalone (without byte decoding), then the result is also fast. So it
> seems that we are slow because of the mixing of byte
> reading/branches/multiplication.

It seems possible, for RGBA, to:
- compute A+G and R+B together (2x16 bits) to double the throughput
- use bit shifts instead of mul / div
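
To make the two points above concrete, here is a minimal sketch in plain C.
It is not the existing AWT loop code: the names div255_x2, blend_argb_pre and
LANE_MASK are made up for illustration, and it assumes a premultiplied ARGB
pixel where every channel follows res = src * pathA + dst * (255 - pathA),
divided by 255. Two channels are packed into the 16-bit lanes of one 32-bit
word, and the division by 255 is done with shifts and adds only:

```c
#include <assert.h>
#include <stdint.h>

/* Selects the low byte of each 16-bit lane (R and B, or A and G). */
#define LANE_MASK 0x00FF00FFu

/* Rounded division by 255 of two 16-bit lanes at once, using only
 * shifts and adds. Valid while each lane is in [0, 255 * 255]. */
static uint32_t div255_x2(uint32_t x)
{
    uint32_t t = x + 0x00800080u;   /* add 128 to each lane */
    return ((t + ((t >> 8) & LANE_MASK)) >> 8) & LANE_MASK;
}

/* Hypothetical helper (not an AWT function): blends one premultiplied
 * ARGB pixel as (src * pathA + dst * (255 - pathA)) / 255 per channel,
 * processing R+B together and A+G together. */
static uint32_t blend_argb_pre(uint32_t src, uint32_t dst, uint32_t pathA)
{
    assert(pathA <= 255u);
    uint32_t invA = 255u - pathA;

    /* Each lane sum is at most 255 * 255, so it never spills
     * into the neighbouring lane. */
    uint32_t rb = div255_x2((src & LANE_MASK) * pathA
                          + (dst & LANE_MASK) * invA);
    uint32_t ag = div255_x2(((src >> 8) & LANE_MASK) * pathA
                          + ((dst >> 8) & LANE_MASK) * invA);

    return (ag << 8) | rb;
}
```

In this sketch pathA == 0 returns dst and pathA == 0xff returns src without
any test on pathA, so the inner loop would need no branch. Results may differ
by one unit from the mul8table based loops, because here the sum is rounded
once instead of each product being rounded separately.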

Could you try implementing such variants ?

>> Should I try (as I did in the past) to implement the MaskFill in Java to
>> benefit from hotspot optimizations (like Marlin) ?
>
>
> It will be interesting. I remember that someone already tried to do the
> same, but I do not remember the result. Probably Jim can suggest something.

I implemented alpha blending in Java last year (using a custom composite
operator hack):
http://mail.openjdk.java.net/pipermail/2d-dev/2014-August/004751.html

I could try optimizing my Java implementation soon...

Cheers,
Laurent