Request for review: 6896617: Optimize sun.nio.cs.ISO_8859_1$Encode.encodeArrayLoop() on x86

Tue Jan 22 12:20:42 PST 2013

Thank you, Ulf

I will rename method to encodeISOArray and add comment that it could be 
replaced by intrinsic by JVM.

The same arrays were use intentionally in test to get performance of 
code without effect of CPU's cache/memory subsystem.

The method encodeArrayLoop() is also compiled because it is invoked > 
10000 times. And I don't see an effect on performance of its code change 
so I will leave it as it is.

Thanks,
Vladimir

On 1/21/13 3:30 PM, Ulf Zibis wrote:
> Am 17.01.2013 05:27, schrieb Vladimir Kozlov:
>> On 1/12/13 12:37 AM, Ulf Zibis wrote:
>>> 3) bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896617 ==> This bug is
>>> not available.
>>
>> I opened it, should show up in few days.
>
> Thanks!
>
>>
>>> 4) What specific operation should be done by the intrinsic, i.e. is
>>> there a fixed API for that method ???
>>
>> When C2 (server JIT compiler in JVM) compiles encode methods it will
>> replace new method encodeArray() (matched by signature) with hand
>> optimized assembler code which uses latest processor instructions. I
>> will send Hotspot changes soon. So it is nothing to do with
>> interpreter or bytecode sequence.
>>
>>> 5) Can an intrinsic write back more than 1 value (see my hack via int[]
>>> p) ?
>>> 6) Vladimir's webrev shows an integer as return type for that method,
>>> I've added a variant with boolean return type, and the code from my last
>>> approach could be transformed to a method with Object return type.
>
> I wanted to say, there would maybe different results for the surrounding
> code depending on the API we choose for the intrinsic method call. If
> the surrounding code is *as short as possible*, there is a better
> chance, it will be JIT-compiled too after fewer invocations of
> encodeArrayLoop(). I guess, the fastest would be:
>
> // while inlinig, JIT will erase the surrounding int[] p
> private static CoderResult copyISOs(CoderResult cr,
>          char[] sa, byte[] da, final int[] p, int sl) {
>      for (int sp = p[0], dp = p[1]; sp < sl; sp++) {
>          char c = sa[sp];
>          // if (c & '\uFF00' != 0) // needs bug 6935994 to be fixed,
> would be fastest
>          if ((byte)(c >> 8) != 0) // temporary replacement, fast
> byte-length operation on x86
>              return null;
>          da[dp++] = (byte)c;
>      }
>      return cr;
> }
>
> // No more needs try...finally block
> private CoderResult encodeArrayLoop(
>          CharBuffer src, ByteBuffer dst) {
>      char[] sa = src.array();
>      int soff = src.arrayOffset();
>      int sp = soff + src.position();
>      int sr = src.remaining();
>      byte[] da = dst.array();
>      int doff = dst.arrayOffset();
>      int dp = doff + dst.position();
>      int dr = dst.remaining();
>      CoderResult cr;
>      if (dr < sr) {
>          sr = dr;
>          cr = CoderResult.OVERFLOW;
>      } else
>          cr = CoderResult.UNDERFLOW;
>      int sl = sp + sr;
>      final int[] p = { sp, dp };
>      cr = copyISOs(cr, sa, da, p, sl);
>      src.position(p[0] - soff);
>      dst.position(p[1] - doff);
>      return result(cr, sa, p[0], sl);
>
> }
>
> // if adapted, maybe could also be reused in encodeBufferLoop()
> private static CoderResult result(CoderResult cr, byte[] sa, int sp, int
> sl) {
>      return cr != null ? cr :
>          sgp.parse(sa[sp], sa, sp, sl) < 0
>              ? sgp.error();
>              : sgp.unmappableResult();
> }
>
>
>>>
>>> ... so waiting for Vladimir's feedback :-[
>>> (especially on performance/hsdis results)
>>
>> Performance on x86 tested with next code (whole test will be in
>> Hotspot changes) :
>>
>>         ba = CharBuffer.wrap(a);
>>         bb = ByteBuffer.wrap(b);
>>         long start = System.currentTimeMillis();
>>         for (int i = 0; i < 1000000; i++) {
>>             ba.clear(); bb.clear();
>>             enc_res = enc_res && enc.encode(ba, bb, true).isUnderflow();
>>         }
>>         long end = System.currentTimeMillis();
>
> 1.) Wouldn't System.nanoTime() give more accurate results?
> 2.) I want to point out that it is not real world scenario, encoding the
> same data 1.000.000 times. If same data is used, it is likely, that the
> data itself becomes cached in the inner CPU cache so should have very
> fast access times, which would be not the case on real world data.
> 3.) It would also be interesting to see the results for less than
> 1.000.000 iterations in considering, the surrounding code would be
> JIT-compiled or not. Also performance on C1 should be tested.
>
> I also worry about the naming of method encodeArray(...). I think, it
> should reflect the fact, that it only encodes ISO-8859-1 charset characters.
>
> Please add a comment on the fact, that method encodeArray(...) is
> intended to be intrinsified.
>
> -Ulf
>
>