Optimizing byte reverse code for int value

Doerr, Martin martin.doerr at sap.com
Fri Apr 7 07:12:34 UTC 2017


Hi Michihiro,

thanks for providing the webrev. I appreciate improvements for this bottleneck.

After a first look, it looks good to me for ppc64le.
But I think it would break big endian platforms, where a plain 4-byte load does not yield the little-endian value that this byte pattern assembles.

I suggest replacing the use of loadI with endianness-specific code (which could possibly use lwbrx on big endian).
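
To illustrate the endianness concern, here is a minimal Java sketch; it is not taken from the webrev. The class and method names are hypothetical, and ByteBuffer with an explicit byte order merely stands in for what a plain load and a byte-reversed load (such as lwbrx) would return on a big endian machine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of HotSpot or the webrev.
public class EndianLoadSketch {
    // Byte-wise composition as in the matched pattern:
    // buf[off] becomes the least significant byte of the result.
    static int composeLittleEndian(byte[] buf, int off) {
        return (buf[off] & 0xff)
             + ((buf[off + 1] & 0xff) << 8)
             + ((buf[off + 2] & 0xff) << 16)
             + ((buf[off + 3] & 0xff) << 24);
    }

    // Assumption: models a plain 4-byte load on a big endian machine.
    static int plainLoadBigEndian(byte[] buf, int off) {
        return ByteBuffer.wrap(buf).order(ByteOrder.BIG_ENDIAN).getInt(off);
    }

    // Assumption: models a byte-reversed load on a big endian machine.
    static int byteReversedLoadBigEndian(byte[] buf, int off) {
        return Integer.reverseBytes(plainLoadBigEndian(buf, off));
    }

    public static void main(String[] args) {
        byte[] buf = {0x11, 0x22, 0x33, 0x44};
        System.out.printf("composed      = 0x%08x%n", composeLittleEndian(buf, 0));
        System.out.printf("plain BE load = 0x%08x%n", plainLoadBigEndian(buf, 0));
        System.out.printf("reversed load = 0x%08x%n", byteReversedLoadBigEndian(buf, 0));
    }
}
```

In this model only the byte-reversed load agrees with the byte-wise composition, which is why a single rule that always emits a plain load would be correct on little endian only.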

Best regards,
Martin


From: Michihiro Horie [mailto:HORIE at jp.ibm.com]
Sent: Friday, April 7, 2017 07:50
To: ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
Cc: Doerr, Martin <martin.doerr at sap.com>; Simonis, Volker <volker.simonis at sap.com>; volker.simonis at gmail.com; Hiroshi H Horii <HORII at jp.ibm.com>; Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Gustavo Bueno Romero <gromero at br.ibm.com>
Subject: Optimizing byte reverse code for int value


Dear all,

Would you please review our change for JDK10 on ppc64?
Issue: https://bugs.openjdk.java.net/browse/JDK-8178294
Webrev: http://cr.openjdk.java.net/~horii/8178294/webrev.00/

This change adds two conversion rules for code that assembles an int value from 4 contiguous bytes in reversed order.
The first conversion rule finds the pattern below and emits a single lwz instruction instead.

Original:
lbz r14,19(r12)
lbz r11,17(r12)
lbz r10,18(r12)
lbz r9,16(r12)
extsb r14,r14
rlwinm r10,r10,16,0,15
rlwinm r14,r14,24,0,7
add r14,r10,r14
rlwinm r11,r11,8,0,23
add r12,r11,r9
add r14,r12,r14

Optimization with first conversion rule:
lwz r14,16(r12)


The second conversion rule finds the pattern below and emits only an lfs instruction.

Original:
lbz r14,19(r12)
lbz r11,17(r12)
lbz r10,18(r12)
lbz r9,16(r12)
extsb r14,r14
rlwinm r10,r10,16,0,15
rlwinm r14,r14,24,0,7
add r14,r10,r14
rlwinm r11,r11,8,0,23
add r12,r11,r9
add r14,r12,r14
stw r14,156(r1)
lfs f12,156(r1)

Optimization with second conversion rule:
lfs f12,156(r1)


Our motivation is a performance bottleneck in the byte reversing code of Apache ORC on the
Tez framework, as shown below.
https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/SerializationUtils.java
We believe this kind of procedure is typical in Java.

public float readFloat(InputStream in) throws IOException {
  readFully(in, readBuffer, 0, 4);
  int val = (((readBuffer[0] & 0xff) << 0)
      + ((readBuffer[1] & 0xff) << 8)
      + ((readBuffer[2] & 0xff) << 16)
      + ((readBuffer[3] & 0xff) << 24));
  return Float.intBitsToFloat(val);
}
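
As a rough check of what this pattern computes, the sketch below compares the byte-wise composition with a little-endian float read through ByteBuffer. The class and method names are hypothetical and the buffer is passed as a parameter instead of being an instance field; this only illustrates the semantics and is not the ORC code or the attached benchmark.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Apache ORC.
public class ReadFloatSketch {
    // Same arithmetic as the readFloat method above, without the stream handling.
    static float readFloatBytewise(byte[] readBuffer) {
        int val = ((readBuffer[0] & 0xff) << 0)
                + ((readBuffer[1] & 0xff) << 8)
                + ((readBuffer[2] & 0xff) << 16)
                + ((readBuffer[3] & 0xff) << 24);
        return Float.intBitsToFloat(val);
    }

    // Reads the same 4 bytes as one little-endian float.
    static float readFloatViaByteBuffer(byte[] readBuffer) {
        return ByteBuffer.wrap(readBuffer).order(ByteOrder.LITTLE_ENDIAN).getFloat(0);
    }

    public static void main(String[] args) {
        // 1.0f has the bit pattern 0x3f800000; stored little-endian it is 00 00 80 3f.
        byte[] bytes = {0x00, 0x00, (byte) 0x80, 0x3f};
        System.out.println(readFloatBytewise(bytes));
        System.out.println(readFloatViaByteBuffer(bytes));
    }
}
```

Both methods return the same value for the same input bytes, so the four byte loads, shifts, and adds amount to a single 4-byte little-endian load followed by a bit-preserving move into a float register.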


With our change, we observed a 5% performance improvement in a micro benchmark.
(See attached file: ReadFloatTest.java)

Best regards,
--
Michihiro,
IBM Research - Tokyo