RFR: 8296548: Improve MD5 intrinsic for x86_64
Evgeny Astigeevich
eastigeevich at openjdk.org
Tue Nov 15 16:34:58 UTC 2022
On Mon, 14 Nov 2022 15:47:25 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:
>> The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput.
>>
>> This change replaces
>> LEA: r1 = r1 + rsi * 1 + t
>> with
>> ADDs: r1 += t; r1 += rsi.
>>
>> Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.
>>
>> No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.
>>
>> Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.
>
> Could you please post JMH microbenchmarks with and without this change? You can run them with `org.openjdk.bench.java.security.MessageDigests` [1]
>
> [1] https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java
@luhenry, @vnkozlov
Sorry for the uninformative PR description.
In the MD5 intrinsic stub we use 3 operand LEA. This LEA is on the critical path.
The optimization is done according to the Intel 64 and IA-32 Architectures Optimization Reference Manual (Feb 2022), 3.5.1.2:
In Sandy Bridge microarchitecture, there are two significant changes to the performance characteristics of LEA instruction:
For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.
Assembly/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better.
ADD has had latency 1 and throughput 4 since Haswell (see https://www.agner.org/optimize/instruction_tables.pdf).
>From https://www.agner.org/optimize/instruction_tables.pdf, in Ice Lake LEA performance was improved to latency 1 and throughput 2. This explains no improvement on it.
The patch correctness was tested with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
The microbenchmark we used:
import org.apache.commons.lang3.RandomStringUtils;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.BenchmarkParams;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MD5Benchmark {
private static final int MAX_INPUTS_COUNT = 1000;
private static final int MAX_INPUT_LENGTH = 128 * 1024;
private static List<byte[]> inputs;
static {
inputs = new ArrayList<>();
IntStream.rangeClosed(1, MAX_INPUTS_COUNT).forEach(value -> inputs.add(RandomStringUtils.randomAlphabetic(MAX_INPUT_LENGTH).getBytes(StandardCharsets.UTF_8)));
}
@Param({"64", "128", "256", "512", "1024", "2048", "4096", "8192", "16384", "32768", "65536", "131072"})
private int data_len;
@State(Scope.Thread)
public static class InputData {
byte[] data;
int count;
byte[] expectedDigest;
byte[] digest;
@Setup
public void setup(BenchmarkParams params) {
data = inputs.get(ThreadLocalRandom.current().nextInt(0, MAX_INPUTS_COUNT));
count = Integer.parseInt(params.getParam("data_len"));
expectedDigest = calculateJdkMD5Checksum(data, count);
}
@TearDown
public void check() {
if (!Arrays.equals(expectedDigest, digest)) {
throw new RuntimeException("Expected md5 digest:\n" + Arrays.toString(expectedDigest) +
"\nGot:\n" + Arrays.toString(digest));
}
}
}
@Benchmark
public void testMD5(InputData in) {
in.digest = calculateMD5Checksum(in.data, in.count);
}
private static byte[] calculateMD5Checksum(byte[] input, int count) {
try {
MessageDigest md5 = MessageDigest.getInstance("MD5");
md5.update(input, 0, count);
return md5.digest();
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
-------------
PR: https://git.openjdk.org/jdk/pull/11054
More information about the hotspot-compiler-dev
mailing list