RFR: 8296548: Improve MD5 intrinsic for x86_64

Tue Nov 15 16:34:58 UTC 2022

On Mon, 14 Nov 2022 15:47:25 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:

>> The LEA instruction loads the effective address, but the MD5 intrinsic uses it to compute values rather than addresses. This use can take more cycles than ADDs and reduce throughput.
>> 
>> This change replaces
>>     LEA:  r1 = r1 + rsi * 1 + t
>> with
>>     ADDs: r1 += t; r1 += rsi.
>> 
>> Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.
>> 
>> No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.
>> 
>> Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.
>
> Could you please post JMH microbenchmarks with and without this change? You can run them with `org.openjdk.bench.java.security.MessageDigests` [1]
> 
> [1] https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java

@luhenry, @vnkozlov 
Sorry for the uninformative PR description.

In the MD5 intrinsic stub we use a three-operand LEA (base, index, and offset), and this LEA is on the critical path.

The optimization is done according to the Intel 64 and IA-32 Architectures Optimization Reference Manual (Feb 2022), 3.5.1.2:

In Sandy Bridge microarchitecture, there are two significant changes to the performance characteristics of LEA instruction:
For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.

Assembly/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better.

ADD has had latency 1 and throughput 4 since Haswell (see https://www.agner.org/optimize/instruction_tables.pdf).
According to the same tables, LEA performance in Ice Lake was improved to latency 1 and throughput 2. This explains why we see no improvement on Ice Lake.
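
To illustrate why the replacement is safe, here is a minimal Java sketch of one MD5 round-1 step with the sum formed both ways. It is not the stub code: the class and method names are made up for this example, and the message word x is an arbitrary value. It only shows that the LEA-shaped sum and the two-ADD sequence give the same 32-bit result, and that the constant t can be added before the round function value is available. It says nothing about timing, since the JIT is free to reorder these additions.

```java
public class LeaVsAdds {
    // MD5 round-1 function F(b, c, d) = (b & c) | (~b & d) from RFC 1321,
    // written in the equivalent xor/and/xor form.
    static int f(int b, int c, int d) {
        return ((c ^ d) & b) ^ d;
    }

    // LEA shape: a single three-input sum, modelling r1 = r1 + rsi * 1 + t.
    static int stepLeaShape(int a, int b, int c, int d, int x, int t, int s) {
        int r1 = a + x;              // add the message word
        int rsi = f(b, c, d);        // round function value
        r1 = r1 + rsi + t;           // three-operand sum
        return Integer.rotateLeft(r1, s) + b;
    }

    // ADD shape: r1 += t; r1 += rsi. The first add does not depend on rsi.
    static int stepAddShape(int a, int b, int c, int d, int x, int t, int s) {
        int r1 = a + x;              // add the message word
        r1 += t;                     // constant, independent of f(b, c, d)
        r1 += f(b, c, d);            // only this add waits for the round function
        return Integer.rotateLeft(r1, s) + b;
    }

    public static void main(String[] args) {
        // MD5 initial state and the first round constant and shift (RFC 1321).
        int a = 0x67452301, b = 0xefcdab89, c = 0x98badcfe, d = 0x10325476;
        int t = 0xd76aa478, s = 7;
        int x = 0x00000080;          // arbitrary message word, chosen for illustration
        int lea = stepLeaShape(a, b, c, d, x, t, s);
        int add = stepAddShape(a, b, c, d, x, t, s);
        System.out.println("results equal: " + (lea == add));
    }
}
```

Both forms agree for any inputs because int addition in Java wraps modulo 2^32 and is associative and commutative, matching 32-bit register arithmetic.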

The patch correctness was tested with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
The microbenchmark we used:

import org.apache.commons.lang3.RandomStringUtils;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.BenchmarkParams;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MD5Benchmark {

    private static final int MAX_INPUTS_COUNT = 1000;
    private static final int MAX_INPUT_LENGTH = 128 * 1024;
    private static List<byte[]> inputs;

    static {
        inputs = new ArrayList<>();
        IntStream.rangeClosed(1, MAX_INPUTS_COUNT).forEach(value -> inputs.add(RandomStringUtils.randomAlphabetic(MAX_INPUT_LENGTH).getBytes(StandardCharsets.UTF_8)));
    }

    @Param({"64", "128", "256", "512", "1024", "2048", "4096", "8192", "16384", "32768", "65536", "131072"})
    private int data_len;

    @State(Scope.Thread)
    public static class InputData {
        byte[] data;
        int count;
        byte[] expectedDigest;
        byte[] digest;

        @Setup
        public void setup(BenchmarkParams params) {
            data = inputs.get(ThreadLocalRandom.current().nextInt(0, MAX_INPUTS_COUNT));
            count = Integer.parseInt(params.getParam("data_len"));
            // Reference digest, computed once per trial with the same helper the benchmark uses.
            expectedDigest = calculateMD5Checksum(data, count);
        }

        @TearDown
        public void check() {
            if (!Arrays.equals(expectedDigest, digest)) {
                throw new RuntimeException("Expected md5 digest:\n" + Arrays.toString(expectedDigest) +
                                           "\nGot:\n" + Arrays.toString(digest));
            }
        }
    }

    @Benchmark
    public void testMD5(InputData in) {
        in.digest = calculateMD5Checksum(in.data, in.count);
    }

    private static byte[] calculateMD5Checksum(byte[] input, int count) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(input, 0, count);
            return md5.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
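
As a complement to the digest check in the benchmark's teardown, a standalone known-answer check can compare the JDK's MD5 output against fixed values. The sketch below uses two test vectors from the RFC 1321 test suite; the class name and the md5Hex helper are made up for this example. Such short inputs run only a few times and will most likely not reach the intrinsic, so this is a sanity check of expected values, not a replacement for TestMD5Intrinsics.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MD5KnownAnswer {
    // Hypothetical helper: lower-case hex MD5 digest of an ASCII string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.US_ASCII));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Test vectors from the RFC 1321 test suite.
        check("", "d41d8cd98f00b204e9800998ecf8427e");
        check("abc", "900150983cd24fb0d6963f7d28e17f72");
        System.out.println("known-answer checks passed");
    }

    static void check(String input, String expected) {
        String actual = md5Hex(input);
        if (!expected.equals(actual)) {
            throw new RuntimeException("MD5(\"" + input + "\"): expected " + expected + ", got " + actual);
        }
    }
}
```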

-------------

PR: https://git.openjdk.org/jdk/pull/11054