UTF-8 Validation with the Vector API (Performance)

Paul Sandoz paul.sandoz at oracle.com
Mon Mar 8 21:02:38 UTC 2021



On Mar 8, 2021, at 11:57 AM, August Nagro <augustnagro at gmail.com> wrote:

That's awesome, thanks for taking a look!

> Vector.slice(int origin, Vector<E> v1) is not currently optimized.

This is how simd-json implements rotate:

template<int N=1>
simdjson_really_inline simd8<T> prev(const simd8<T> prev_chunk) const {
  return _mm256_alignr_epi8(*this, _mm256_permute2x128_si256(prev_chunk, *this, 0x21), 16 - N);
}

It's only two instructions, which is great, and if you invert the method it would work for a subset of slice origins.

https://github.com/simdjson/simdjson/blob/master/include/simdjson/haswell/simd.h#L52
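
To make the relationship between the two operations concrete, here is a minimal scalar sketch in plain Java (no Vector API, so it says nothing about performance). It models a two-vector slice as reading lanes from the concatenation of two equal-length vectors, and models simdjson's prev<N> as a slice of (previous chunk, current chunk) starting N lanes before the end of the previous chunk. The class and method names (SliceSketch, slice, prev) are hypothetical and chosen only for illustration.

```java
final class SliceSketch {
    // Scalar model of a.slice(origin, b): lane i of the result is lane
    // (origin + i) of the concatenation of a followed by b.
    // Assumes a and b have the same length and 0 <= origin <= a.length.
    static byte[] slice(byte[] a, int origin, byte[] b) {
        int vlen = a.length;
        byte[] result = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = origin + i;
            result[i] = j < vlen ? a[j] : b[j - vlen];
        }
        return result;
    }

    // Scalar model of prev<N>: the last n lanes of prevChunk followed by
    // the first (vlen - n) lanes of current.
    static byte[] prev(byte[] current, byte[] prevChunk, int n) {
        return slice(prevChunk, prevChunk.length - n, current);
    }
}
```

With a previous chunk of {0, 1, 2, 3} and a current chunk of {4, 5, 6, 7}, prev with n = 1 yields {3, 4, 5, 6}. Only the slice origins near the end of the first vector are needed for this use, which is the subset mentioned above.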


Thanks, useful to know when we get around to optimizing this method.


> Vectors held in final fields of LookupTables might not be treated as constant.

Ah, I forgot about that.

> there is an issue with the way C2 handles constant vectors like zero

Really interesting findings. When I ran my benchmark in December I got 50_000 ops/sec for the all-ascii 20k.txt, but when I tried again with a fresh build on March 3rd it was only 15_145. So perhaps there was a regression in that time,

Yes, I suspect so.


but I also wonder if the vectorIntrinsics branch (which I built) is missing Vladimir's branch-prediction patch. It's a little confusing to me how two different git repos are being used for the same project.

In jdk/jdk: https://github.com/openjdk/jdk/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33

That is from Vladimir’s branch:

https://github.com/iwanowww/jdk/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
https://github.com/openjdk/jdk/compare/master...iwanowww:vector.phi

AFAICT it has not been committed to jdk/master, nor to panama-vector/vectorIntrinsics. Vladimir, what’s the status of this? Still too experimental?


In panama-vector (not found): https://github.com/openjdk/panama-vector/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33


The panama-vector/vectorIntrinsics branch has additional features, API or otherwise, that may be experimental, or need time to bake, before we bring them into the main repository (some of which will be brought in via a JEP).

We will often fix issues directly in jdk/master, and those fixes make their way into panama-vector/vectorIntrinsics when we merge (the most recent merge occurred on March 7th).

Generally, you can consider jdk/master to be a subset of panama-vector/vectorIntrinsics. As such there may be performance differences between the two.

Hth,
Paul.


Regards,

August


On Fri, Mar 5, 2021 at 12:59 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
Looking at the code I spot three general issues with the Vector API:

1. Vector.slice(int origin, Vector<E> v1) is not currently optimized.
We need to fix this.

2. Vectors held in final fields of LookupTables might not be treated as constant.
Even though the LookupTables instance is held in a static field of the benchmark, HotSpot does not by default propagate to final fields.
It might hoist the values outside the loop though (need to verify).
(There is an open bug to track support for final fields being really final. It’s complicated because of reflection and deserialization.)

3. Masked loads are not yet optimal (but since this is performed at the end the impact is likely minimal).
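
As an illustration of why point 2 is hard, the following plain-Java sketch shows a final instance field being overwritten through reflection. Tables is a hypothetical stand-in for a lookup-table holder and is not the LookupTables class from the benchmark; the sketch assumes an ordinary (non-record, non-hidden) class and a JDK that still permits this kind of write, which newer releases may warn about or restrict.

```java
import java.lang.reflect.Field;

final class FinalFieldSketch {
    // Hypothetical holder with a final instance field, assigned in the
    // constructor so that javac does not inline it as a constant.
    static final class Tables {
        final byte[] table;
        Tables(byte[] table) { this.table = table; }
    }

    // Overwrites the final field reflectively and returns what a plain
    // field read sees afterwards.
    static byte[] overwriteFinalField(Tables t, byte[] replacement) {
        try {
            Field f = Tables.class.getDeclaredField("table");
            f.setAccessible(true);   // lifts the access check for this Field object
            f.set(t, replacement);   // writes the final instance field
            return t.table;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because such a write is legal, the JIT cannot blindly fold a final instance field into a compile-time constant, even when the holder itself is reachable from a static final field.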


Digging deeper and focusing on just ASCII (using 20k.txt), I think there is an issue with the way C2 handles constant vectors like zero (it could be a regression): the values are spilled to the stack, which in turn seems to cause other spills.

So, perversely, let's create the zero vector from an array. Here’s your method just focusing on ASCII:


    public static boolean validate(byte[] buf, VectorSpecies<Byte> species, LookupTables lut) {
        // ByteVector zero = ByteVector.zero(species);
        ByteVector zero = ByteVector.fromArray(species, new byte[species.length()], 0);
        ByteVector error = zero;
        Vector<Byte> prevIncomplete = zero;

        int i = 0;
        for (; i < species.loopBound(buf.length); i += species.length()) {
            ByteVector input = ByteVector.fromArray(species, buf, i);

            boolean isUTF8 = input.compare(LT, zero).anyTrue();
            if (!isUTF8) {
                error = error.or(prevIncomplete);
            }
        }

        VectorMask<Byte> m = species.indexInRange(i, buf.length);
        ByteVector input = ByteVector.fromArray(species, buf, i, m);
        boolean isUTF8 = input.compare(LT, zero).anyTrue();

        error = error.or(prevIncomplete);
        return error.compare(EQ, zero).allTrue();
    }
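
For checking results against a baseline, here is a scalar sketch in plain Java of the check the reduced method is built around. It is a reference model only, not an equivalent of the method above (it returns early and tracks no error vector). It assumes that a byte compared as less than zero means the high bit is set, i.e. the byte is not ASCII, that loopBound returns the largest multiple of the vector length not exceeding the array length, and that the masked tail load covers the remaining indices. AsciiScalarSketch and its methods are hypothetical names, and vlen stands in for species.length().

```java
final class AsciiScalarSketch {
    // Largest multiple of vlen that is <= length; the role played by
    // species.loopBound(length) in the vector code.
    static int loopBound(int length, int vlen) {
        return length - (length % vlen);
    }

    // True if any byte in [from, to) is negative, i.e. has its high bit set.
    // This is the scalar counterpart of compare(LT, zero).anyTrue().
    static boolean anyNonAscii(byte[] buf, int from, int to) {
        for (int j = from; j < to; j++) {
            if (buf[j] < 0) {
                return true;
            }
        }
        return false;
    }

    static boolean isAscii(byte[] buf, int vlen) {
        int i = 0;
        int bound = loopBound(buf.length, vlen);
        // Main loop: whole chunks of vlen bytes.
        for (; i < bound; i += vlen) {
            if (anyNonAscii(buf, i, i + vlen)) {
                return false;
            }
        }
        // Tail: the indices a mask from indexInRange(i, buf.length) would select.
        return !anyNonAscii(buf, i, buf.length);
    }
}
```

Calling isAscii(bytes, 32) models a 256-bit species; the result should not depend on vlen, which makes the sketch usable as a cross-check for different species.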

Running this with a recent build of JDK 17, the hot loop is:


 3.35%  ↗  0x000000011a7dcf40:   cmp    %r11d,%r9d
        │  0x000000011a7dcf43:   jae    0x000000011a7dd578
 2.73%  │  0x000000011a7dcf49:   mov    0x20(%rsp),%rcx
 7.02%  │  0x000000011a7dcf4e:   vmovdqu 0x10(%rcx,%r9,1),%ymm2
 8.31%  │  0x000000011a7dcf55:   vpcmpgtb %ymm2,%ymm3,%ymm2
10.43%  │  0x000000011a7dcf59:   vptest %ymm0,%ymm2
13.04%  │  0x000000011a7dcf5e:   setne  %cl
10.57%  │  0x000000011a7dcf61:   movzbl %cl,%ecx
 6.82%  │  0x000000011a7dcf64:   test   %ecx,%ecx
        │  0x000000011a7dcf66:   jne    0x000000011a7dd5a0
 6.59%  │  0x000000011a7dcf6c:   mov    0x118(%r15),%rcx
 6.84%  │  0x000000011a7dcf73:   vpor   %ymm3,%ymm1,%ymm1
 3.51%  │  0x000000011a7dcf77:   add    0x18(%rsp),%r9d
 2.71%  │  0x000000011a7dcf7c:   test   %eax,(%rcx)
 7.60%  │  0x000000011a7dcf7e:   xchg   %ax,%ax
 7.58%  │  0x000000011a7dcf80:   cmp    %r10d,%r9d
        ╰  0x000000011a7dcf83:   jl     0x000000011a7dcf40


That's OK but not great: HotSpot does not unroll, there are redundant bounds checks, the species length is spilled on the stack, and there appears to be a safepoint check.

Something ain’t quite right. I think the loop shape is being “polluted” by the processing of the array tail after the loop (confirmed by removing the array tail processing).

However, things get really bad if we swap in the zero created from the species: performance nosedives by ~7x and there are many spills in the hot loop.

We need a C2 expert to look more closely at why:

1. The loop shape is being affected by processing outside of the loop.
2. Use of the idiomatic zero vector causes so many spills.

Paul.


On Mar 5, 2021, at 9:24 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Hi August,

Thank you for bringing this to the list (I saw your messages on Twitter and was gonna suggest you do just that, but you got there before me).

This is exactly the kind of thing we are looking for to exercise the API and find performance issues. I shall take a closer look.

We have been methodically working through some performance issues based on other use cases; I think we will get there.

Paul.

On Mar 4, 2021, at 3:49 PM, August Nagro <augustnagro at gmail.com> wrote:

Hello,

A while back I implemented simd-json's UTF-8 validation using the
Vector API. It could be considered the first step towards implementing
simd-json completely in Java.

The simd-json developers seem interested, which is cool. The only
problem is that it's very slow, and I don't have the knowledge to make
it faster. Hopefully I can get away with saying it's the Vector API's
fault and not mine. :)

If anyone has suggestions or is interested in grokking the code
(there's not much of it), this is the GitHub repo:
https://github.com/AugustNagro/utf8.java

Cheers,

August





More information about the panama-dev mailing list