UTF-8 Validation with the Vector API (Performance)

August Nagro augustnagro at gmail.com
Mon Mar 8 19:57:33 UTC 2021


That's awesome, thanks for taking a look!

> Vector.slice(int origin, Vector<E> v1) is not currently optimized.

This is how simd-json implements rotate:

template<int N=1>
simdjson_really_inline simd8<T> prev(const simd8<T> prev_chunk) const {
  return _mm256_alignr_epi8(*this, _mm256_permute2x128_si256(prev_chunk, *this, 0x21), 16 - N);
}

It's only two instructions, which is great, and inverting the method would
make it work for a subset of slice origins.

https://github.com/simdjson/simdjson/blob/master/include/simdjson/haswell/simd.h#L52
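
To make the lane movement concrete, here is a minimal scalar sketch of what
prev<N> computes, assuming both chunks are plain byte arrays of equal length.
The class name PrevSketch and the array-based signature are hypothetical and
only for illustration; this is not how simd-json or the Vector API implement
it. If I read the Javadoc correctly, the Vector API spelling of the same
result would be prevChunk.slice(species.length() - N, cur).

```java
import java.util.Arrays;

public class PrevSketch {
    // Scalar model (hypothetical helper): the result starts with the last n
    // bytes of prevChunk, followed by the first (length - n) bytes of cur.
    public static byte[] prev(byte[] prevChunk, byte[] cur, int n) {
        int len = cur.length;
        if (prevChunk.length != len || n < 0 || n > len) {
            throw new IllegalArgumentException(
                "chunks must have equal length and 0 <= n <= length");
        }
        byte[] out = new byte[len];
        // Tail of the previous chunk moves into the low lanes.
        System.arraycopy(prevChunk, len - n, out, 0, n);
        // The current chunk is shifted up by n lanes; its last n bytes drop off.
        System.arraycopy(cur, 0, out, n, len - n);
        return out;
    }

    public static void main(String[] args) {
        byte[] prevChunk = new byte[32];
        byte[] cur = new byte[32];
        for (int i = 0; i < 32; i++) {
            prevChunk[i] = (byte) i;        // 0..31
            cur[i] = (byte) (32 + i);       // 32..63
        }
        // With n = 1 the first lane is prevChunk[31] and the rest is cur[0..30].
        System.out.println(Arrays.toString(prev(prevChunk, cur, 1)));
    }
}
```

The UTF-8 validator only needs N = 1, 2 and 3, so a fast path for origins
close to the vector length would cover this use case.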

> Vectors held in final fields of LookupTables might not be treated as
> constant.

Ah, I forgot about that.

> there is an issue with the way C2 handles constant vectors like zero

Really interesting findings. When I ran my benchmark in December I got
50_000 ops/sec for the all-ASCII 20k.txt, but when I tried again with a
fresh build on March 3rd it was only 15_145. So perhaps there was a
regression in that time, but I also wonder if the vectorIntrinsics branch
(which I built) is missing Vladimir's branch-prediction patch. It's a
little confusing to me how two different git repos are being used for the
same project.

In jdk/jdk:
https://github.com/openjdk/jdk/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
In panama-vector (not found):
https://github.com/openjdk/panama-vector/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33

Regards,

August


On Fri, Mar 5, 2021 at 12:59 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:

> Looking at the code I spot three general issues with the Vector API:
>
> 1. Vector.slice(int origin, Vector<E> v1) is not currently optimized.
> We need to fix this.
>
> 2. Vectors held in final fields of LookupTables might not be treated as
> constant.
> Even though the LookupTables instance is held in a static field of the
> benchmark, HotSpot does not by default propagate to final fields.
> It might hoist the values outside the loop though (need to verify).
> (There is an open bug tracking support for final fields being truly
> final. It’s complicated by reflection and deserialization.)
>
> 3. Masked loads are not yet optimal (but since this is performed at the
> end the impact is likely minimal).
>
>
> Digging deeper and focusing on just ASCII (using 20k.txt) I think there is
> an issue with the way C2 handles constant vectors like zero (could be a
> regression), which causes the values to be spilled on the stack which seems
> to cause other spills.
>
> So, perversely, let's create the zero vector from an array. Here’s your
> method just focusing on ASCII:
>
>     public static boolean validate(byte[] buf, VectorSpecies<Byte> species, LookupTables lut) {
> //        ByteVector zero = ByteVector.zero(species);
>         ByteVector zero = ByteVector.fromArray(species, new byte[species.length()], 0);
>         ByteVector error = zero;
>         Vector<Byte> prevIncomplete = zero;
>
>         int i = 0;
>         for (; i < species.loopBound(buf.length); i += species.length()) {
>             ByteVector input = ByteVector.fromArray(species, buf, i);
>
>             boolean isUTF8 = input.compare(LT, zero).anyTrue();
>             if (!isUTF8) {
>                 error = error.or(prevIncomplete);
>             }
>         }
>
>         VectorMask<Byte> m = species.indexInRange(i, buf.length);
>         ByteVector input = ByteVector.fromArray(species, buf, i, m);
>         boolean isUTF8 = input.compare(LT, zero).anyTrue();
>
>         error = error.or(prevIncomplete);
>         return error.compare(EQ, zero).allTrue();
>     }
>
>
> And run using a recent build of JDK 17. The hot loop is:
>
>
>  3.35%  ↗  0x000000011a7dcf40:   cmp    %r11d,%r9d
>         │  0x000000011a7dcf43:   jae    0x000000011a7dd578
>  2.73%  │  0x000000011a7dcf49:   mov    0x20(%rsp),%rcx
>  7.02%  │  0x000000011a7dcf4e:   vmovdqu 0x10(%rcx,%r9,1),%ymm2
>  8.31%  │  0x000000011a7dcf55:   vpcmpgtb %ymm2,%ymm3,%ymm2
> 10.43%  │  0x000000011a7dcf59:   vptest %ymm0,%ymm2
> 13.04%  │  0x000000011a7dcf5e:   setne  %cl
> 10.57%  │  0x000000011a7dcf61:   movzbl %cl,%ecx
>  6.82%  │  0x000000011a7dcf64:   test   %ecx,%ecx
>         │  0x000000011a7dcf66:   jne    0x000000011a7dd5a0
>  6.59%  │  0x000000011a7dcf6c:   mov    0x118(%r15),%rcx
>  6.84%  │  0x000000011a7dcf73:   vpor   %ymm3,%ymm1,%ymm1
>  3.51%  │  0x000000011a7dcf77:   add    0x18(%rsp),%r9d
>  2.71%  │  0x000000011a7dcf7c:   test   %eax,(%rcx)
>  7.60%  │  0x000000011a7dcf7e:   xchg   %ax,%ax
>  7.58%  │  0x000000011a7dcf80:   cmp    %r10d,%r9d
>         ╰  0x000000011a7dcf83:   jl     0x000000011a7dcf40
>
>
> That's OK, not great: HotSpot does not unroll, there are redundant bounds
> checks, the species length is spilled to the stack, and there appears to be
> a safepoint check.
>
> Something ain’t quite right. I think the loop shape is being “polluted” by
> the processing of the array tail after the loop (confirmed by removing the
> array tail processing).
>
> However, things get really bad if we swap in the zero created from the
> species: performance nosedives by ~7x and there are many spills
> in the hot loop.
>
> We need a C2 expert to look more closely at why:
>
> 1. The loop shape is being affected by processing outside of the loop.
> 2. Use of the idiomatic zero vector causes so many spills.
>
> Paul.
>
>
> On Mar 5, 2021, at 9:24 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Hi August,
>
> Thank you for bringing this to the list (I saw your messages on twitter
> and was gonna suggest you do just that but you got there before me).
>
> This is exactly the kind of thing we are looking for to exercise the API
> and find performance issues. I shall take a closer look.
>
> We have been methodically working through some performance issues based on
> other use cases, I think we will get there.
>
> Paul.
>
> On Mar 4, 2021, at 3:49 PM, August Nagro <augustnagro at gmail.com> wrote:
>
> Hello,
>
> A while back I implemented simd-json's UTF-8 validation using the
> Vector API. It could be considered the first step towards implementing
> simd-json completely with Java.
>
> The simd-json developers seem interested, which is cool. The only
> problem is that it's very slow, and I don't have the knowledge to make
> it faster. Hopefully I can get away with saying it's the Vector API's
> fault and not mine. :)
>
> If anyone has suggestions or is interested in grokking the code
> (there's not much of it), this is the github repo:
> https://github.com/AugustNagro/utf8.java
>
> Cheers,
>
> August