UTF-8 Validation with the Vector API (Performance)
August Nagro
augustnagro at gmail.com
Mon Mar 8 19:57:33 UTC 2021
That's awesome, thanks for taking a look!
> Vector.slice(int origin, Vector<E> v1) is not currently optimized.
This is how simd-json implements rotate:
template<int N=1>
simdjson_really_inline simd8<T> prev(const simd8<T> prev_chunk) const {
  return _mm256_alignr_epi8(*this,
      _mm256_permute2x128_si256(prev_chunk, *this, 0x21), 16 - N);
}
It's two instructions, which is great, and if you invert the method it would
work for a subset of slice origins.
https://github.com/simdjson/simdjson/blob/master/include/simdjson/haswell/simd.h#L52
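
For reference, a scalar sketch of what that prev<N> computes, written in
plain Java over byte arrays instead of vectors so it runs without the
incubator module. The names PrevSketch and prev are made up for
illustration. As far as I can tell this is the same lane arrangement as
prevChunk.slice(species.length() - n, current) in the Vector API, but
treat that equivalence as an assumption, not something verified here.

```java
final class PrevSketch {
    // Lane i of the result holds the byte that sits n positions before
    // lane i of `current` in the input stream. The first n lanes are
    // taken from the tail of `prevChunk`.
    static byte[] prev(byte[] prevChunk, byte[] current, int n) {
        int len = current.length;
        if (prevChunk.length != len || n < 0 || n > len) {
            throw new IllegalArgumentException(
                "chunks must have equal length and 0 <= n <= length");
        }
        byte[] out = new byte[len];
        // First n lanes: the last n bytes of the previous chunk.
        System.arraycopy(prevChunk, len - n, out, 0, n);
        // Remaining lanes: the current chunk moved up by n lanes.
        System.arraycopy(current, 0, out, n, len - n);
        return out;
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3};
        byte[] b = {4, 5, 6, 7};
        // prints [3, 4, 5, 6]
        System.out.println(java.util.Arrays.toString(prev(a, b, 1)));
    }
}
```

A reference like this can be handy for checking a vectorized version lane
by lane on small inputs.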
> Vectors held in final fields of LookupTables might not be treated as
> constant.
Ah, I forgot about that.
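
One possible workaround, sketched under assumptions: copy the tables out
of the final fields into locals before the hot loop, so the loop body
works on locals instead of reloading fields. The example below uses a
plain byte[] table instead of vectors so it runs without the incubator
module. TablesSketch, lowNibble and countFlagged are hypothetical names,
not code from utf8.java, and whether C2 actually does better with this
shape has not been measured.

```java
final class TablesSketch {
    // Instance final field: by default HotSpot does not treat this as a
    // constant, even when the holder sits in a static field.
    final byte[] lowNibble;

    TablesSketch(byte[] lowNibble) {
        this.lowNibble = lowNibble;
    }

    // Counts bytes whose low nibble maps to a non-zero table entry.
    static int countFlagged(byte[] buf, TablesSketch lut) {
        // Read the field once into a local, before the loop.
        final byte[] table = lut.lowNibble;
        int count = 0;
        for (byte b : buf) {
            if (table[b & 0x0F] != 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] table = new byte[16];
        table[0x0F] = 1; // flag only low nibble 0xF
        TablesSketch lut = new TablesSketch(table);
        // prints 2 (0x0F and 0x1F are flagged)
        System.out.println(countFlagged(new byte[] {0x0F, 0x10, 0x1F, 0x20}, lut));
    }
}
```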
> there is an issue with the way C2 handles constant vectors like zero
Really interesting findings. When I ran my benchmark in December I got
50_000 ops/sec for the all-ascii 20k.txt, but when I tried again with a
fresh build on March 3rd it was only 15_145. So perhaps there was a
regression in that time, but I also wonder if the vectorIntrinsics branch
(which I built) is missing Vladimir's branch-prediction patch. It's a
little confusing to me how two different git repos are being used for the
same project.
In jdk/jdk:
https://github.com/openjdk/jdk/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
In panama-vector (not found):
https://github.com/openjdk/panama-vector/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
Regards,
August
On Fri, Mar 5, 2021 at 12:59 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Looking at the code I spot three general issues with the Vector API:
>
> 1. Vector.slice(int origin, Vector<E> v1) is not currently optimized.
> We need to fix this.
>
> 2. Vectors held in final fields of LookupTables might not be treated as
> constant.
> Even though the LookupTables instance is held in a static field of the
> benchmark, HotSpot does not by default propagate to final fields.
> It might hoist the values outside the loop though (need to verify).
> (There is an ongoing bug to track support for final fields being really
> final. It’s complicated due to reflection, and deserialization.)
>
> 3. Masked loads are not yet optimal (but since this is performed at the
> end the impact is likely minimal).
>
>
> Digging deeper and focusing on just ASCII (using 20k.txt) I think there is
> an issue with the way C2 handles constant vectors like zero (could be a
> regression), which causes the values to be spilled on the stack which seems
> to cause other spills.
>
> So, perversely, let's create the zero vector from an array. Here’s your
> method just focusing on ASCII:
>
> public static boolean validate(byte[] buf, VectorSpecies<Byte> species, LookupTables lut) {
>     // ByteVector zero = ByteVector.zero(species);
>     ByteVector zero = ByteVector.fromArray(species, new byte[species.length()], 0);
>     ByteVector error = zero;
>     Vector<Byte> prevIncomplete = zero;
>
>     int i = 0;
>     for (; i < species.loopBound(buf.length); i += species.length()) {
>         ByteVector input = ByteVector.fromArray(species, buf, i);
>
>         boolean isUTF8 = input.compare(LT, zero).anyTrue();
>         if (!isUTF8) {
>             error = error.or(prevIncomplete);
>         }
>     }
>
>     VectorMask<Byte> m = species.indexInRange(i, buf.length);
>     ByteVector input = ByteVector.fromArray(species, buf, i, m);
>     boolean isUTF8 = input.compare(LT, zero).anyTrue();
>
>     error = error.or(prevIncomplete);
>     return error.compare(EQ, zero).allTrue();
> }
>
>
> And run using a recent build of 17. The hot loop is:
>
>
> 3.35% ↗ 0x000000011a7dcf40: cmp %r11d,%r9d
> │ 0x000000011a7dcf43: jae 0x000000011a7dd578
> 2.73% │ 0x000000011a7dcf49: mov 0x20(%rsp),%rcx
> 7.02% │ 0x000000011a7dcf4e: vmovdqu 0x10(%rcx,%r9,1),%ymm2
> 8.31% │ 0x000000011a7dcf55: vpcmpgtb %ymm2,%ymm3,%ymm2
> 10.43% │ 0x000000011a7dcf59: vptest %ymm0,%ymm2
> 13.04% │ 0x000000011a7dcf5e: setne %cl
> 10.57% │ 0x000000011a7dcf61: movzbl %cl,%ecx
> 6.82% │ 0x000000011a7dcf64: test %ecx,%ecx
> │ 0x000000011a7dcf66: jne 0x000000011a7dd5a0
> 6.59% │ 0x000000011a7dcf6c: mov 0x118(%r15),%rcx
> 6.84% │ 0x000000011a7dcf73: vpor %ymm3,%ymm1,%ymm1
> 3.51% │ 0x000000011a7dcf77: add 0x18(%rsp),%r9d
> 2.71% │ 0x000000011a7dcf7c: test %eax,(%rcx)
> 7.60% │ 0x000000011a7dcf7e: xchg %ax,%ax
> 7.58% │ 0x000000011a7dcf80: cmp %r10d,%r9d
> ╰ 0x000000011a7dcf83: jl 0x000000011a7dcf40
>
>
> That's ok, not great: HotSpot does not unroll, there are redundant bounds
> checks, the species length is spilled on the stack, and there appears to be
> a safepoint check.
>
> Something ain’t quite right. I think the loop shape is being “polluted” by
> the processing of the array tail after the loop (confirmed by removing the
> array tail processing).
>
> However, things get really bad if we swap in zero created from the
> species: performance nosedives by ~7x and there are many spills
> in the hot loop.
>
> We need a C2 expert to look more closely at why:
>
> 1. The loop shape is being affected by processing outside of the loop.
> 2. Use of the idiomatic zero vector causes so many spills.
>
> Paul.
>
>
> On Mar 5, 2021, at 9:24 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Hi August,
>
> Thank you for bringing this to the list (I saw your messages on twitter
> and was gonna suggest you do just that but you got there before me).
>
> This is exactly the kind of thing we are looking for to exercise the API
> and find performance issues. I shall take a closer look.
>
> We have been methodically working through some performance issues based on
> other use cases, I think we will get there.
>
> Paul.
>
> On Mar 4, 2021, at 3:49 PM, August Nagro <augustnagro at gmail.com> wrote:
>
> Hello,
>
> A while back I implemented simd-json's UTF-8 validation using the
> vector API. It could be considered the first step towards implementing
> simd-json completely with Java.
>
> The simd-json developers seem interested, which is cool. The only
> problem is that it's very slow, and I don't have the knowledge to make
> it faster. Hopefully I can get away with saying it's the Vector API's
> fault and not mine. :)
>
> If anyone has suggestions or is interested in grokking the code
> (there's not much of it), this is the github repo:
> https://github.com/AugustNagro/utf8.java
>
> Cheers,
>
> August
>
>
>
>