UTF-8 Validation with the Vector API (Performance)

Paul Sandoz paul.sandoz at oracle.com
Mon Mar 8 22:01:42 UTC 2021


Hmm… for some reason the quoting has been removed, making it really hard for you to read my inline replies :-(
I'm unsure what changed to cause this (Mac Mailer, sendmail server, or the openjdk mail server…)

Paul.


> On Mar 8, 2021, at 1:02 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> 
> 
> On Mar 8, 2021, at 11:57 AM, August Nagro <augustnagro at gmail.com> wrote:
> 
> That's awesome, thanks for taking a look!
> 
>> Vector.slice(int origin, Vector<E> v1) is not currently optimized.
> 
> This is how simd-json implements rotate:
> 
> template<int N=1>
> simdjson_really_inline simd8<T> prev(const simd8<T> prev_chunk) const {
>  return _mm256_alignr_epi8(*this, _mm256_permute2x128_si256(prev_chunk, *this, 0x21), 16 - N);
> }
> 
> It's two instructions, which is great, and if you invert the method it would work for a subset of slice origins.
> 
> https://github.com/simdjson/simdjson/blob/master/include/simdjson/haswell/simd.h#L52
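
For readers less familiar with the intrinsic, here is a minimal scalar sketch of what prev<N> computes: the last N bytes of the previous chunk followed by the first (length - N) bytes of the current one. The class and method names are hypothetical, and plain byte arrays stand in for vector registers; it models the semantics only, not the two-instruction AVX2 encoding. In Vector API terms this should correspond to prevChunk.slice(species.length() - N, cur).

```java
import java.util.Arrays;

public class PrevSketch {
    // Scalar model (hypothetical helper, not simdjson or Vector API code):
    // returns the last n bytes of prevChunk followed by the first
    // (len - n) bytes of cur, where both chunks have the same length.
    public static byte[] prev(byte[] prevChunk, byte[] cur, int n) {
        int len = cur.length;
        if (prevChunk.length != len || n < 0 || n > len) {
            throw new IllegalArgumentException("chunks must match and 0 <= n <= length");
        }
        byte[] out = new byte[len];
        System.arraycopy(prevChunk, len - n, out, 0, n); // tail of the previous chunk
        System.arraycopy(cur, 0, out, n, len - n);       // head of the current chunk
        return out;
    }

    public static void main(String[] args) {
        byte[] p = {0, 1, 2, 3, 4, 5, 6, 7};
        byte[] c = {8, 9, 10, 11, 12, 13, 14, 15};
        // prints [7, 8, 9, 10, 11, 12, 13, 14]
        System.out.println(Arrays.toString(prev(p, c, 1)));
    }
}
```

An 8-byte chunk is used only to keep the printed output short; the AVX2 version quoted above works on 32-byte registers.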
> 
> 
> Thanks, useful to know when we get around to optimizing this method.
> 
> 
>> Vectors held in final fields of LookupTables might not be treated as constant.
> 
> Ah, I forgot about that.
> 
>> there is an issue with the way C2 handles constant vectors like zero
> 
> Really interesting findings. When I ran my benchmark in December I got 50_000 ops/sec for the all-ascii 20k.txt, but when I tried again with a fresh build on March 3rd it was only 15_145. So perhaps there was a regression in that time,
> 
> Yes, I suspect so.
> 
> 
> but I also wonder if the vectorIntrinsics branch (which I built) is missing Vladimir's branch-prediction patch. It's a little confusing to me how two different git repos are being used for the same project.
> 
> In jdk/jdk: https://github.com/openjdk/jdk/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
> 
> That is from Vladimir’s branch:
> 
> https://github.com/iwanowww/jdk/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
> https://github.com/openjdk/jdk/compare/master...iwanowww:vector.phi
> 
> AFAICT it has not been committed to jdk/master, nor to panama-vector/vectorIntrinsics. Vladimir, what’s the status of this? Still too experimental?
> 
> 
> In panama-vector (not found): https://github.com/openjdk/panama-vector/commit/28fcb5aebf8885c63ce97a064a1d8e4ef89b0a33
> 
> 
> The panama-vector/vectorIntrinsics branch has additional features, API or otherwise, that may be experimental, or need time to bake, before we bring them into the main repository (some of which will be brought in via a JEP).
> 
> We will often fix issues directly in jdk/master; those fixes then make their way into panama-vector/vectorIntrinsics when we merge (the most recent merge occurred on March 7th).
> 
> Generally, you can consider jdk/master to be a subset of panama-vector/vectorIntrinsics. As such there may be performance differences between the two.
> 
> Hth,
> Paul.
> 
> 
> Regards,
> 
> August
> 
> 
> On Fri, Mar 5, 2021 at 12:59 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Looking at the code I spot three general issues with the Vector API:
> 
> 1. Vector.slice(int origin, Vector<E> v1) is not currently optimized.
> We need to fix this.
> 
> 2. Vectors held in final fields of LookupTables might not be treated as constant.
> Even though the LookupTables instance is held in a static field of the benchmark, HotSpot does not by default propagate constants through final instance fields.
> It might hoist the values outside the loop though (need to verify).
> (There is an ongoing bug to track support for final fields being really final. It’s complicated due to reflection and deserialization.)
> 
> 3. Masked loads are not yet optimal (but since this is performed at the end the impact is likely minimal).
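
On point 2, one way to avoid depending on whether HotSpot folds or hoists the final field is to read it into a local before the loop. Below is a minimal sketch, with a hypothetical byte[] table standing in for the vectors held by LookupTables (the real class lives in the utf8.java repo and is not reproduced here). Whether this changes the generated code for the benchmark is untested.

```java
public class HoistSketch {
    // Hypothetical stand-in for the benchmark's LookupTables: a final
    // instance field holding a 16-entry table indexed by the low nibble.
    static final class LookupTables {
        final byte[] lowNibble;

        LookupTables(byte[] lowNibble) {
            this.lowNibble = lowNibble;
        }
    }

    // The instance is held in a static final field, as in the benchmark.
    static final LookupTables LUT = new LookupTables(
            new byte[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15});

    // Reads the final instance field inside the loop body.
    public static int classifyViaField(byte[] buf) {
        int acc = 0;
        for (byte b : buf) {
            acc += LUT.lowNibble[b & 0x0F];
        }
        return acc;
    }

    // Reads the field once into a local, so the loop body only touches
    // the local regardless of what the JIT decides about the field.
    public static int classifyViaLocal(byte[] buf) {
        final byte[] table = LUT.lowNibble;
        int acc = 0;
        for (byte b : buf) {
            acc += table[b & 0x0F];
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] buf = {0x10, 0x21, 0x3F};
        System.out.println(classifyViaField(buf) == classifyViaLocal(buf));
    }
}
```

Both methods compute the same result; the only difference is where the field read appears in the source.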
> 
> 
> Digging deeper and focusing on just ASCII (using 20k.txt), I think there is an issue with the way C2 handles constant vectors like zero (it could be a regression): the values are spilled to the stack, which in turn seems to cause other spills.
> 
> So, perversely, let's create the zero vector from an array. Here’s your method just focusing on ASCII:
> 
> 
>    // Assumes: import jdk.incubator.vector.*; and
>    // import static jdk.incubator.vector.VectorOperators.*; (for LT and EQ)
>    public static boolean validate(byte[] buf, VectorSpecies<Byte> species, LookupTables lut) {
>        // The idiomatic zero vector, commented out for this experiment:
> //        ByteVector zero = ByteVector.zero(species);
>        ByteVector zero = ByteVector.fromArray(species, new byte[species.length()], 0);
>        ByteVector error = zero;
>        Vector<Byte> prevIncomplete = zero;
> 
>        int i = 0;
>        for (; i < species.loopBound(buf.length); i += species.length()) {
>            ByteVector input = ByteVector.fromArray(species, buf, i);
> 
>            // True if any lane is negative, i.e. the chunk contains a non-ASCII byte
>            boolean isUTF8 = input.compare(LT, zero).anyTrue();
>            if (!isUTF8) {
>                error = error.or(prevIncomplete);
>            }
>        }
> 
>        // Array tail: masked load of the remaining (fewer than species.length()) bytes
>        VectorMask<Byte> m = species.indexInRange(i, buf.length);
>        ByteVector input = ByteVector.fromArray(species, buf, i, m);
>        boolean isUTF8 = input.compare(LT, zero).anyTrue();
> 
>        error = error.or(prevIncomplete);
>        return error.compare(EQ, zero).allTrue();
>    }
> 
> And run using a recent build of 17. The hot loop is:
> 
> 
> 3.35%  ↗  0x000000011a7dcf40:   cmp    %r11d,%r9d
>        │  0x000000011a7dcf43:   jae    0x000000011a7dd578
> 2.73%  │  0x000000011a7dcf49:   mov    0x20(%rsp),%rcx
> 7.02%  │  0x000000011a7dcf4e:   vmovdqu 0x10(%rcx,%r9,1),%ymm2
> 8.31%  │  0x000000011a7dcf55:   vpcmpgtb %ymm2,%ymm3,%ymm2
> 10.43%  │  0x000000011a7dcf59:   vptest %ymm0,%ymm2
> 13.04%  │  0x000000011a7dcf5e:   setne  %cl
> 10.57%  │  0x000000011a7dcf61:   movzbl %cl,%ecx
> 6.82%  │  0x000000011a7dcf64:   test   %ecx,%ecx
>        │  0x000000011a7dcf66:   jne    0x000000011a7dd5a0
> 6.59%  │  0x000000011a7dcf6c:   mov    0x118(%r15),%rcx
> 6.84%  │  0x000000011a7dcf73:   vpor   %ymm3,%ymm1,%ymm1
> 3.51%  │  0x000000011a7dcf77:   add    0x18(%rsp),%r9d
> 2.71%  │  0x000000011a7dcf7c:   test   %eax,(%rcx)
> 7.60%  │  0x000000011a7dcf7e:   xchg   %ax,%ax
> 7.58%  │  0x000000011a7dcf80:   cmp    %r10d,%r9d
>        ╰  0x000000011a7dcf83:   jl     0x000000011a7dcf40
> 
> 
> That's OK, but not great: HotSpot does not unroll the loop, there are redundant bounds checks, the species length is spilled to the stack, and there appears to be a safepoint check.
> 
> Something ain’t quite right. I think the loop shape is being “polluted” by the processing of the array tail after the loop (confirmed by removing the array tail processing).
> 
> However, things get really bad if we swap in the zero created from the species: performance nosedives by ~7x and there are many spills in the hot loop.
> 
> We need a C2 expert to look more closely at why:
> 
> 1. The loop shape is being affected by processing outside of the loop.
> 2. Use of the idiomatic zero vector causes so many spills.
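
To make the loop structure under discussion concrete, here is a scalar model of how the method splits the array: full chunks up to loopBound, then one masked tail chunk selected by indexInRange. The helpers are hypothetical re-implementations over plain arrays, written from the documented behaviour of VectorSpecies.loopBound and VectorSpecies.indexInRange; they are not the Vector API itself, and vlen stands in for species.length().

```java
import java.nio.charset.StandardCharsets;

public class LoopBoundSketch {
    // Model of loopBound: the largest multiple of vlen that is <= length.
    public static int loopBound(int length, int vlen) {
        return length - (length % vlen);
    }

    // Model of indexInRange: lane n is set iff offset + n lies in [0, limit).
    public static boolean[] indexInRange(int offset, int limit, int vlen) {
        boolean[] mask = new boolean[vlen];
        for (int n = 0; n < vlen; n++) {
            int idx = offset + n;
            mask[n] = idx >= 0 && idx < limit;
        }
        return mask;
    }

    // Scalar ASCII check with the same main-loop-plus-masked-tail structure.
    // A negative byte has its high bit set, so it is not ASCII.
    public static boolean isAscii(byte[] buf, int vlen) {
        int i = 0;
        int bound = loopBound(buf.length, vlen);
        for (; i < bound; i += vlen) {          // full chunks
            for (int n = 0; n < vlen; n++) {
                if (buf[i + n] < 0) {
                    return false;
                }
            }
        }
        boolean[] mask = indexInRange(i, buf.length, vlen); // tail lanes only
        for (int n = 0; n < vlen; n++) {
            if (mask[n] && buf[i + n] < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "plain ascii text".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(isAscii(ascii, 8)); // prints true
        System.out.println(isAscii(utf8, 8));  // prints false
    }
}
```

The tail pass is the part that sits outside the hot loop; in the vector version it is the masked fromArray call.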
> 
> Paul.
> 
> 
> On Mar 5, 2021, at 9:24 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> Hi August,
> 
> Thank you for bringing this to the list (I saw your messages on twitter and was gonna suggest you do just that but you got there before me).
> 
> This is exactly the kind of thing we are looking for to exercise the API and find performance issues. I shall take a closer look.
> 
> We have been methodically working through some performance issues based on other use cases, and I think we will get there.
> 
> Paul.
> 
> On Mar 4, 2021, at 3:49 PM, August Nagro <augustnagro at gmail.com> wrote:
> 
> Hello,
> 
> A while back I implemented simd-json's UTF-8 validation using the
> Vector API. It could be considered the first step towards implementing
> simd-json completely in Java.
> 
> The simd-json developers seem interested, which is cool. The only
> problem is that it's very slow, and I don't have the knowledge to make
> it faster. Hopefully I can get away with saying it's the Vector API's
> fault and not mine. :)
> 
> If anyone has suggestions or is interested in grokking the code
> (there's not much of it), this is the GitHub repo:
> https://github.com/AugustNagro/utf8.java
> 
> Cheers,
> 
> August


