UTF-8 Validation with the Vector API (Performance)
Paul Sandoz
paul.sandoz at oracle.com
Fri Mar 5 20:58:55 UTC 2021
Looking at the code I spot three general issues with the Vector API:
1. Vector.slice(int origin, Vector<E> v1) is not currently optimized.
We need to fix this.
2. Vectors held in final fields of LookupTables might not be treated as constant.
Even though the LookupTables instance is held in a static field of the benchmark, HotSpot does not by default propagate to final fields.
It might hoist the values outside the loop though (need to verify).
(There is an open bug tracking support for final fields being really final. It’s complicated by reflection and deserialization.)
3. Masked loads are not yet optimal (but since this is performed at the end the impact is likely minimal).
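To make the first point concrete, here is a minimal scalar sketch of the lane movement that Vector.slice(int origin, Vector<E> v1) is documented to perform: it takes VLEN adjacent lanes starting at origin in the current vector and continues, as needed, into v1. Plain byte arrays stand in for vectors, and the class and method names (SliceModel, slice) are hypothetical and not part of the Vector API or of the benchmark.

```java
/** Scalar model of the lane movement done by Vector.slice(int origin, Vector<E> v1). */
final class SliceModel {

    /**
     * Returns lanes [origin, origin + VLEN) of the concatenation of a and b,
     * where VLEN = a.length. Arrays stand in for vectors (an assumption made
     * for illustration only).
     */
    static byte[] slice(byte[] a, int origin, byte[] b) {
        int vlen = a.length;
        if (b.length != vlen || origin < 0 || origin > vlen) {
            throw new IllegalArgumentException("mismatched lengths or origin out of range");
        }
        byte[] result = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = origin + i;
            // Lanes past the end of a are taken from the start of b.
            result[i] = j < vlen ? a[j] : b[j - vlen];
        }
        return result;
    }
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, slice(a, 3, b) yields {4, 5, 6, 7}. An unoptimized slice presumably falls back to something close to this per-lane loop rather than a single shuffle or align instruction, which would explain why it shows up in a hot path.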
Digging deeper and focusing on just ASCII (using 20k.txt), I think there is an issue with the way C2 handles constant vectors like zero (this could be a regression). The values are spilled to the stack, which in turn seems to cause other spills.
So, perversely, let's create the zero vector from an array. Here’s your method, reduced to just the ASCII case:
    public static boolean validate(byte[] buf, VectorSpecies<Byte> species, LookupTables lut) {
        // ByteVector zero = ByteVector.zero(species);
        ByteVector zero = ByteVector.fromArray(species, new byte[species.length()], 0);
        ByteVector error = zero;
        Vector<Byte> prevIncomplete = zero;
        int i = 0;
        for (; i < species.loopBound(buf.length); i += species.length()) {
            ByteVector input = ByteVector.fromArray(species, buf, i);
            boolean isUTF8 = input.compare(LT, zero).anyTrue();
            if (!isUTF8) {
                error = error.or(prevIncomplete);
            }
        }
        VectorMask<Byte> m = species.indexInRange(i, buf.length);
        ByteVector input = ByteVector.fromArray(species, buf, i, m);
        boolean isUTF8 = input.compare(LT, zero).anyTrue();
        error = error.or(prevIncomplete);
        return error.compare(EQ, zero).allTrue();
    }
And run using a recent build of 17. The hot loop is:
3.35% ↗ 0x000000011a7dcf40: cmp %r11d,%r9d
│ 0x000000011a7dcf43: jae 0x000000011a7dd578
2.73% │ 0x000000011a7dcf49: mov 0x20(%rsp),%rcx
7.02% │ 0x000000011a7dcf4e: vmovdqu 0x10(%rcx,%r9,1),%ymm2
8.31% │ 0x000000011a7dcf55: vpcmpgtb %ymm2,%ymm3,%ymm2
10.43% │ 0x000000011a7dcf59: vptest %ymm0,%ymm2
13.04% │ 0x000000011a7dcf5e: setne %cl
10.57% │ 0x000000011a7dcf61: movzbl %cl,%ecx
6.82% │ 0x000000011a7dcf64: test %ecx,%ecx
│ 0x000000011a7dcf66: jne 0x000000011a7dd5a0
6.59% │ 0x000000011a7dcf6c: mov 0x118(%r15),%rcx
6.84% │ 0x000000011a7dcf73: vpor %ymm3,%ymm1,%ymm1
3.51% │ 0x000000011a7dcf77: add 0x18(%rsp),%r9d
2.71% │ 0x000000011a7dcf7c: test %eax,(%rcx)
7.60% │ 0x000000011a7dcf7e: xchg %ax,%ax
7.58% │ 0x000000011a7dcf80: cmp %r10d,%r9d
╰ 0x000000011a7dcf83: jl 0x000000011a7dcf40
That's OK, but not great: HotSpot does not unroll the loop, there are redundant bounds checks, the species length is spilled to the stack, and there appears to be a safepoint check.
Something ain’t quite right. I think the loop shape is being “polluted” by the processing of the array tail after the loop (confirmed by removing the array tail processing).
However, things get really bad if we swap in the zero vector created from the species: performance nosedives by ~7x and there are many spills in the hot loop.
We need a C2 expert to look more closely at why:
1. The loop shape is being affected by processing outside of the loop.
2. Use of the idiomatic zero vector causes so many spills.
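For readers following the first question, the split between the main loop and the array tail can be modelled in scalar code. This is only a sketch: plain loops stand in for vector operations, vlen stands in for species.length() and is assumed to be a power of two, and the names (TailSplitModel, hasNonAscii) are hypothetical. The loopBound and indexInRange helpers mirror the documented behaviour of the VectorSpecies methods of the same names.

```java
/** Scalar model of the main-loop/tail split used by a species-length loop. */
final class TailSplitModel {

    /** Largest multiple of vlen that is <= length (vlen assumed to be a power of two). */
    static int loopBound(int length, int vlen) {
        return length & -vlen;
    }

    /** Lane n is set when offset + n is a valid index below limit. */
    static boolean[] indexInRange(int offset, int limit, int vlen) {
        boolean[] mask = new boolean[vlen];
        for (int n = 0; n < vlen; n++) {
            int idx = offset + n;
            mask[n] = idx >= 0 && idx < limit;
        }
        return mask;
    }

    /** True if any byte has its high bit set, i.e. is negative as a Java byte. */
    static boolean hasNonAscii(byte[] buf, int vlen) {
        int i = 0;
        int bound = loopBound(buf.length, vlen);
        // Main loop: full blocks only, so no per-lane bounds checks are needed.
        for (; i < bound; i += vlen) {
            for (int n = 0; n < vlen; n++) {
                if (buf[i + n] < 0) return true;
            }
        }
        // Tail: one masked block covering the remaining buf.length - i bytes.
        boolean[] mask = indexInRange(i, buf.length, vlen);
        for (int n = 0; n < vlen; n++) {
            if (mask[n] && buf[i + n] < 0) return true;
        }
        return false;
    }
}
```

For a 70-byte buffer and vlen = 32, the main loop covers indices 0 to 63 and the masked tail covers the last 6 bytes. Ideally the tail code would have no influence on how the main loop is compiled, which is what the first question is about.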
Paul.
On Mar 5, 2021, at 9:24 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
Hi August,
Thank you for bringing this to the list (I saw your messages on twitter and was gonna suggest you do just that but you got there before me).
This is exactly the kind of thing we are looking for to exercise the API and find performance issues. I shall take a closer look.
We have been methodically working through some performance issues based on other use cases; I think we will get there.
Paul.
On Mar 4, 2021, at 3:49 PM, August Nagro <augustnagro at gmail.com> wrote:
Hello,
A while back I implemented simd-json's UTF-8 validation using the
vector API. It could be considered the first step towards implementing
simd-json completely with Java.
The simd-json developers seem interested, which is cool. The only
problem is that it's very slow, and I don't have the knowledge to make
it faster. Hopefully I can get away with saying it's the Vector API's
fault and not mine. :)
If anyone has suggestions or is interested in grokking the code
(there's not much of it), this is the github repo:
https://github.com/AugustNagro/utf8.java
Cheers,
August