Vectorized BitSet operations

Jerven Bolleman me at jerven.eu
Mon Nov 17 16:26:58 UTC 2014


Hi Krystal,

Thanks for this getting started pointer. I feel I am making good progress
in the five minutes here and there.

I must say I am impressed by how easy it is to read and work with the Graal
compiler code as a Java dev. It's much more approachable than the C++
HotSpot code: here I only need to learn assembly, whereas for HotSpot I
would need to learn both C++ and assembly.

I am currently investigating code that looks like ArrayEqualsOp and will
try the snippet approach as well, to see which is nicer and more
maintainable.

I am quite slow because I am actually learning assembly as I go along.
Which brings me to the following remark: javadoc for the assembly ops would
be really nice. Right now I need to keep the manual open at all times.

Also, grouping operations by the architecture extension they belong to
would help noobs like me at this level as well, e.g. masm.andpd() could be
masm.sse2().andpd().
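
To illustrate what I mean (a toy sketch; the class and method names here
are made up for illustration, not Graal's actual assembler API):

```java
// Hypothetical sketch of grouping assembler ops by the ISA extension that
// introduced them, so the call site documents the hardware requirement.
// AsmSketch, sse2() and the emitted text are all illustrative only.
class AsmSketch {
    final StringBuilder out = new StringBuilder();

    /** Accessor exposing only operations that require SSE2. */
    SSE2Ops sse2() { return new SSE2Ops(); }

    class SSE2Ops {
        /** ANDPD: bitwise AND of packed double-precision values (SSE2). */
        void andpd(String dst, String src) {
            out.append("andpd ").append(dst).append(", ").append(src).append('\n');
        }
    }
}
```

That way masm.sse2().andpd(...) makes the required extension visible at
every call site, and the javadoc has a natural place to live.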

This leads me to the next question: how should I write performance/unit
tests for intrinsics, considering that an intrinsic should be removed the
moment the compiler generates better or equivalent code?
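
For the correctness half, the only pattern I can think of is cross-checking
the intrinsified call against a plain scalar reference on randomized
inputs; a sketch of that idea (not anything from the Graal tree, which I
haven't studied yet):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: compare an intrinsic candidate (here Arrays.equals) against a
// naive scalar reference. The test stays valid whether the intrinsic
// fires or the compiler's own generated code wins; the performance side
// would be a separate benchmark, e.g. with a harness such as JMH.
class IntrinsicCrossCheck {
    // Scalar reference implementation, written as the obvious loop.
    static boolean referenceEquals(byte[] a, byte[] b) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    // Run `rounds` randomized comparisons from a fixed seed.
    static void crossCheck(long seed, int rounds) {
        Random r = new Random(seed);
        for (int i = 0; i < rounds; i++) {
            byte[] a = new byte[r.nextInt(64)];
            r.nextBytes(a);
            byte[] b = r.nextBoolean() ? a.clone()
                                       : Arrays.copyOf(a, a.length + 1);
            if (Arrays.equals(a, b) != referenceEquals(a, b))
                throw new AssertionError("mismatch at round " + i);
        }
    }
}
```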

Regards,
Jerven



On Fri, Nov 14, 2014 at 8:33 PM, Krystal Mok <rednaxelafx at gmail.com> wrote:
Hi Jerven,

(I thought I was reading email from hotspot-compiler-dev and then realized
it's graal-dev...but here I go anyway)

If you're interested in taking HotSpot C2's auto vectorization as a
reference, here's the RFE that implemented the vectorization feature:
JDK-6340864: Implement vectorization optimizations in hotspot-server
[1][2]. It's still mostly based on the superword algorithm described in [3].

I don't think Graal implements that algorithm now. It certainly would be
useful to have it.

Graal does use vector (SIMD) instructions for some intrinsics, e.g.
ArrayEqualsOp and UnsafeArrayCopySnippet. If you want to improve
performance for specific methods, this is the easier way to go, with good
existing examples.

- Kris

[1]: https://bugs.openjdk.java.net/browse/JDK-6340864
[2]: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/006050192a5a
[3]: Exploiting superword level parallelism with multimedia instruction
sets, http://dl.acm.org/citation.cfm?id=358438.349320

On Fri, Nov 14, 2014 at 8:38 AM, Jerven Bolleman <me at jerven.eu> wrote:
Hi All,

My apologies in advance if I am rather clueless.

I am interested in seeing SSE/AVX instructions used for some
methods in java.util.BitSet, e.g. the "or" and "and" methods.

For example, I believe that this loop
<http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java#681>
can be vectorized.
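
The loop in question boils down to this pattern (a simplified standalone
sketch of the shape, not the JDK source verbatim):

```java
// Word-wise OR over two long[] arrays, as in BitSet.or. Every iteration
// is independent (no loop-carried dependency), so the body maps directly
// onto packed OR instructions such as POR (SSE2) or VPOR (AVX2), which
// process two or four 64-bit words per instruction.
class BitSetOrSketch {
    static void orWords(long[] dst, long[] src, int wordsInCommon) {
        for (int i = 0; i < wordsInCommon; i++) {
            dst[i] |= src[i];  // one 64-bit OR per iteration when scalar
        }
    }
}
```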

My first thought was that an intrinsic/MethodSubstitution would be a good
start.
However, it would be even nicer if this type of code can be auto vectorized.
I was wondering if anyone has worked on this or if this is already the case?

Because working auto vectorization could greatly accelerate this loop
<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/3.0.1/org/apache/lucene/util/OpenBitSet.java#661>
used in Lucene, while the intrinsic covers a narrower set of cases.

In my specific case a bitset of 20 MB or larger is not uncommon, so in
these cases doing more work per tick is interesting.

If I wanted to have a go at implementing the intrinsic are there any
pitfalls I need to be aware of?

Regards,
Jerven

--
Jerven Bolleman
me at jerven.eu




-- 
Jerven Bolleman
me at jerven.eu

