Vectorized BitSet operations

Fri Nov 14 19:33:36 UTC 2014

Hi Jerven,

(I thought I was reading email from hotspot-compiler-dev and then realized
it's graal-dev...but here I go anyway)

If you're interested in taking HotSpot C2's auto vectorization for a
reference, here's the RFE that implemented the vectorization feature:
JDK-6340864: Implement vectorization optimizations in hotspot-server
[1][2]. It's mostly still based on the superword algorithm described in [3].

I don't think Graal implements that algorithm now. It certainly would be
useful to have it.

Graal does use vector (SIMD) instructions for some intrinsics, e.g.
ArrayEqualsOp and UnsafeArrayCopySnippet. If you want to improve
performance for specific methods, this is the easier way to go, with good
existing examples.

- Kris

[1]: https://bugs.openjdk.java.net/browse/JDK-6340864
[2]: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/006050192a5a
[3]: Exploiting superword level parallelism with multimedia instruction
sets, http://dl.acm.org/citation.cfm?id=358438.349320

On Fri, Nov 14, 2014 at 8:38 AM, Jerven Bolleman <me at jerven.eu> wrote:

> Hi All,
>
> My apologies in advance if I am a rather clueless.
>
> I am interested in seeing the use of SSE/AVX instructions use for some
> methods in java.util.Bitset. e.g. the "or" and "and" methods.
>
> For example I believe that this loop
> <
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java#681
> >
> can be vectorized.
>
> My first thought was that an intrinsic/MethodSubstitution would be a good
> start.
> However, it would be even nicer if this type of code can be auto
> vectorized.
> I was wondering if anyone has worked on this or if this is already the
> case?
>
> Because working auto vectorization could accelerate this loop
> <
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/3.0.1/org/apache/lucene/util/OpenBitSet.java#661
> >
> used in Lucene
> a lot. While the intrinsic has a smaller use case.
>
> In my specific case a bitset of 20mb or larger is not uncommon. So in these
> cases doing more work per tick is interesting.
>
> If I wanted to have a go at implementing the intrinsic are there any
> pitfalls I need to be aware of?
>
> Regards,
> Jerven
>
>
>
>
>
>
>
>
>
> --
> Jerven Bolleman
> me at jerven.eu
>