[vector] some notes on bitwise and reduction operations

Wed May 16 21:45:46 UTC 2018

Hi Paul,

Thanks for these observations. Here are my thoughts on some of these.

Bit shifting:
- x86 hardware does not support shifting byte vectors by a scalar. Also, by default on x86 the shift count saturates, while Java semantics truncate it. That said, shifting short vectors by a scalar is currently supported via Vector API intrinsification, because the auto-vectorizer also supports it. This has some limitations too: for example, logical right shift is not supported because of a mismatch between Java semantics, which sign-extend short/byte to int before shifting, and the vector instruction, which would simply shift in zeros. (However, we can change our Java implementation to match this.)
- Intrinsification of shifts by scalar does indeed mask off the shift count with either 0x1f (int) or 0x3f (long). That said, your comment led me to discover an issue with variable shifts, where this masking does not occur: saturation is used instead of truncation.
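
To make the truncation vs. saturation difference concrete, here is a rough scalar sketch. It is not Vector API code, the class and method names are made up for illustration, and the saturating method is only my simplified model of a lane whose out-of-range count shifts every bit out:

```java
// Scalar model only; class and method names are hypothetical, not Vector API.
final class ShiftCountDemo {
    // Java semantics: an int shift uses only the low 5 bits of the count.
    static int truncatingShl(int x, int count) {
        return x << (count & 0x1f); // identical to x << count in Java
    }

    // Saturating model: any count outside 0..31 shifts every bit out.
    static int saturatingShl(int x, int count) {
        return (count & ~0x1f) != 0 ? 0 : x << count;
    }

    // Java semantics for long: only the low 6 bits of the count are used.
    static long truncatingShl(long x, int count) {
        return x << (count & 0x3f);
    }

    public static void main(String[] args) {
        System.out.println(truncatingShl(1, 33));  // count 33 behaves like 1
        System.out.println(saturatingShl(1, 33));  // every bit is shifted out
        System.out.println(truncatingShl(1L, 65)); // count 65 behaves like 1
    }
}
```

The two int methods agree for counts 0..31 and differ everywhere else, which is the kind of corner case the shift tests should hit.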

My suggestions on bit shifting:
- Write shift tests that cover all of these corner cases.
- Fix variable shifting to mask off shift counts.
- Update the spec to mention the shift count masking (which we keep the same as for scalars, so that scalar and vector semantics can be reasoned about together).
- Either get rid of byte and short shift by scalar, OR plan to implement byte shift by scalar via emulation in the future and update the default logical right shift implementation to special-case byte and short so that the sign bit is not extended to int before the shift.
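
For the last point, a minimal scalar sketch of what the special case could look like for a byte lane. The names are hypothetical, and I am only using counts in 0..7 here since the count masking for sub-int lanes is not decided yet:

```java
// Scalar model only; class and method names are hypothetical.
final class ByteLogicalShiftDemo {
    // Plain Java: b is sign-extended to int, shifted, then narrowed to byte.
    static byte promotedShr(byte b, int count) {
        return (byte) (b >>> count);
    }

    // Lane-wise model: drop the sign extension first so zeros are shifted in.
    static byte laneShr(byte b, int count) {
        return (byte) ((b & 0xff) >>> count);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.println(promotedShr(b, 1)); // copies of the sign bit leak in
        System.out.println(laneShr(b, 1));     // zeros are shifted in
    }
}
```

The two methods agree for non-negative bytes and differ only when the sign bit is set, so the tests need negative lane values to catch the mismatch.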

Bit rotating:

- It does indeed seem that the byte and short implementations are incorrect.
- Intrinsifying these should be possible by implementing them with the Vector API as a combination of logical left shift and logical right shift, plus a final OR to combine the results. So the main question that remains is whether byte and short shifts are supported in the API. If yes, then rotates can be supported. Otherwise, no.
- A mask variant should probably be added for completeness, since most other API methods seem to have a masked variant. In general, we currently support all mask variants using blends, but in the future we may do it with actual masks.
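
As a scalar sketch of the byte case (hypothetical names, and the choice to mask the rotate count with 7 is my assumption, not something the spec says yet), the rotate built from two logical shifts and an OR looks like this next to the upcast-and-truncate version:

```java
// Scalar model only; class and method names are hypothetical.
final class ByteRotateDemo {
    // Upcast to int, rotate across 32 bits, then truncate: loses the wrapped bits.
    static byte upcastRotateLeft(byte b, int count) {
        return (byte) Integer.rotateLeft(b, count);
    }

    // 8-bit rotate from a left shift, a logical right shift, and a final OR.
    // Assumption: the count is reduced modulo the lane width (count & 7).
    static byte laneRotateLeft(byte b, int count) {
        int v = b & 0xff; // zero-extend so no sign bits get shifted in
        int n = count & 7;
        return (byte) ((v << n) | (v >>> (8 - n)));
    }

    public static void main(String[] args) {
        byte b = 0x40;
        System.out.println(upcastRotateLeft(b, 2)); // bit 6 moves to bit 8 and is lost
        System.out.println(laneRotateLeft(b, 2));   // bit 6 wraps around to bit 0
    }
}
```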

Reduction:

- On whether we need mask variants: to me it does appear that, for completeness and consistency, we probably should have them. The implementation would indeed be what you pasted, where the identity depends on the operation (zero for adds, one for multiplies).
- On ordering, I think two options seem okay for now: sequential from left to right, and unordered, which would be less well defined. An option that actually defines the ordering in some other manner would definitely be quite a bit more complex to implement.
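
Here is a scalar sketch of both points, with arrays standing in for vectors. All names are hypothetical, I am assuming that a set mask bit means the lane takes part in the reduction, and the pairwise tree is just one example of what an unordered implementation might do:

```java
// Scalar model only; arrays stand in for vectors and all names are hypothetical.
final class ReductionDemo {
    // Masked add: lanes that are not selected contribute the identity 0.
    static int maskedAddAll(int[] lanes, boolean[] mask) {
        int acc = 0;
        for (int i = 0; i < lanes.length; i++) {
            acc += mask[i] ? lanes[i] : 0;
        }
        return acc;
    }

    // Masked multiply: lanes that are not selected contribute the identity 1.
    static int maskedMulAll(int[] lanes, boolean[] mask) {
        int acc = 1;
        for (int i = 0; i < lanes.length; i++) {
            acc *= mask[i] ? lanes[i] : 1;
        }
        return acc;
    }

    // Ordered reduction: strictly sequential, left to right.
    static float sequentialSum(float[] lanes) {
        float acc = 0f;
        for (float lane : lanes) {
            acc += lane;
        }
        return acc;
    }

    // One possible unordered strategy: a pairwise tree.
    // Assumes the lane count is a power of two.
    static float pairwiseSum(float[] lanes) {
        float[] tmp = lanes.clone();
        for (int n = tmp.length; n > 1; n /= 2) {
            for (int i = 0; i < n / 2; i++) {
                tmp[i] = tmp[2 * i] + tmp[2 * i + 1];
            }
        }
        return tmp[0];
    }

    public static void main(String[] args) {
        int[] lanes = {1, 2, 3, 4};
        boolean[] mask = {true, false, true, false};
        System.out.println(maskedAddAll(lanes, mask));
        System.out.println(maskedMulAll(lanes, mask));

        // Rounding makes float addition order-sensitive.
        float[] f = {1e8f, 1f, -1e8f, 1f};
        System.out.println(sequentialSum(f) + " vs " + pairwiseSum(f));
    }
}
```

With the float input in main the two sums come out different, which is why the unordered option has to be specified as something that may not match the default.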

Anyway, thanks again for looking deeply at these issues. Let me know what you think about my answers, especially around shifts since we have a bit of fixing to do there.

Thanks!

--Razvan

-----Original Message-----
From: panama-dev [mailto:panama-dev-bounces at openjdk.java.net] On Behalf Of Paul Sandoz
Sent: Monday, May 14, 2018 5:56 PM
To: panama-dev at openjdk.java.net
Subject: [vector] some notes on bitwise and reduction operations

Hi,

While fleshing out the JavaDoc i made some notes on the bitwise and reduction operations:

Bit shifting:

- Will shifting byte and short vectors by a scalar be supported? 
If so we need to be careful with the specification. In Java a byte or short is promoted to an int value before shifting. If we want to be equivalent in Java we may need to specify this as equivalent to casting to an int vector then down casting. But i am unsure what the vector hardware instructions do.

- In Java the int and long primitive shifting operations take the first 5 or 6 bits of the value to shift. What do the vector hardware instructions do? I hope/suspect the same but need to double check...

Bit rotating:

- Will these be made intrinsic? Currently i specified them for int and long (via the Integer/Long.rotateLeft/Right methods), and they are incorrect for byte and short, due to upcasting to int and then truncating on a downcast.

- Do we require mask variants?

Reduction:

- Do we require mask variants? The implementation can blend in the identity value for masked lanes.

  v.blend(SPECIES.broadcast(identity), mask).addAll();

- I am still noodling on the best way to specify reductive operations for floating point vectors.
I am leaning towards specifying that the default be lane elements are processed in sequential order from left to right.
An option would be required to declare that order is not important and an undefined algorithm can be utilized that may leverage data parallelism but could produce different results to the default, and may not be stable over releases.

Paul.