<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Aptos;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:12.0pt;
font-family:"Aptos",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
font-size:12.0pt;
font-family:"Aptos",sans-serif;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Aptos",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:1435442340;
mso-list-type:hybrid;
mso-list-template-ids:-986293302 -238627866 1074331651 1074331653 1074331649 1074331651 1074331653 1074331649 1074331651 1074331653;}
@list l0:level1
{mso-level-start-at:0;
mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;
mso-fareast-font-family:Aptos;
mso-bidi-font-family:"Times New Roman";}
@list l0:level2
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Courier New";}
@list l0:level3
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;}
@list l0:level4
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Symbol;}
@list l0:level5
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Courier New";}
@list l0:level6
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;}
@list l0:level7
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Symbol;}
@list l0:level8
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Courier New";}
@list l0:level9
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;}
@list l1
{mso-list-id:1755472652;
mso-list-type:hybrid;
mso-list-template-ids:831272070 -2053889086 1074331651 1074331653 1074331649 1074331651 1074331653 1074331649 1074331651 1074331653;}
@list l1:level1
{mso-level-start-at:0;
mso-level-number-format:bullet;
mso-level-text:-;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Aptos",sans-serif;
mso-fareast-font-family:Aptos;
mso-bidi-font-family:"Times New Roman";}
@list l1:level2
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Courier New";}
@list l1:level3
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;}
@list l1:level4
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Symbol;}
@list l1:level5
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Courier New";}
@list l1:level6
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;}
@list l1:level7
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Symbol;}
@list l1:level8
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:"Courier New";}
@list l1:level9
{mso-level-number-format:bullet;
mso-level-text:;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;
font-family:Wingdings;}
ol
{margin-bottom:0cm;}
ul
{margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-IN" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Hi Quan,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">The Vector API library currently includes the following APIs, which can be used to detect exceptional indexes or wrap them into the valid vector index range up front, rather than waiting for that conversion to be performed just
before the shuffle is used by a shuffle-consuming operation.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<ul style="margin-top:0cm" type="disc">
<li class="MsoListParagraph" style="margin-left:0cm;mso-list:l1 level1 lfo1"><span style="mso-fareast-language:EN-US">VectorShuffle.wrapIndexes() - conversion of exceptional indexes
<o:p></o:p></span></li><li class="MsoListParagraph" style="margin-left:0cm;mso-list:l1 level1 lfo1"><span style="mso-fareast-language:EN-US">VectorShuffle.checkIndexes() - detection of exceptional indexes.
<o:p></o:p></span></li></ul>
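<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">To make the two behaviours concrete, here is a minimal sketch in plain Java that models the index arithmetic only. It does not call the Vector API: the class ShuffleIndexModel and its helper methods are hypothetical names, and the partial wrap to [-VLENGTH, -1] is assumed from the current shuffle creation model described in the quoted thread below.<o:p></o:p></span></p>
<pre>
```java
// Plain-Java model of shuffle index handling (assumption: this mirrors the
// documented wrapping arithmetic; it does not use jdk.incubator.vector).
class ShuffleIndexModel {
    // Full wrap: forces any index into [0, vlength - 1].
    static int wrapIndex(int index, int vlength) {
        return Math.floorMod(index, vlength);
    }

    // Partial wrap: valid indexes are kept as they are, while
    // out-of-range indexes become exceptional values in [-vlength, -1].
    static int partiallyWrap(int index, int vlength) {
        int wrapped = Math.floorMod(index, vlength);
        return wrapped == index ? index : wrapped - vlength;
    }

    // Detection: returns the index if it is valid, otherwise throws.
    static int checkIndex(int index, int vlength) {
        if (Math.floorMod(index, vlength) != index) {
            throw new IndexOutOfBoundsException("exceptional index " + index);
        }
        return index;
    }

    public static void main(String[] args) {
        int vlength = 8; // hypothetical lane count
        int[] raw = {0, 7, 8, 13, -1};
        for (int index : raw) {
            int partial = partiallyWrap(index, vlength);
            // Wrapping the partially wrapped value gives the same result
            // as wrapping the raw index directly.
            int full = wrapIndex(partial, vlength);
            System.out.println(index + " partial=" + partial + " wrapped=" + full);
        }
    }
}
```
</pre>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">With a lane count of 8 in this model, index 13 is stored as -3 by the partial wrap and becomes 5 after a full wrap.<o:p></o:p></span></p>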
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">If a shuffle is loop-variant, then performing the in-range wrapping once may benefit all successive rearranges that consume it by saving an additional OOB index comparison, so what you are proposing is a
strict form of rearrange which always receives wrapped indexes. But we can always rely on C2 GVN to perform the sharing automatically and weed out the redundant comparisons, as they are done up front, before the inline expansion of rearrange. If every rearrange in the
loop consumes a different loop-variant shuffle, then obviously we will not reap any benefit.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Recently we adopted wrapping as the default semantics for both forms of rearrange, i.e. the single-vector and two-vector operations. Earlier, exceptional indexes were not acceptable, and we used to throw an IndexOutOfBoundsException
on encountering them.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal">> An important question is that what we should do with the hypothetical 2048-bit byte VectorShuffles. We cannot wrap those to [0, 2 * VLENGTH - 1] because of implementation limitations. Should then we disallow all operations that would
observe that (toVector, laneSource, 2-operand rearrange)?<o:p></o:p></p>
<p class="MsoListParagraph"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">This is a known limitation of shuffles and will also affect the two-vector rearrange, as the vector shuffle backing storage cannot accommodate such a large index range. Our recent shuffle overhaul targeting
JDK mainline does not address this either. In fact, this was one of our motives for adding a new selectFrom API that operates across two vectors using a lookup index vector and internally wraps its lanes to the valid two-vector index range. Even so, there is no way to create
an index vector (ByteVector) whose lanes hold a value greater than 127. <o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">jshell> ByteVector.broadcast(ByteVector.SPECIES_512, 128)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">| Exception java.lang.IllegalArgumentException: Vector creation failed: value 128 cannot be represented in ETYPE byte; result of cast is -128<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">| at AbstractSpecies.badElementBits (AbstractSpecies.java:383)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">| at ByteVector$ByteSpecies.longToElementBits (ByteVector.java:3863)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">| at ByteVector$ByteSpecies.broadcast (ByteVector.java:3853)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">| at ByteVector.broadcast (ByteVector.java:515)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">| at (#4:1)<o:p></o:p></span></p>
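<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">The range problem can be checked with simple arithmetic. The sketch below is plain Java and makes no Vector API calls; the class ByteIndexRange and its methods are hypothetical names used only for illustration. It computes the largest index a two-vector lookup would need for a given vector size and tests whether that value survives a cast to a signed byte.<o:p></o:p></span></p>
<pre>
```java
// Arithmetic only; no Vector API calls are made here.
class ByteIndexRange {
    // Number of byte lanes in a vector of the given bit size.
    static int byteLanes(int vectorBits) {
        return vectorBits / Byte.SIZE;
    }

    // Largest index a two-vector lookup must address: 2 * VLENGTH - 1.
    static int maxTwoVectorIndex(int vectorBits) {
        return 2 * byteLanes(vectorBits) - 1;
    }

    // True only if the value is unchanged by a cast to a signed byte.
    static boolean fitsInSignedByte(int value) {
        return (byte) value == value;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 512, 1024, 2048}) {
            int max = maxTwoVectorIndex(bits);
            System.out.println(bits + "-bit: max two-vector index " + max
                    + ", fits in a signed byte: " + fitsInSignedByte(max));
        }
        // The same narrowing that the jshell session above rejects.
        System.out.println("(byte) 128 = " + (byte) 128);
    }
}
```
</pre>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Under this model, the 1024-bit and 2048-bit sizes need indexes up to 255 and 511, neither of which is representable in a signed byte lane.<o:p></o:p></span></p>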
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Best Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Jatin <o:p></o:p></span></p>
<p class="MsoListParagraph"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoListParagraph"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm 4.0pt">
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif"> panama-dev <panama-dev-retn@openjdk.org>
<b>On Behalf Of </b>Quân Anh Mai<br>
<b>Sent:</b> Thursday, December 19, 2024 10:33 PM<br>
<b>To:</b> Paul Sandoz <paul.sandoz@oracle.com><br>
<b>Cc:</b> panama-dev@openjdk.java.net<br>
<b>Subject:</b> Re: Improve the efficiency of VectorShuffle usage<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">Thanks a lot for your response,<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Actually, we do not need to wrap at all on construction, a VectorShuffle is a black box, from the creation of a VectorShuffle to its usage there is somewhere in between when we do the wrap, and that is totally enough.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">From your response, I think a more reasonable proposal would be: On VectorShuffle construction, we wrap all indices to [0, 2 * VLENGTH - 1] (instead of the current model of wrapping oob indices to [-VLENGTH, -1]) and when using that VectorShuffle
for a 1-operand rearrange, we wrap the indices to [0, VLENGTH - 1].<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Implementation-wise, we do not do any wrapping on construction, and for operations that observe the VectorShuffle instance, we wrap at those places. This allows us to reduce the number of wrapping to the minimum for the most frequent operations.
For other operations, the wrapping operation is itself cheap and should be GVN-ed.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">An important question is that what we should do with the hypothetical 2048-bit byte VectorShuffles. We cannot wrap those to [0, 2 * VLENGTH - 1] because of implementation limitations. Should then we disallow all operations that would observe
that (toVector, laneSource, 2-operand rearrange)?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Cheers,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Quan Anh<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Thu, 19 Dec 2024 at 03:45, Paul Sandoz <<a href="mailto:paul.sandoz@oracle.com" target="_blank">paul.sandoz@oracle.com</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
<p class="MsoNormal" style="margin-bottom:12.0pt">Hi Quan,<br>
<br>
I believe at the moment the selectFrom operations are currently more optimal. Even though we specify behavior in terms of rearrange we currently optimize differently. Ideally selectFrom would be implemented using the rearrange expression and that would be optimal
for such use.<br>
<br>
In our prior iteration we focused on changing the behavior of the selectFrom and rearrange operations, and left out the shuffle conversions and factories knowing that we would need to address them later. It’s good you are bringing this topic up. John, I, and
others had some prior discussions on what to do regarding identifying wrapped and partially wrapped state of a shuffle, but IIRC nothing definitive was concluded as to how to do this, e.g., something internal, such as an internal type or some state on shuffle.<br>
<br>
Regardless of the underlying representation and optimizing approach I think we need to have factories that by default wrap and/or consistently accept a wrapping argument.<br>
<br>
IIRC we discussed whether it was possible to implicitly detect in C2 if a shuffle is (fully) wrapped on construction and therefore we don’t need to rewrap on use. That might work for the loop-variant case you mention? but I imagine is harder if we use constant
shuffles where it would be beneficial to clearly identify the representation in compiled code (possibly to hoist out of loops?).<br>
<br>
We did not discuss a separation of public types, nor a distinction of use of a single type between the two cases. I admit to not being fond of either and would prefer a solution where shuffles could be used in either case regardless of their wrapping status
in line to what is currently specified.<br>
<br>
<br>
I wonder if we will ever encounter 2048-bit size vectors on ARM cores in the industry, it’s still early days but convergence around 128-bit and maybe 256-bit seems to be the norm as vendors balance core size and power consumption. Perhaps it may be so for more
specialized hardware, although it’s more likely in that case one would encounter a GPU or a tensor/matrix core instead. We shall see! However, that’s not to say we should ignore it, but IMO we should for now avoid giving it undue prominence.
<br>
<br>
<br>
Separately, your proposal on the shuffle conversions and factories addresses the fact that Float/DoubleVector sits awkwardly with shuffles; such use likely indicates incorrect use. Java's type system is unfortunately not expressive enough to express the constraints
we want, so limiting conversion/construction to integral species seems like a good idea, with casting for use with floating-point vectors (as is anyway required in some cases for masks).<br>
<br>
Paul.<br>
<br>
<br>
> On Dec 17, 2024, at 11:27<span style="font-family:"Arial",sans-serif"> </span>PM, Quân Anh Mai <<a href="mailto:anhmdq@gmail.com" target="_blank">anhmdq@gmail.com</a>> wrote:<br>
> <br>
> Hi,<br>
> <br>
> I want to discuss how to improve the efficiency of creating and using a VectorShuffle.<br>
> <br>
> Currently, when a VectorShuffle is created, it will try to wrap all out-of-bound indices into the interval [-VLENGTH, -1]. However, this is useless most of the time, as the most frequent operation with a VectorShuffle, rearrange on a single vector, will wrap
all indices to the interval [0, VLENGTH - 1] regardless. This may be noticeable in look up algorithms such as UTF-8 validation, as in those algorithms, the VectorShuffle is a loop-variant and will be computed in each iteration. This is a consequence of the
fact that VectorShuffle is used for 2 operations: 1-operand rearrange and 2-operand rearrange.<br>
> <br>
> As a result, I propose we add a field to VectorShuffle to discriminate instances which are used for 1-operand rearrange and instances which are used for 2-operand rearrange. For instances which are created for 1-operand rearrange, all indices are wrapped
to [0, VLENGTH - 1] while for ones which are for 2-operand rearrange, the interval for indices to wrap to is [0, 2 * VLENGTH - 1]. Instances which are created for 1 operation must not be used for the other.<br>
> <br>
> This distinction is preferable to just changing the VectorShuffle creation semantics so that elements are wrapped to the interval [0, 2 * VLENGTH - 1], for 3 reasons:<br>
> <br>
> - It aligns the wrapping in VectorShuffle creation with the wrapping in VectorShuffle usage, which helps reduce 1 unnecessary wrapping.<br>
> - It is necessary to support 2048-bit SVE rearrange. As for those, a 1-operand byte rearrange is sensible while a 2-operand byte rearrange is not (there are 512 elements in a table from 2 vectors, which is larger than the index values themselves). While we
can catch the usage in 2-operand rearrange, the semantics are muddy for other operations such as toVector or toString. This is because we inevitably lose information when casting the elements to the implementation-detail type. It would be confusing if, for
all species, elements are wrapped to the interval [0, 2 * VLENGTH - 1] while suddenly for 2048-bit byte shuffles, the elements are wrapped to the interval [0, VLENGTH - 1]. As a result, forbidding creating such a VectorShuffle in the first place is a more
sensible choice.<br>
> - Using a VectorShuffle for both 1-operand rearranges and 2-operand rearranges is questionable, as they have different semantics. If users have 1 index vector that can sensibly be converted to both a 1-operand shuffle and a 2-operand shuffle,
then performing 2 conversions seems to be the more reasonable thing to do.<br>
> <br>
> API-wise, I propose removing Vector::toShuffle and adding 8 methods:<br>
> <br>
> <T> VectorShuffle<T> ByteVector::toShuffle(VectorSpecies<T> species)<br>
> <T> VectorShuffle<T> ByteVector::toLookUpIndices(VectorSpecies<T> species, int numTable)<br>
> <T> VectorShuffle<T> ShortVector::toShuffle(VectorSpecies<T> species)<br>
> <T> VectorShuffle<T> ShortVector::toLookUpIndices(VectorSpecies<T> species, int numTable)<br>
> <T> VectorShuffle<T> IntVector::toShuffle(VectorSpecies<T> species)<br>
> <T> VectorShuffle<T> IntVector::toLookUpIndices(VectorSpecies<T> species, int numTable)<br>
> <T> VectorShuffle<T> LongVector::toShuffle(VectorSpecies<T> species)<br>
> <T> VectorShuffle<T> LongVector::toLookUpIndices(VectorSpecies<T> species, int numTable)<br>
> <br>
> where species must have the same length as the receiver and numTable must be 1 or 2, XXXVector::toShuffle(species) is equivalent to XXXVector::toLookUpIndices(species, 1)<br>
> <br>
> On the opposite direction, I also propose removing VectorShuffle::toVector and adding VectorShuffle::toVector(VectorSpecies species) where species must be an integral vector with the same length as the receiver.<br>
> <br>
> Alternatively, we can split VectorShuffle into VectorShuffle and VectorLookUpIndices. The former can only be used for 1-operand rearrange while the latter can only be used for 2-operand rearrange. Personally, I prefer this approach as it gives us a stronger
type safety compared to discriminating VectorShuffle based on a field.<br>
> <br>
> Please share your thoughts, thanks a lot,<br>
> Quan Anh<o:p></o:p></p>
</blockquote>
</div>
</div>
</div>
</body>
</html>