Re: Question about RegMask::is_aligned_sets()

Sat Mar 13 19:38:35 UTC 2021

Thanks, Vladimir, you've been a big help!

Cheers,

- Corey

On 3/12/21 1:36 AM, Vladimir Ivanov wrote:
> 
> 
> On 12.03.2021 01:08, Corey Ashford wrote:
>> Thanks for your reply, Vladimir.  A few more questions below when you 
>> have the chance :)
>>
>> On 3/11/21 12:10 AM, Vladimir Ivanov wrote:
>>>
>>>
>>> On 11.03.2021 06:13, Corey Ashford wrote:
>>>> Hello Vladimir,
>>>>
>>>> Currently I'm looking at the register mask defined for the 
>>>> Power64-LE machine.  There are 64 128-bit registers defined via 
>>>> reg_def, e.g.:
>>>>
>>>> reg_def VSR25 ( SOC, SOC, Op_VecX, 25, NULL);
>>>>
>>>> But only the 20 of those vector registers that are declared as part 
>>>> of "reg_class vs_reg( ... )" end up with a mask in the generated 
>>>> ad_ppc_expand.cpp source file, and further, each of those registers 
>>>> is allocated just a single bit in the register mask:
>>>>
>>>> const RegMask _VS_REG_mask( 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffff00, 0x0, 
>>>> 0x0, 0x0, 0x0 );
>>>>
>>>> I would have expected that since Op_VecX is a 128-bit type, it would 
>>>> have received four bits per register in the mask.
>>>>
>>>>
>>>> On x86, each of the 512-bit vector registers is declared using 16 
>>>> Op_RegF register declarations (which makes sense - 16 x 32 = 512), 
>>>> but on aarch64, which can have up to a 1024-bit vector register, 
>>>> vector registers are declared using just 8 x Op_RegF (8 x 32 = 256 
>>>> bits). There is an extensive comment in the aarch64.ad about this, 
>>>> but it seems to imply that the 32-bits-per-slot rule is not rigid 
>>>> (just as on PPC64)
>>>
>>> As you noted, there's no relation between register mask (1 slot per 
>>> register definition) and ideal register (VecX et al) chosen. You have 
>>> to declare as many registers as there are slots in the value to get 
>>> it working.
>>
>> So we should be declaring four register slots per vector register, 
>> instead of one, right?  I'm a bit worried as to that screwing up the 
>> existing implementation for vector register allocation.  I will 
>> experiment to see what happens.
> 
> Take a look at how x86 handles that:
> 
>    
> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L62
> 
>>
>>> But, as part of SVE support, Arm came up with special VecA ideal 
>>> register which represents a variable-length vector.
>>>
>>> Considering what you wrote about Power64-LE, I'd avoid VecA and just 
>>> provide additional register definitions.
>>>
>>>> As to your comment about not needing to use a vector register for 
>>>> the boolean vectors, that's quite interesting.  So for all vector 
>>>> types except for tiny integers, I should be able to use a 64-bit 
>>>> general purpose register.  I'm so new to all of this that I'm not 
>>>> clear how easy it will be to mix and match register types like this, 
>>>> but I will start experimenting with the idea.
>>>
>>> Can you elaborate how masks are represented in Power64-LE ISA?
>>>
>>
>> There are no special mask/predicate registers.  Masking is done via 
>> the uppermost bit of each corresponding element in a vector register 
>> (the other bits are ignored).  Conversion from an array of boolean 
>> bytes of 0|1 is simple: the boolean bytes of the mask are first 
>> shuffled (via the vperm instr. on ppc64) into each element, then 
>> arithmetically negated, producing 0|-1, which means either all 0's or 
>> all 1's in each element, which effectively sets or clears the upper 
>> bit as desired.
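
A minimal scalar sketch of the negate step described above, in Java rather than PPC64 assembly. It only models the arithmetic (0|1 bytes widened to 16-bit lanes and negated to 0|-1); the vperm shuffle and the vector instructions themselves are not modelled, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class MaskNegateSketch {
    // Widen each boolean byte (0 or 1) to a 16-bit lane and negate it:
    // 0 stays 0, 1 becomes -1 (all ones), so the top bit of the lane is set.
    static short[] toLaneMask(byte[] bools) {
        short[] lanes = new short[bools.length];
        for (int i = 0; i < bools.length; i++) {
            lanes[i] = (short) -bools[i];
        }
        return lanes;
    }

    // Only the uppermost bit of a lane decides whether it is selected.
    static boolean laneSelected(short lane) {
        return (lane & 0x8000) != 0;
    }

    public static void main(String[] args) {
        short[] lanes = toLaneMask(new byte[] {0, 1, 1, 0});
        System.out.println(Arrays.toString(lanes)); // [0, -1, -1, 0]
        System.out.println(laneSelected(lanes[1])); // true
    }
}
```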
> 
> That sounds very similar to how masks are represented in AVX/AVX2 on x86.
> 
>>> If there are special predicate registers, then you can rely on what 
>>> Intel and Arm folks are working on for AVX-512 and SVE support.
>>>
>>> On x86 AVX/AVX2 ISA, masks occupy wide vector register. So, a mask of 
>>> 4 elements for vector of ints (128-bit) occupies 128-bit vector. And, 
>>> currently, ideal vector nodes follow that representation: mask has 
>>> the same ideal type as the value it is applied to (e.g., 
>>> vectorx[4]{int} and not a vector of 4 booleans).
>>
>> The mask is represented in Java as a byte array, so there has to be 
>> a conversion from the byte array in memory to a vector mask (in this 
>> case a VecX). So it seems that conversion can be done "locally" within 
>> node matching code in ppc.ad, and that the mask representation isn't 
>> ever seen as a different type for operand matching.  If that's the 
>> case, the fog is lifting a little.
> 
> There are special ideal nodes inserted to convert vector masks between 
> in-memory and in-register representations: VectorLoadMask and 
> VectorStoreMask.
> 
> Best regards,
> Vladimir Ivanov
> 
> 
> 
>>
>> Thank you,
>>
>> - Corey
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>
>>>> On 3/5/21 3:19 PM, Vladimir Ivanov wrote:
>>>>> Hi Corey,
>>>>>
>>>>>> I'd like to understand the concept of "aligned sets" in RegMask.  
>>>>>> I believe I understand the RegMask idea overall, but I don't 
>>>>>> understand the idea of alignment of sets (actually the concept of 
>>>>>> sets in this context is also fuzzy).  I've looked at the code that 
>>>>>> implements is_aligned_sets, and I just can't yet seem to grok what 
>>>>>> requirement it is trying to verify.  I read RegMask.hpp's comments 
>>>>>> on the method prototype, and it didn't help me much, I'm afraid.  
>>>>>> If someone could give a paragraph or two of explanation, I'd 
>>>>>> really appreciate it.
>>>>>
>>>>> A register in RegMask is comprised of packed bits each representing 
>>>>> a 32-bit slot. So, a VecX register occupies 4 bits (128 = 4 x 32) 
>>>>> while VecZ needs 16 (512 = 16 x 32).
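
One way to picture an "aligned set", as a sketch only: this is a simplified model over a single 64-bit word, based on my reading of the explanation above, and not HotSpot's actual RegMask code. It assumes a set is a run of `size` adjacent slot bits starting at a multiple of `size`, and that each such group must be either fully present or fully absent. The class and method names are hypothetical.

```java
class AlignedSetsSketch {
    // True if every aligned group of `size` bits in `mask` is either all
    // ones or all zeros. `size` is assumed to be a power of two below 64.
    static boolean isAlignedSets(long mask, int size) {
        long group = (1L << size) - 1;
        for (int shift = 0; shift < Long.SIZE; shift += size) {
            long bits = (mask >>> shift) & group;
            if (bits != 0 && bits != group) {
                return false; // partially populated group
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final int slotsPerVecX = 4; // 128 bits / 32 bits per slot
        System.out.println(isAlignedSets(0xF0L, slotsPerVecX)); // true
        System.out.println(isAlignedSets(0x3CL, slotsPerVecX)); // false
        System.out.println(isAlignedSets(0x10L, slotsPerVecX)); // false
    }
}
```

Under this model, 0x3C fails because its four bits straddle two groups, and 0x10 fails because only one slot of a four-slot register is present.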
>>>>>
>>>>> Some code relies on the alignment when recovering base register 
>>>>> from VMReg:
>>>>>
>>>>> https://github.com/openjdk/jdk/blob/e1cad97049642ab201d53ff608937f7e7ef3ff3e/src/hotspot/cpu/x86/registerMap_x86.cpp#L29
>>>>>
>>>>>
>>>>> src/hotspot/cpu/x86/registerMap_x86.cpp
>>>>>
>>>>> address RegisterMap::pd_location(VMReg reg) const {
>>>>>   if (reg->is_XMMRegister()) {
>>>>>     int reg_base = reg->value() - ConcreteRegisterImpl::max_fpr;
>>>>>     int base_reg_enc = (reg_base / XMMRegisterImpl::max_slots_per_register);
>>>>>     assert(base_reg_enc >= 0 && base_reg_enc < XMMRegisterImpl::number_of_registers, "invalid XMMRegister: %d", base_reg_enc);
>>>>>     VMReg base_reg = as_XMMRegister(base_reg_enc)->as_VMReg();
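
The division in the excerpt above only works if each register's slots start at a multiple of the slot count. A small Java sketch of that arithmetic, assuming 16 slots per register (a 512-bit register in 32-bit slots) and a slot index that has already had the register-file base subtracted; the names are hypothetical and this is not the HotSpot code itself.

```java
class BaseRegisterSketch {
    static final int SLOTS_PER_REGISTER = 16; // assumed: 512 bits / 32 bits per slot
    static final int BYTES_PER_SLOT = 4;

    // Register encoding that owns the given slot.
    static int baseRegisterEncoding(int slotIndex) {
        return slotIndex / SLOTS_PER_REGISTER;
    }

    // Position of the slot inside its register.
    static int slotWithinRegister(int slotIndex) {
        return slotIndex % SLOTS_PER_REGISTER;
    }

    // Byte offset of the slot from the start of its register.
    static int byteOffset(int slotIndex) {
        return slotWithinRegister(slotIndex) * BYTES_PER_SLOT;
    }

    public static void main(String[] args) {
        int slot = 37;
        System.out.println(baseRegisterEncoding(slot)); // 2
        System.out.println(slotWithinRegister(slot));   // 5
        System.out.println(byteOffset(slot));           // 20
    }
}
```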
>>>>>
>>>>>> We have started working on adding support to the PPC64-LE hotspot 
>>>>>> code for the Vector API.  In order to support Vector Masks, it 
>>>>>> seems we need to change our current support for fixed-length, 
>>>>>> 128-bit vectors to something that can be as short as two booleans. 
>>>>>> To do that we have changed the function min_vector_size in 
>>>>>> hotspot/cpu/ppc.ad to return 2 when the type is T_BOOLEAN, 
>>>>>> otherwise it still returns 16.
>>>>>>
>>>>>> My first task was to add support for vector masks, and so I added 
>>>>>> a new instruct to cpu/ppc/ppc.ad to match VectorLoadMask, which 
>>>>>> then necessitated adding some instructs for LoadVector and 
>>>>>> StoreVector of the appropriate lengths.
>>>>>
>>>>> I don't know much about PPC64-LE, but you don't have to use boolean 
>>>>> vectors. FTR masks have the same type as the vectors they are 
>>>>> applied to. Until recently (when work on predicated registers 
>>>>> started), it was the only mask representation in Ideal IR.
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>> I have a test case that loads a vector mask for a vector of shorts:
>>>>>>
>>>>>> import jdk.incubator.vector.ShortVector;
>>>>>> import jdk.incubator.vector.VectorSpecies;
>>>>>> import jdk.incubator.vector.VectorMask;
>>>>>> import java.util.Random;
>>>>>>
>>>>>>
>>>>>> class TestVectorMaskShort {
>>>>>>    private static final VectorSpecies<Short> SPECIES = ShortVector.SPECIES_128;
>>>>>>
>>>>>>    public static VectorMask<Short> test(boolean[] bary) {
>>>>>>        VectorMask<Short> vmask = VectorMask.fromArray(SPECIES, bary, 0);
>>>>>>        return vmask;
>>>>>>    }
>>>>>>
>>>>>>    public static void main(String args[]) {
>>>>>>      Random ran = new Random(100);
>>>>>>      int counter = 0;
>>>>>>      boolean[] bary = new boolean[8];
>>>>>>      for (int i = 0; i < 20_000; i++) {
>>>>>>        for (int j = 0; j < bary.length; j++) {
>>>>>>          bary[j] = ran.nextBoolean();
>>>>>>        }
>>>>>>        VectorMask<Short> vmask = test(bary);
>>>>>>        if (vmask.allTrue()) {
>>>>>>          counter++;
>>>>>>        }
>>>>>>      }
>>>>>>      System.out.printf("counter = %d\n", counter);
>>>>>>    }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> When I run this test case, I get a runtime error:
>>>>>>
>>>>>> #  Internal Error (/home/cjashfor/git-trees/jdk/src/hotspot/share/opto/chaitin.cpp:951), pid=1341588, tid=1341601
>>>>>> #  assert(lrgmask.is_aligned_sets(RegMask::SlotsPerVecX)) failed: vector should be aligned
>>>>>>
>>>>>>
>>>>>> - Corey
>>>>>>
>>>>>> Corey Ashford
>>>>>> Software Engineer
>>>>>> IBM Systems, LTC OpenJDK team
>>>>>>
>>>>>> IBM