Intrinsics

Tue Sep 18 15:53:23 UTC 2018

It might make sense to have not only a "projection" node that splits a 
128-bit value into two 64-bit values, but also an explicit "fuse" node 
that combines two 64-bit values into one 128-bit value. I assume that in 
most cases you will have not a single arithmetic node, but an 
expression tree of 128-bit arithmetic nodes. You need to rely on either 
escape analysis or read elimination to get rid of the intermediate array 
stores and loads. With explicit "fuse" nodes, you can then more easily 
remove the projections and fuses that initially sit between arithmetic 
nodes, i.e., you end up with an expression tree in high-level Graal 
IR that consists only of 128-bit arithmetic nodes, and only the final 
result needs a projection.
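
To make the idea concrete, here is a toy sketch in plain Java. It is not 
Graal code: Node, Param, Project, Fuse, Add128, canonicalize and the 
other names are all made up for illustration, and a real implementation 
would presumably hook into Graal's canonicalization instead. It only 
shows the two rewrites I have in mind: a fuse of the low and high 
projections of the same node collapses to that node, and a projection of 
a fuse collapses to the matching input.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Toy IR for illustration only; none of these classes exist in Graal.
final class Fuse128Sketch {
    static abstract class Node {}

    // A 64-bit input value.
    static final class Param extends Node {
        final String name;
        Param(String name) { this.name = name; }
    }

    // Projection: selects the low or high 64-bit half of a 128-bit node.
    static final class Project extends Node {
        final Node wide;
        final boolean high;
        Project(Node wide, boolean high) { this.wide = wide; this.high = high; }
    }

    // Fuse: combines two 64-bit nodes into one 128-bit node.
    static final class Fuse extends Node {
        final Node low;
        final Node high;
        Fuse(Node low, Node high) { this.low = low; this.high = high; }
    }

    // 128-bit addition; both inputs are 128-bit nodes.
    static final class Add128 extends Node {
        final Node x;
        final Node y;
        Add128(Node x, Node y) { this.x = x; this.y = y; }
    }

    static Node canonicalize(Node n) {
        return canonicalize(n, new IdentityHashMap<Node, Node>());
    }

    // The memo keeps shared inputs shared, so the identity check below is valid.
    private static Node canonicalize(Node n, Map<Node, Node> memo) {
        Node cached = memo.get(n);
        if (cached != null) {
            return cached;
        }
        Node result = n;
        if (n instanceof Add128) {
            Add128 a = (Add128) n;
            result = new Add128(canonicalize(a.x, memo), canonicalize(a.y, memo));
        } else if (n instanceof Fuse) {
            Fuse f = (Fuse) n;
            Node lo = canonicalize(f.low, memo);
            Node hi = canonicalize(f.high, memo);
            result = new Fuse(lo, hi);
            if (lo instanceof Project && hi instanceof Project) {
                Project pl = (Project) lo;
                Project ph = (Project) hi;
                // Fuse(low(w), high(w)) is just w.
                if (pl.wide == ph.wide && !pl.high && ph.high) {
                    result = pl.wide;
                }
            }
        } else if (n instanceof Project) {
            Project p = (Project) n;
            Node w = canonicalize(p.wide, memo);
            if (w instanceof Fuse) {
                // low(Fuse(l, h)) is l, high(Fuse(l, h)) is h.
                result = p.high ? ((Fuse) w).high : ((Fuse) w).low;
            } else {
                result = new Project(w, p.high);
            }
        }
        memo.put(n, result);
        return result;
    }

    // Counts projection nodes along every path from the root.
    static int countProjects(Node n) {
        if (n instanceof Project) {
            return 1 + countProjects(((Project) n).wide);
        } else if (n instanceof Fuse) {
            return countProjects(((Fuse) n).low) + countProjects(((Fuse) n).high);
        } else if (n instanceof Add128) {
            return countProjects(((Add128) n).x) + countProjects(((Add128) n).y);
        }
        return 0;
    }

    // Graph for add128(add128(a, b), c), assuming the intermediate array
    // stores and loads have already been eliminated.
    static Node chainedAdd() {
        Node inner = new Add128(new Fuse(new Param("aLow"), new Param("aHigh")),
                                new Fuse(new Param("bLow"), new Param("bHigh")));
        Node refused = new Fuse(new Project(inner, false), new Project(inner, true));
        return new Add128(refused, new Fuse(new Param("cLow"), new Param("cHigh")));
    }

    public static void main(String[] args) {
        Node before = chainedAdd();
        Node after = canonicalize(before);
        System.out.println("projections before: " + countProjects(before));
        System.out.println("projections after:  " + countProjects(after));
    }
}
```

In this toy model the chained add starts with two projections between 
the arithmetic nodes and ends with none, so the inner Add128 feeds the 
outer one directly; a projection would then only be needed for the final 
result.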

-Christian

On 09/18/2018 08:24 AM, Gilles Duboscq wrote:
> Hi Martin,
> 
> One way to do that is to combine both solutions and use "projection" nodes to model the "multiple outputs" part while still emitting the whole code sequence at once.
> 
> In the graph you have
> ```
> add128Node = Add128Node(long low1, long high1, long low2, long high2)
> result[0] = Add128LowNode(add128Node) // this is a projection of Add128Node
> result[1] = Add128HighNode(add128Node) // this is a projection of Add128Node
> return = Add128CarryNode(add128Node) // this is a projection of Add128Node
> ```
> 
> In `Add128Node.generate`, you will need to generate a LIR Op that has 3 results:
> 
> ```
> class Add128Op extends AMD64LIRInstruction {
>    @Use({REG, STACK}) protected AllocatableValue low1; // TODO might need HINTs
>    @Use({REG, STACK}) protected AllocatableValue low2;
>    @Use({REG, STACK}) protected AllocatableValue high1;
>    @Use({REG, STACK}) protected AllocatableValue high2;
> 
>    @Def({REG}) protected AllocatableValue lowResult;
>    @Def({REG}) protected AllocatableValue highResult;
>    @Def({REG}) protected AllocatableValue carryResult;
>    ...
>    void emitCode(CompilationResultBuilder crb, AMD64MacroAssembler masm) {
>      // see AMD64Binary.CommutativeTwoOp#emitCode
>      AllocatableValue lowInput;
>      if (sameRegister(lowResult, low2)) {
>        lowInput = low1;
>      } else {
>        AMD64Move.move(crb, masm, lowResult, low1);
>        lowInput = low2;
>      }
>      // TODO deal with stack vs reg etc.
>      masm.add(asRegister(lowResult), asRegister(lowInput));
>      // TODO setup highInput, stack vs reg etc.
>      masm.adc(asRegister(highResult), asRegister(highInput));
>      AMD64ControlFlow.cmove(crb, masm, carryResult, false, ConditionFlag.CarrySet, false,
>        new ConstantValue(toRegisterKind(AMD64Kind.BYTE), JavaConstant.forBoolean(true)),
>        new ConstantValue(toRegisterKind(AMD64Kind.BYTE), JavaConstant.forBoolean(false)));
>    }
> }
> ```
> 
> During `Add128Node.generate`, remember the values you used for `lowResult`, `highResult`, and `carryResult`:
> 
> ```
>    AllocatableValue low1Value = tool.operand(low1);
>    ...
>    this.lowResultValue = tool.getLIRGeneratorTool().newVariable(LIRKind.value(AMD64Kind.QWORD));
>    ...
>    tool.setResult(this, tool.getLIRGeneratorTool().append(new Add128Op(
>      low1Value, low2Value, high1Value, high2Value,
>      lowResultValue, highResultValue, carryResultValue)));
> 
> ```
> 
> In `Add128LowNode.generate`, just do: `tool.setResult(this, getAdd128Node().getLowResultValue());`
> 
> I hope that helps.
>   Gilles
> 
> On 14/09/18 20:17, Martin Traverso wrote:
>> Hi,
>>
>> I'm playing around with Graal, and as an experiment, I'm trying to see what
>> it would take to intrinsify some operations to do math on 128-bit values.
>>
>> I have a method with the following signature:
>>
>>      boolean add128(long low1, long high1, long low2, long high2, long[] result)
>>
>> It computes the sum of two 128-bit integers encoded in two longs each and
>> stores the result in the 2-element array that's provided via the last
>> argument. It returns true if the sum overflows.
>>
>> I'd like to emit the equivalent of the following assembly pseudocode:
>>
>>     result[0] = ADD low1 low2
>>     result[1] = ADC high1 high2
>>     return = (carry == 1)
>>
>> From what I gathered so far, I should add a new node (e.g., Add128Node) and
>> register a graph builder plugin that swaps invocations of that
>> method with the new node.
>>
>> But that's where I'm getting stuck. Two paths I've started exploring:
>> 1. Lower the Add128Node into operations that perform the sums of the high
>> vs low parts (e.g., Add128LowNode, Add128HighNode), do the assignments to
>> the resulting array, etc. This would seem to require modeling operations
>> that produce multiple outputs (low + low produces one value + carry). Is
>> this even possible?
>> 2. Make Add128Node LIRLowerable and generate the whole sequence of
>> low-level operations in one shot. I'm not sure how the assignments to the
>> output array and return value would fit here, though.
>>
>> I'm sure I'm missing something obvious, so I appreciate any pointers or
>> suggestions. Are there similar examples I can draw inspiration from?
>>
>> Thanks,
>> Martin
>>
>