Intrinsics

Tue Sep 18 20:27:49 UTC 2018

Thanks guys. This is very helpful. I'll give it a try and report back.

- Martin

On Tue, Sep 18, 2018 at 8:53 AM Christian Wimmer <christian.wimmer at oracle.com> wrote:

> It might make sense to have not only a "projection" node that splits a
> 128-bit value into two 64-bit values, but also an explicit "fuse" node
> that combines two 64-bit values into one 128-bit value. I assume that in
> most cases you are going to have not a single arithmetic node, but an
> expression tree of 128-bit arithmetic nodes. You need to rely on either
> escape analysis or read elimination to get rid of the intermediate array
> stores / loads. With the explicit "fuse" nodes, you can then more easily
> remove the projection and fuse nodes that initially sit between
> arithmetic nodes, i.e., you can end up with an expression tree in
> high-level Graal IR that consists only of 128-bit arithmetic nodes, and
> only the final result needs a projection.
>
> -Christian
>
>
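Christian's suggestion can be illustrated with a toy expression tree. The sketch below is hypothetical: `Expr`, `Leaf`, `Add128`, `Fuse`, `Low`, `High` and `FuseSimplifier` are invented for this example and are not Graal classes, and record `equals` (structural equality) stands in for Graal's node identity. It shows the two cancellation rules that let the intermediate projections disappear: a projection of a fuse yields the corresponding half, and fusing the low and high projections of the same value yields that value. It assumes Java 16 or later (records, pattern matching for `instanceof`).

```java
// Hypothetical mini-IR for illustration only; these are not Graal's node classes.
interface Expr {}

/** An opaque 128-bit value, e.g. a parameter. */
record Leaf(String name) implements Expr {}

/** A 128-bit addition whose inputs are 128-bit values. */
record Add128(Expr left, Expr right) implements Expr {}

/** Combines two 64-bit halves into one 128-bit value (the "fuse" node). */
record Fuse(Expr low, Expr high) implements Expr {}

/** Projection of the low 64 bits of a 128-bit value. */
record Low(Expr value) implements Expr {}

/** Projection of the high 64 bits of a 128-bit value. */
record High(Expr value) implements Expr {}

final class FuseSimplifier {
    private FuseSimplifier() {}

    /** Rewrites the tree bottom-up, cancelling projections against fuses. */
    static Expr simplify(Expr e) {
        if (e instanceof Low l) {
            Expr v = simplify(l.value());
            // Low(Fuse(lo, hi)) -> lo
            return (v instanceof Fuse f) ? f.low() : new Low(v);
        }
        if (e instanceof High h) {
            Expr v = simplify(h.value());
            // High(Fuse(lo, hi)) -> hi
            return (v instanceof Fuse f) ? f.high() : new High(v);
        }
        if (e instanceof Fuse f) {
            Expr lo = simplify(f.low());
            Expr hi = simplify(f.high());
            // Fuse(Low(x), High(x)) -> x
            if (lo instanceof Low l && hi instanceof High h && l.value().equals(h.value())) {
                return l.value();
            }
            return new Fuse(lo, hi);
        }
        if (e instanceof Add128 a) {
            return new Add128(simplify(a.left()), simplify(a.right()));
        }
        return e; // Leaf: nothing to simplify
    }
}
```

For example, when the result of `a + b` is stored to an array and reloaded as an input of `+ c`, the tree `Low(Add128(Fuse(Low(a+b), High(a+b)), c))` simplifies to `Low(Add128(Add128(a, b), c))`, leaving only the final projection.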
> On 09/18/2018 08:24 AM, Gilles Duboscq wrote:
> > Hi Martin,
> >
> > One way to do that is to combine both solutions: use "projection"
> > nodes to model the "multiple outputs" part while still emitting the
> > whole code sequence at once.
> >
> > In the graph you have:
> > ```
> > add128Node = Add128Node(long low1, long high1, long low2, long high2)
> > result[0] = Add128LowNode(add128Node) // this is a projection of Add128Node
> > result[1] = Add128HighNode(add128Node) // this is a projection of Add128Node
> > return = Add128CarryNode(add128Node) // this is a projection of Add128Node
> > ```
> >
> > In `Add128Node.generate`, you will need to generate a LIR Op that has 3 results:
> >
> > ```
> > class Add128Op extends AMD64LIRInstruction {
> >    @Use({REG, STACK}) protected AllocatableValue low1; // TODO might need HINTs
> >    @Use({REG, STACK}) protected AllocatableValue low2;
> >    // TODO high1/high2 are still read after lowResult is written, so they
> >    // may need to be @Alive to keep them out of lowResult's register.
> >    @Use({REG, STACK}) protected AllocatableValue high1;
> >    @Use({REG, STACK}) protected AllocatableValue high2;
> >
> >    @Def({REG}) protected AllocatableValue lowResult;
> >    @Def({REG}) protected AllocatableValue highResult;
> >    @Def({REG}) protected AllocatableValue carryResult;
> >    ...
> >    @Override
> >    public void emitCode(CompilationResultBuilder crb, AMD64MacroAssembler masm) {
> >      // see AMD64Binary.CommutativeTwoOp#emitCode
> >      AllocatableValue lowInput;
> >      if (sameRegister(lowResult, low2)) {
> >        lowInput = low1;
> >      } else {
> >        AMD64Move.move(crb, masm, lowResult, low1);
> >        lowInput = low2;
> >      }
> >      // TODO deal with stack vs reg etc.
> >      masm.add(asRegister(lowResult), asRegister(lowInput));
> >      // TODO set up highInput (same pattern as lowInput), stack vs reg etc.
> >      masm.adc(asRegister(highResult), asRegister(highInput));
> >      // The carry flag set by adc must be read before any other flag-writing instruction.
> >      AMD64ControlFlow.cmove(crb, masm, carryResult, false, ConditionFlag.CarrySet, false,
> >        new ConstantValue(toRegisterKind(AMD64Kind.BYTE), JavaConstant.forBoolean(true)),
> >        new ConstantValue(toRegisterKind(AMD64Kind.BYTE), JavaConstant.forBoolean(false)));
> >    }
> > }
> > ```
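A likely reason to emit ADD, ADC and the carry read from one LIR instruction: ADC consumes the carry flag set by the preceding ADD, and the boolean result reads the carry flag set by ADC, so no unrelated flag-writing instruction may land between them. The LIR does not model the flags as values, so keeping the sequence inside a single instruction is presumably the simplest way to guarantee that. The toy model below (a hypothetical `CarryFlagModel` class, not Graal or assembler API) simulates the carry flag in Java to make the dependency explicit.

```java
// Toy model of the x86 carry flag (CF); not an assembler and not Graal code.
final class CarryFlagModel {
    private boolean cf; // the simulated carry flag

    /** Like ADD: returns a + b and sets CF on unsigned carry-out of bit 63. */
    long add(long a, long b) {
        long r = a + b;
        cf = Long.compareUnsigned(r, a) < 0;
        return r;
    }

    /** Like ADC: returns a + b + CF and sets CF on unsigned carry-out. */
    long adc(long a, long b) {
        long partial = a + b;
        long r = partial + (cf ? 1 : 0);
        // At most one of the two additions can carry out.
        cf = Long.compareUnsigned(partial, a) < 0 || Long.compareUnsigned(r, partial) < 0;
        return r;
    }

    /** Like SETC: materializes CF as a boolean. */
    boolean setc() {
        return cf;
    }

    /** Stands in for an unrelated flag-writing instruction that leaves CF clear. */
    void unrelatedFlagWrite() {
        cf = false;
    }
}
```

Calling `add(-1L, 1L)` and then `adc(0L, 0L)` yields a low word of 0 and a high word of 1; if `unrelatedFlagWrite()` runs between the two, the carry is lost and the high word stays 0.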
> >
> > During `Add128Node.generate`, remember the values you used for `lowResult`, `highResult`, and `carryResult`:
> >
> > ```
> >    LIRGeneratorTool gen = tool.getLIRGeneratorTool();
> >    AllocatableValue low1Value = gen.asAllocatable(tool.operand(low1));
> >    ...
> >    this.lowResultValue = gen.newVariable(LIRKind.value(AMD64Kind.QWORD));
> >    ...
> >    // append() returns the instruction, not a Value, so it cannot be passed to
> >    // setResult(); the projection nodes read the remembered result values instead.
> >    gen.append(new Add128Op(
> >      low1Value, low2Value, high1Value, high2Value,
> >      lowResultValue, highResultValue, carryResultValue));
> > ```
> >
> > In `Add128LowNode.generate`, just do: `tool.setResult(this, getAdd128Node().getLowResultValue());`
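Stripped of Graal specifics, the pattern is: the tuple node emits one instruction with three result variables and remembers them, and each projection's `generate` emits nothing and just hands out a remembered variable. The toy model below uses hypothetical `Lir` and `Add128Tuple` classes (not Graal API), with virtual registers as plain strings. The check in the projections mirrors the assumption that the tuple node is generated before its projections, which should hold because a node's inputs are scheduled before its usages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the "tuple node + projections" pattern; not Graal API.
// The LIR is a list of instruction strings and virtual registers are names like "v0".
final class Lir {
    final List<String> instructions = new ArrayList<>();
    private int nextVariable;

    String newVariable() {
        return "v" + nextVariable++;
    }
}

final class Add128Tuple {
    private String lowResult, highResult, carryResult; // remembered during generate

    /** Emits a single instruction with three results and remembers them. */
    void generate(Lir lir, String low1, String high1, String low2, String high2) {
        lowResult = lir.newVariable();
        highResult = lir.newVariable();
        carryResult = lir.newVariable();
        lir.instructions.add(String.format("add128 %s, %s, %s <- %s, %s, %s, %s",
                lowResult, highResult, carryResult, low1, high1, low2, high2));
    }

    private void checkGenerated() {
        if (lowResult == null) {
            throw new IllegalStateException("tuple must be generated before its projections");
        }
    }

    // Projections emit no code; they only return the remembered results.
    String lowProjection() { checkGenerated(); return lowResult; }
    String highProjection() { checkGenerated(); return highResult; }
    String carryProjection() { checkGenerated(); return carryResult; }
}
```

However many projections are generated, the LIR still holds exactly one `add128` instruction.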
> >
> > I hope that helps.
> >   Gilles
> >
> > On 14/09/18 20:17, Martin Traverso wrote:
> >> Hi,
> >>
> >> I'm playing around with Graal, and as an experiment, I'm trying to see
> >> what it would take to intrinsify some operations to do math on 128-bit
> >> values.
> >>
> >> I have a method with the following signature:
> >>
> >>      boolean add128(long low1, long high1, long low2, long high2, long[] result)
> >>
> >> It computes the sum of two 128-bit integers encoded in two longs each
> >> and stores the result in the 2-element array that's provided via the
> >> last argument. It returns true if the sum overflows.
> >>
> >> I'd like to emit the equivalent of the following assembly pseudocode:
> >>
> >>     result[0] = ADD low1 low2
> >>     result[1] = ADC high1 high2
> >>     return = (carry == 1)
> >>
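When testing an intrinsic like this, a plain-Java version of `add128` to compare against is handy, and it could also serve as the method body that the plugin replaces. The sketch below assumes "overflows" means an unsigned carry out of bit 127, i.e. the carry flag after ADC, matching `return = (carry == 1)`; if signed 128-bit overflow were intended, the result would come from the overflow flag instead. The class name `Add128Reference` is hypothetical.

```java
// Plain-Java reference for add128, assuming "overflows" means unsigned carry out of bit 127.
final class Add128Reference {
    private Add128Reference() {}

    static boolean add128(long low1, long high1, long low2, long high2, long[] result) {
        long low = low1 + low2;
        // ADD: the low word carries iff the wrapped sum is unsigned-smaller than an operand.
        long carryLow = Long.compareUnsigned(low, low1) < 0 ? 1 : 0;
        long highPartial = high1 + high2;
        long high = highPartial + carryLow; // ADC: add the incoming carry
        // At most one of the two high-word additions can carry out.
        boolean carryOut = Long.compareUnsigned(highPartial, high1) < 0
                || Long.compareUnsigned(high, highPartial) < 0;
        result[0] = low;
        result[1] = high;
        return carryOut;
    }
}
```

For example, adding 1 to the value with `low = -1L` (all ones) and `high = 0` carries into the high word, giving `{0, 1}` and `false`; if the high word is also all ones, the result is `{0, 0}` and `true`.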
> >> From what I gathered so far, I should add a new node (e.g., Add128Node)
> >> and register a graph builder plugin that replaces invocations of that
> >> method with the new node.
> >>
> >> But that's where I'm getting stuck. Two paths I've started exploring:
> >> 1. Lower the Add128Node into operations that perform the sums of the
> >> high vs low parts (e.g., Add128LowNode, Add128HighNode), do the
> >> assignments to the resulting array, etc. This would seem to require
> >> modeling operations that produce multiple outputs (low + low produces
> >> one value + carry). Is this even possible?
> >> 2. Make Add128Node LIRLowerable and generate the whole sequence of
> >> low-level operations in one shot. I'm not sure how the assignments to
> >> the output array and return value would fit here, though.
> >>
> >> I'm sure I'm missing something obvious, so I appreciate any pointers or
> >> suggestions. Are there similar examples I can draw inspiration from?
> >>
> >> Thanks,
> >> Martin
> >>
> >
>