TaggedArrays
Jim Laskey
jlaskey at me.com
Thu Sep 13 04:14:59 PDT 2012
On 2012-09-12, at 11:06 PM, Charles Oliver Nutter <headius at headius.com> wrote:
> On Wed, Sep 12, 2012 at 5:24 PM, Jim Laskey <jlaskey at me.com> wrote:
>> The builtin tagged array (TaggedArrayNativeImpl) has no levels of
>> indirection, the simple and optimized forms are fall backs. So the
>> performance is similar to Java arrays. And yes the code that Rickard
>> submitted here is for native performance.
>>
>> One would assume your math operations would expect tagged integers, so
>> there should only be ints and BigIntegers (rare).
>>
>> Naive Ex:
>
> This is very illustrative, thank you! I never got around to commenting
> on your original email, so I'm taking some time now (I'm also doing a
> talk on "high performance Ruby" at Golden Gate RubyConf on Friday, so
> I'm looking at perf and numbers again).
>
>> TaggedArray stack = TaggedArrays.allocate(512 * 1024);
>> int SP = 0;
>>
>> SP = add(stack, SP);
>>
>>
>> int add(TaggedArray stack, int SP) {
>>     try {
>>         long x = stack.getValue(SP--); // simple array indexing with tag detection
>>         long y = stack.getValue(SP--); // simple array indexing with tag detection
>>         long z = x + y - 1;            // simple integer addition, accounting for the tag bit
>>         isOverflow(z);                 // throws an exception on overflow
>>         stack.setValue(++SP, z);       // simple array indexing with tag detection
>>         return SP;
>>     } catch (Exception e) {
>>         // slow case for arithmetic overflow, BigIntegers and stack overflow exceptions
>>     }
>> }
>> So most of the time you will be using straight ints. Tag detection is a bit
>> test and branch.
>
> I get this example, but I'll describe it a bit for myself (and maybe others).
>
> * Each thread gets a TaggedArray to use for storing values. You bite
> chunks off this by passing an on-stack (Java stack) int offset along
> with the TaggedArray, and all local variable use would work with the
> TaggedArray elements rather than Java locals.
> * Because we don't store locals in Object slots on the Java stack
> frame, we aren't forced to use boxed numerics all the time.
> * Accessing the elements anywhere requires passing the TaggedArray and
> the offset where the value is, e.g. to math logic (which can either
> receive the SP where the result should go or return an SP stating
> where it put the result).
>
> As Mark pointed out, for those of us with growable instance var lists,
> those would be TaggedArray implementations, and we could store long
> values directly.
>
> Yes, I can see how this would work. HOWEVER... :)
>
> It seems like it's still impossible for this to ever approach the
> performance of actual Java primitives (or tagged numerics) living
> directly on the JVM stack. At a minimum, you're doing all that
> bouncing in and out of TaggedArray, which for any non-trivial program
> will be hundreds of elements long. Indeed, most of that should be in
> cache, but it's still not going to be as fast as JVM local/stack
> access, right?
No and yes. To compare apples to apples, let's describe another TaggedArray implementation more closely tied to how hardware frames work. Stack frames on hardware like x86 are typically managed with stack and frame pointers (esp and ebp on x86). On entry to a method, the prologue instructions do something like:
Prologue:
mem[--SP] = FP;  // save old frame pointer
FP = SP;         // frame pointer updated for new function
SP -= frameSize; // allocate frame

Epilogue:
SP = FP;
FP = mem[SP++];
After the prologue, arguments are accessed via FP[+offset] and locals via FP[-offset], both fixed offsets from the frame pointer FP. The HotSpot prologue is a bit more complicated than this, but for illustration this will suffice.
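To make that prologue/epilogue discipline concrete, here is a small self-contained Java sketch that models it on a plain long[]. The class name, stack size, frame size, and local offset are all mine, purely illustrative; the point is only the pointer bookkeeping.

```java
// Illustrative sketch: the hardware prologue/epilogue modeled on a long[],
// with SP/FP as plain ints. Stack grows downward, as on x86.
public class FrameDemo {
    static final long[] mem = new long[1024];
    static int SP = mem.length;
    static int FP = SP;

    static long callWithFrame(int frameSize) {
        // prologue
        mem[--SP] = FP;           // save old frame pointer
        FP = SP;                  // frame pointer updated for new function
        SP -= frameSize;          // allocate frame; locals live at FP - offset

        mem[FP - 1] = 42;         // write a local at a fixed offset from FP
        long local = mem[FP - 1]; // read it back the same way

        // epilogue
        SP = FP;                  // release the frame
        FP = (int) mem[SP++];     // restore caller's frame pointer
        return local;
    }

    public static void main(String[] args) {
        System.out.println(callWithFrame(4)); // prints 42
    }
}
```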
So if we did the equivalent in the one-stack-per-thread version, we would do something like this (note the stack grows positively here):
int foo(TaggedArray stack, int SP) {
    int FP = SP;     // local copy (cheating)
    SP += frameSize; // in slots
    ...
    T1 x = (T1)stack.getReference(FP - argOffset);   // fetch an argument
    T2 y = (T2)stack.getReference(FP + localOffset); // fetch a local
    SP = bar(stack, SP - nArgs);                     // call another function
    ...
    return FP;
}
The cost here, versus hardware frames, is that SP and FP are not in dedicated registers like esp/ebp, and we have to use stack[FP + offset] addressing instead of FP[offset] (plus the tag bit test).
The alternative is to allocate frames per function.
void foo(TaggedArray scope, int scopeFP) {
    TaggedArray locals = TaggedArrays.allocate(nLocals + nMaxArgs);
    ...
    T1 x = (T1)scope.getReference(scopeFP + argOffset); // fetch an argument
    T2 y = (T2)locals.getReference(localOffset);        // fetch a local
    bar(locals, nLocals);                               // call another function
    ...
    // no epilogue
}
In this case, we win by accessing all locals as locals[offset] (plus the tag bit test), which is pretty much equivalent to native access of locals. The cost, of course, is the per-call allocation of locals.
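Since the "tag bit test" shows up in every access above, here is a hedged sketch of one common way such a test could work. The actual TaggedArrays encoding isn't spelled out in this thread; this models the usual low-bit scheme where a set bit marks a small integer and a clear bit marks a reference slot. All names here are mine.

```java
// Illustrative low-bit tagging: (v << 1) | 1 encodes an int; an even word
// would denote a reference slot. Tag detection is a bit test and branch.
public class TagDemo {
    static long tagInt(int v)    { return ((long) v << 1) | 1; } // set tag bit
    static boolean isInt(long w) { return (w & 1) != 0; }        // bit test
    static int untagInt(long w)  { return (int) (w >> 1); }      // arithmetic shift undoes the tag

    public static void main(String[] args) {
        long w = tagInt(21);
        if (isInt(w)) {
            // fast path: plain int arithmetic
            System.out.println(untagInt(w) + 21); // prints 42
        } else {
            // slow path: w would index a reference (e.g. a BigInteger)
        }
    }
}
```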
>
> I am intrigued by the idea, I will admit. It's not something I can
> spend a day wiring into JRuby, of course, but I'm going to ponder it a
> bit. JRuby has managed to get pretty solid performance even with boxed
> numbers by slowly and carefully *eliminating* our artificial stack as
> much as possible. Many Ruby methods now use nothing but JVM locals,
> and for algorithms where we can make that guarantee we're many times
> faster than the next fastest Ruby impl. But by using JVM locals and
> stack for everything, we are also backing ourselves into a bit of a
> corner, since we have to use Object references everywhere and the JVM
> is not doing a good enough job of seeing through them.
>
There is no reason not to use native locals for intermediate values, and I would recommend it, since you will likely introduce stronger typing at that point anyway.
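A minimal sketch of that recommendation, with the tagged stack modeled as a plain long[] (the helper and its offsets are hypothetical): operands come off the stack, but the intermediate sum lives in a JVM local, and only the final result is written back.

```java
// Illustrative only: intermediates in JVM locals, single write-back of the
// result to the (here untagged, long[]-modeled) value stack.
public class LocalsDemo {
    static int addThree(long[] stack, int SP) {
        long x = stack[SP--];   // operands come from the value stack
        long y = stack[SP--];
        long z = stack[SP--];
        long t = x + y;         // intermediate stays in a JVM local (register)
        stack[++SP] = t + z;    // only the final result is written back
        return SP;
    }

    public static void main(String[] args) {
        long[] stack = new long[16];
        stack[0] = 1; stack[1] = 2; stack[2] = 3;
        int sp = addThree(stack, 2);
        System.out.println(stack[sp]); // prints 6
    }
}
```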
> I might play with prototyping a similar impl in JRuby, just to explore
> feasibility.
>
> - Charlie
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev