Improving the performance of OpenJDK

Wed Feb 18 04:37:19 PST 2009

Gary Benson wrote:
> Andrew Haley wrote:
>> Right.  The whole idea of the way it's done ATM is bonkers: do a
>> byte-at-a-time unaligned load into machine order, then reverse the
>> bytes.  Maybe the hope was that the compiler would see all this
>> cruft and silently convert it into an efficient form, but, er, no.
>> :-(
> 
> How does it work for longer types?  For a 64-bit value, for instance,
> is it better to always do 8 individual loads, or might it be better
> to try and optimize like they have done?

It depends on the frequency of execution.  If the alignment is no
better than random with a uniform distribution, then half the time
you'll be looking at an address that is not aligned for any type
larger than a byte.  If so, there's no point checking for special cases.
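
To make the arithmetic concrete, here is a minimal sketch that counts how
addresses fall into alignment classes.  The names natural_alignment,
count_with_alignment and check_distribution are made up for this
illustration, and it assumes the addresses really are uniformly
distributed, which real data may not be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest power-of-two alignment, capped at 8, that an address satisfies.  */
unsigned natural_alignment (uintptr_t addr)
{
  unsigned a = 1;

  while (a < 8 && (addr & a) == 0)
    a <<= 1;
  return a;
}

/* Count the addresses in [0, n) whose natural alignment is exactly ALIGN.  */
size_t count_with_alignment (unsigned align, size_t n)
{
  size_t count = 0;

  for (size_t addr = 0; addr < n; addr++)
    if (natural_alignment ((uintptr_t) addr) == align)
      count++;
  return count;
}

/* Half of all addresses are odd, so only byte-aligned; an eighth are
   aligned for a 64-bit type.  */
void check_distribution (void)
{
  assert (count_with_alignment (1, 64) == 32);
  assert (count_with_alignment (8, 64) == 8);
}
```

Over 64 consecutive addresses that gives 32 that are only byte-aligned,
16 that are 2-byte aligned, 8 that are 4-byte aligned and 8 that are
8-byte aligned, so a fast path for aligned 64-bit loads would be taken
about one time in eight.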

It is, however, worth avoiding 64-bit operations on 32-bit platforms,
so this is probably the best way to do a 64-bit big-endian load, at
least on gcc:

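unsigned long long foo6 (unsigned char *p)
{
  /* Bytes 0..3 form the high word.  unsigned long is 32 bits wide on
     typical 32-bit targets, so these shifts need no 64-bit arithmetic.  */
  unsigned long u1;

  u1 = ((unsigned long)p[0] << 24
	| (unsigned long)p[1] << 16
	| (unsigned long)p[2] << 8
	| p[3]);

  /* Bytes 4..7 form the low word, assembled the same way.  */
  unsigned long u2;

  u2 = ((unsigned long)p[4] << 24
	| (unsigned long)p[5] << 16
	| (unsigned long)p[6] << 8
	| p[7]);

  /* Only the final combine uses a 64-bit type.  */
  return ((unsigned long long)u1<<32 | u2);
}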

The code generated here may be significantly better on 32-bit platforms
than the equivalent that uses unsigned long long, and not significantly
worse on 64-bit platforms.
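
For comparison, the "equivalent that uses unsigned long long" would
presumably look something like the sketch below.  The names foo_ull and
check_foo_ull are made up for this illustration.  Every byte is widened
to 64 bits before it is shifted, so a 32-bit target has to carry out
each shift and OR on a register pair unless the compiler manages to
simplify it.

```c
#include <assert.h>

/* Hypothetical straightforward version: widen each byte to 64 bits,
   then shift it into place.  */
unsigned long long foo_ull (const unsigned char *p)
{
  return ((unsigned long long)p[0] << 56
          | (unsigned long long)p[1] << 48
          | (unsigned long long)p[2] << 40
          | (unsigned long long)p[3] << 32
          | (unsigned long long)p[4] << 24
          | (unsigned long long)p[5] << 16
          | (unsigned long long)p[6] << 8
          | (unsigned long long)p[7]);
}

/* Bytes in big-endian order must come back as the matching value.  */
void check_foo_ull (void)
{
  static const unsigned char bytes[8]
    = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08 };

  assert (foo_ull (bytes) == 0x0102030405060708ULL);
}
```

Both versions should return the same value for the same input, so the
only thing to compare is the generated code, for example by looking at
the output of gcc -O2 -S for each on the target in question.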

Andrew.