Boxing function types

Wed Nov 25 08:19:36 PST 2009

I'm not sure exactly why I was CCed on this and
haven't been closely following recent
discussions, but I signed up to this list to
reply and explain some of the issues here.

First: If you want parallel speedups (especially
for operations on aggregates/arrays/collections),
you need to minimize interference across parallel
threads, and you need to do this at all levels --
CPUs, memory, JVM support, Java support.

Boxing is among the enemies of parallelization.
When you box things, you get less locality,
hence more interference, among threads. If you
have for example arrays where each cell points to
some data rather than being that data, you
face the possibility that the actual data
are randomly strewn across memory. So you not only
get a lot more cache misses (which are increasingly
expensive) but you also get cacheline
ping-ponging when different data items, operated
on by different threads, just happen to sit nearby
in memory. In other words, boxing not only has
constant-factor time/space overhead, but is also
a scalability impediment.
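
To make the layout difference concrete, here is a
minimal sketch. The class and method names are
made up for illustration. It builds a Long[] whose
boxes are allocated in shuffled order, to
approximate the "randomly strewn" case; where the
objects actually end up depends on the JVM and
its collector, so treat this as a way to set up
an experiment, not as a benchmark.

```java
import java.util.Random;

// Hypothetical illustration class; not from any library.
class BoxedVsPrimitive {

    // Primitive array: the values are stored contiguously in the array.
    static long sumPrimitive(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    // Boxed array: each cell is a reference to a separate Long object,
    // so every element costs an extra pointer dereference.
    static long sumBoxed(Long[] a) {
        long s = 0;
        for (Long v : a) s += v; // implicit unboxing
        return s;
    }

    // Boxes src, allocating the boxes in shuffled index order so that
    // allocation order (and, typically, memory order) does not follow
    // index order. Actual placement is up to the JVM and GC.
    static Long[] boxShuffled(long[] src, long seed) {
        int n = src.length;
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        Long[] boxed = new Long[n];
        for (int k = 0; k < n; k++) {
            int idx = order[k];
            // One heap object per cell (values in -128..127 may be
            // shared from the Long cache instead).
            boxed[idx] = Long.valueOf(src[idx]);
        }
        return boxed;
    }
}
```

Both sums return the same value; the difference is
only in how much memory traffic each one causes.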

These are not small problems. You can easily
create ForkJoin programs that speed up linearly
when using localized data but barely speed up
at all otherwise (plus of course the sequential
case is slower to begin with under boxing).
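
As a rough sketch of the kind of ForkJoin program
meant here, the following sums a long[] by
recursive splitting. It assumes the fork/join
classes as they appear in java.util.concurrent in
current JDKs; the class name LocalSum, the method
parallelSum, and the THRESHOLD value are
arbitrary choices for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example task: sums data[lo..hi) by divide and conquer.
class LocalSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000; // arbitrary sequential cutoff

    final long[] data;
    final int lo, hi;

    LocalSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            // Leaf: each task scans its own contiguous slice.
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s; // boxed once per task, not once per element
        }
        int mid = (lo + hi) >>> 1;
        LocalSum left = new LocalSum(data, lo, mid);
        LocalSum right = new LocalSum(data, mid, hi);
        left.fork();                          // run left half asynchronously
        return right.compute() + left.join(); // do right half in this thread
    }

    static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new LocalSum(data, 0, data.length));
    }
}
```

With long[], each leaf task reads one contiguous
region. The same task written over a Long[] whose
boxes are scattered would have leaves that chase
pointers all over the heap, which is the case
that tends not to scale.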

So, improving locality control is among the
biggest upcoming issues for those of us trying
to improve parallelization support in core
libraries and JVMs. For example, I think it is
a sure thing that at some point there will
need to be embeddable value/struct/tuple support
(these things will probably NOT be java.lang.Objects).
This is already seen in languages like X10 and
Fortress, which are targeted to be able to run
on JVMs, so JVMs will probably evolve to support
them even if there is no Java language support.

There are plenty of other issues along these
lines too, like coping with MPs vs Multicore
vs MPs-of-Multicores wrt cache affinities vs
cache pollution. But not allowing efficiencies
when we can get them already (like the case of
plain scalars) would be a bad move. What I really,
really don't want to do is release an overhyped
facility for parallel programming that doesn't
actually speed up anyone's applications.

-Doug