Variability of the performance of Vector<E>
Daniel Lemire
daniel at lemire.me
Thu Jan 8 00:28:38 UTC 2026
The 25% is real, but it mostly affects simple functions that do little compute, such as a memory copy or a quick scan of an input.
Daniel Lemire, "Dot product on misaligned data," in *Daniel Lemire's blog*, July 14, 2025, https://lemire.me/blog/2025/07/14/dot-product-on-misaligned-data/.
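
To see why low-compute kernels are the ones hurt, consider the cache-line arithmetic: a chunk of bytes that starts mid-line can straddle one extra line, so a memory-bound loop issues more line fills per element. A minimal sketch of that arithmetic (illustrative only; a 64-byte line is assumed):

```java
public class CacheLineSpan {
    static final int LINE = 64; // assumed cache line size

    // Number of distinct cache lines touched by `bytes` bytes
    // starting at byte offset `offset` from a line boundary.
    static int linesTouched(int offset, int bytes) {
        int first = offset / LINE;
        int last = (offset + bytes - 1) / LINE;
        return last - first + 1;
    }

    public static void main(String[] args) {
        // A 64-byte chunk starting on a line boundary: 1 line.
        System.out.println(linesTouched(0, 64));  // 1
        // The same chunk starting 4 bytes into a line: 2 lines.
        System.out.println(linesTouched(4, 64));  // 2
    }
}
```

A compute-heavy loop amortizes the extra fills; a copy or scan does not, which is where the ~25% shows up.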
>
> > I am worried about the variability of the performance of Vector<E>. Worse, I am worried about how to explain to users the variability of the performance of Vector<E>.
> …
> > Aligning arrays to avoid occasional 25% performance loss seems like a worthwhile goal. I would like to open a discussion about how that end might be achieved.
>
> full message: https://mail.openjdk.org/pipermail/hotspot-gc-dev/2026-January/056951.html
>
> Much of this is a question for panama-dev, where we are very aware
> of the difficulties of building a user model for vector programming.
> I appreciate the similar thread from you here on panama-dev:
> https://mail.openjdk.org/pipermail/panama-dev/2025-September/021141.html
>
> But the final question is very important for the GC folks, and is indeed
> worth a discussion. Actually we have been discussing it one way or another
> for years, at a low level of urgency. Maybe there are enough concurrent
> factors to justify taking the plunge (in 2026?) towards hyper-aligned
> Java heap objects, at least large arrays.
>
> Predictable vector performance requires aligned arrays, aligned either
> to cache lines or to some hardware vector size.
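
For power-of-two alignments, the round-up and padding computations that any such scheme needs are a couple of bit operations. A minimal sketch (not from the JDK sources):

```java
public class AlignUp {
    // Round `addr` up to the next multiple of `align` (a power of two).
    static long alignUp(long addr, long align) {
        return (addr + align - 1) & -align;
    }

    // Bytes of padding needed before `addr` reaches alignment.
    static long padding(long addr, long align) {
        return alignUp(addr, align) - addr;
    }

    public static void main(String[] args) {
        System.out.println(alignUp(100, 64)); // 128
        System.out.println(padding(128, 64)); // 0: already aligned
    }
}
```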
>
> For Valhalla, if we decide to work with 128-bit value containers,
> we would need not only arrays but also class instances that are
> aligned to 128 bits. (Why 128-bit containers? Well, they are
> efficiently atomic on ARM and maybe x86, and many Valhalla types
> would use them if they could. But misaligning them spoils
> atomicity. Valhalla is limited to 64-bit flattening as long as
> the existing heap alignment schemes are present.)
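
The "misaligning spoils atomicity" point reduces to a simple predicate: a 16-byte value is a candidate for a single 128-bit atomic access only when it is naturally aligned, i.e. does not straddle a 16-byte boundary. A toy illustration (offsets are hypothetical, not a real heap layout):

```java
public class Atomic128 {
    // True if a field of `size` bytes at byte offset `offset` is
    // naturally aligned; `size` is assumed to be a power of two.
    static boolean naturallyAligned(long offset, long size) {
        return offset % size == 0;
    }

    public static void main(String[] args) {
        System.out.println(naturallyAligned(32, 16)); // true
        System.out.println(naturallyAligned(40, 16)); // false: straddles a boundary
    }
}
```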
>
> Aligning an array is not exactly the same task as aligning an object,
> since for arrays you should align the address &a[0], while for an
> object o you must align some field &o.f, but you can get away with
> aligning &o._header (and put padding after the object header).
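
The array case above can be sketched as layout arithmetic: if the array object itself starts on the target boundary, padding the header up to the next boundary puts &a[0] where we want it. (Header and alignment sizes below are hypothetical, not HotSpot's real layout.)

```java
public class HyperAlignLayout {
    // Offset of the first array element, assuming the array object
    // starts on an `align` boundary and the header occupies `header`
    // bytes: round the header size up to the next boundary.
    static int firstElementOffset(int header, int align) {
        return (header + align - 1) & -align;
    }

    public static void main(String[] args) {
        // Hypothetical 16-byte header, 64-byte target alignment:
        // elements start 64 bytes in (48 bytes of post-header padding).
        System.out.println(firstElementOffset(16, 64)); // 64
    }
}
```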
>
> In this space, there are lots of ways to pick out a set of
> requirements and techniques.
>
> Fundamentally, though, the GC needs to recognize that some array or
> (maybe) object is subject to hyper-alignment, and perform special-case
> allocation on it. There’s lots of bookkeeping around that fundamental,
> including sizing logic (might we need more space for inter-object padding?)
> and of course the initial contracts. (I.e., how does the user request
> hyper-alignment?)
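
The sizing question has a simple worst-case bound: if the allocator only guarantees some base alignment, reserving space for a hyper-aligned object needs up to `align - baseAlign` extra bytes of inter-object padding. A sketch with hypothetical numbers:

```java
public class SizingWorstCase {
    // Worst-case reservation for a `payload`-byte object that must be
    // `align`-aligned, given the allocator only guarantees `baseAlign`.
    static long reservationFor(long payload, long align, long baseAlign) {
        return payload + (align - baseAlign);
    }

    public static void main(String[] args) {
        // 4 KiB payload, 64-byte target, 8-byte base guarantee.
        System.out.println(reservationFor(4096, 64, 8)); // 4152
    }
}
```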
>
> And there is the delicate question of optimization: How do we keep
> hot loops in the GC from acquiring an extra data-dependent test (for
> the "hyper-align bit"). Can we hide the test under another pre-existing
> test? Can we batch things so that normally aligned objects are segregated
> from the hyper-aligned ones, and version our hot loops accordingly?
>
> ("Another pre-existing test" — I’m thinking something like an object
> header test that already exists, where some rare object header bit
> configuration must already be tested for, and is expected to be
> rare. In that case, all hyper-aligned objects, whether arrays or
> not, would be put into that rare header state, and on the rarely
> taken path in the GC loop that handles the pre-existing rare state,
> we’d also handle the case of hyper-alignment. Seems likely that
> would be an option…)
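
A toy rendering of that idea, to make the control flow concrete: the hot loop keeps its single pre-existing test, and hyper-alignment is handled only on the already-rare path. (Bit values and method names here are hypothetical, not HotSpot's.)

```java
public class HeaderBits {
    static final int RARE_MASK   = 0b1000; // pre-existing rare header state
    static final int HYPER_ALIGN = 0b0100; // new bit, only set together with RARE_MASK

    static String visit(int header) {
        if ((header & RARE_MASK) == 0) {
            return "fast";  // common objects: one test, exactly as before
        }
        // Rare path that already exists; hyper-alignment is folded in here,
        // so the hot loop acquires no extra data-dependent branch.
        return (header & HYPER_ALIGN) != 0 ? "rare+hyper" : "rare";
    }

    public static void main(String[] args) {
        System.out.println(visit(0));                       // fast
        System.out.println(visit(RARE_MASK | HYPER_ALIGN)); // rare+hyper
    }
}
```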
>
> On top of all that is portability — we have to do this work several
> times, once for each GC. Or, if a particular GC configuration cannot
> support hyper-alignment, the user model must offer a fallback.
>
> (The fallback might look like, "If you use the vector API you should
> really enable Z or G1". And also, "You get better flattening for
> larger value objects if you run on ARM with Z or G1.")
>
> I haven’t even begun to assign header bits here. The problems are
> deeper than that!
>
> I will say one more thing about arrays: I think it would be very
> reasonable to align all arrays larger than a certain threshold size,
> fully to the platform cache line size, so that &a[0] starts at
> a cache line boundary. This goes for primitives, values, whatever.
>
> Call this particular feature "large array hyper alignment".
>
> It might be a first feature to implement in this space.
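
Until heap arrays get such a guarantee, off-heap storage can already be cache-line aligned today; one present-day sketch uses `ByteBuffer.alignedSlice` (available since JDK 9) to carve a 64-byte-aligned view out of an over-allocated direct buffer:

```java
import java.nio.ByteBuffer;

public class AlignedScratch {
    // Returns a direct buffer view of at least `bytes` bytes whose
    // position 0 lies on an `align`-byte boundary (`align` a power of two).
    static ByteBuffer alignedDirect(int bytes, int align) {
        // Over-allocate by `align`, then slice to an aligned region.
        return ByteBuffer.allocateDirect(bytes + align).alignedSlice(align);
    }

    public static void main(String[] args) {
        ByteBuffer buf = alignedDirect(4096, 64);
        // alignmentOffset(0, 64) == 0 means position 0 is 64-byte aligned.
        System.out.println(buf.alignmentOffset(0, 64)); // 0
    }
}
```

This is a workaround, not a substitute: it gives aligned storage only for manually managed buffers, not for ordinary Java arrays, which is exactly the gap large array hyper alignment would close.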
>
> I have filed an RFE here: https://bugs.openjdk.org/browse/JDK-8374748
>
> Note that many hot GC loops that process arrays are O(a.length).
> This means that doing a little extra work for long lengths is
> almost by definition a negligible overhead.
>
> Large array hyper alignment would neatly solve Peter’s problem.
>
> And it would give us a head start towards Valhalla atomics,
> as long as we didn’t paint ourselves into some corner.
> The RFE mentions some possible follow-on features.
>