Variability of the performance of Vector<E>
Daniel Lemire
daniel at lemire.me
Thu Jan 8 00:28:38 UTC 2026
The 25% is real, but it mostly affects simple functions that do little compute, such as a memory copy or a quick scan of an input.
Daniel Lemire, "Dot product on misaligned data," in *Daniel Lemire's blog*, July 14, 2025, https://lemire.me/blog/2025/07/14/dot-product-on-misaligned-data/.
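
To see why low-compute kernels are the ones hurt, consider the cache-line arithmetic: a chunk of bytes that starts mid-line can straddle one extra line, so a memory-bound loop issues more line fills per element. A minimal sketch of that arithmetic (illustrative only; a 64-byte line is assumed):

```java
public class CacheLineSpan {
    static final int LINE = 64; // assumed cache line size

    // Number of distinct cache lines touched by `bytes` bytes
    // starting at byte offset `offset` from a line boundary.
    static int linesTouched(int offset, int bytes) {
        int first = offset / LINE;
        int last = (offset + bytes - 1) / LINE;
        return last - first + 1;
    }

    public static void main(String[] args) {
        // A 64-byte chunk starting on a line boundary: 1 line.
        System.out.println(linesTouched(0, 64));  // 1
        // The same chunk starting 4 bytes into a line: 2 lines.
        System.out.println(linesTouched(4, 64));  // 2
    }
}
```

A compute-heavy loop amortizes the extra fills; a copy or scan does not, which is where the ~25% shows up.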
>
> > I am worried about the variability of the performance of Vector<E>. Worse, I am worried about how to explain to users the variability of the performance of Vector<E>.
> …
> > Aligning arrays to avoid occasional 25% performance loss seems like a worthwhile goal. I would like to open a discussion about how that end might be achieved.
>
> full message: https://mail.openjdk.org/pipermail/hotspot-gc-dev/2026-January/056951.html
>
> Much of this is a question for panama-dev, where we are very aware
> of the difficulties of building a user model for vector programming.
> I appreciate the similar thread from you here on panama-dev:
> https://mail.openjdk.org/pipermail/panama-dev/2025-September/021141.html
>
> But the final question is very important for the GC folks, and is indeed
> worth a discussion. Actually we have been discussing it one way or another
> for years, at a low level of urgency. Maybe there are enough concurrent
> factors to justify taking the plunge (in 2026?) towards hyper-aligned
> Java heap objects, at least large arrays.
>
> Predictable vector performance requires aligned arrays, aligned either
> to cache lines or to some hardware vector size.
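
For power-of-two alignments, the round-up and padding computations that any such scheme needs are a couple of bit operations. A minimal sketch (not from the JDK sources):

```java
public class AlignUp {
    // Round `addr` up to the next multiple of `align` (a power of two).
    static long alignUp(long addr, long align) {
        return (addr + align - 1) & -align;
    }

    // Bytes of padding needed before `addr` reaches alignment.
    static long padding(long addr, long align) {
        return alignUp(addr, align) - addr;
    }

    public static void main(String[] args) {
        System.out.println(alignUp(100, 64)); // 128
        System.out.println(padding(128, 64)); // 0: already aligned
    }
}
```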
>
> For Valhalla, if we decide to work with 128-bit value containers,
> we would need not only arrays but also class instances that are
> aligned to 128 bits. (Why 128-bit containers? Well, they are
> efficiently atomic on ARM and maybe x86, and many Valhalla types
> would use them if they could. But misaligning them spoils
> atomicity. Valhalla is limited to 64-bit flattening as long as
> the existing heap alignment schemes are present.)
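
The "misaligning spoils atomicity" point reduces to a simple predicate: a 16-byte value is a candidate for a single 128-bit atomic access only when it is naturally aligned, i.e. does not straddle a 16-byte boundary. A toy illustration (offsets are hypothetical, not a real heap layout):

```java
public class Atomic128 {
    // True if a field of `size` bytes at byte offset `offset` is
    // naturally aligned; `size` is assumed to be a power of two.
    static boolean naturallyAligned(long offset, long size) {
        return offset % size == 0;
    }

    public static void main(String[] args) {
        System.out.println(naturallyAligned(32, 16)); // true
        System.out.println(naturallyAligned(40, 16)); // false: straddles a boundary
    }
}
```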
>
> Aligning an array is not exactly the same task as aligning an object,
> since for arrays you should align the address &a[0], while for an
> object o you must align some field &o.f, but you can get away with
> aligning &o._header (and put padding after the object header).
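
The array case above can be sketched as layout arithmetic: if the array object itself starts on the target boundary, padding the header up to the next boundary puts &a[0] where we want it. (Header and alignment sizes below are hypothetical, not HotSpot's real layout.)

```java
public class HyperAlignLayout {
    // Offset of the first array element, assuming the array object
    // starts on an `align` boundary and the header occupies `header`
    // bytes: round the header size up to the next boundary.
    static int firstElementOffset(int header, int align) {
        return (header + align - 1) & -align;
    }

    public static void main(String[] args) {
        // Hypothetical 16-byte header, 64-byte target alignment:
        // elements start 64 bytes in (48 bytes of post-header padding).
        System.out.println(firstElementOffset(16, 64)); // 64
    }
}
```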
>
> In this space, there are lots of ways to pick out a set of
> requirements and techniques.
>
> Fundamentally, though, the GC needs to recognize that some array or
> (maybe) object is subject to hyper-alignment, and perform special-case
> allocation on it. There’s lots of bookkeeping around that fundamental,
> including sizing logic (might we need more space for inter-object padding?)
> and of course the initial contracts. (I.e., how does the user request
> hyper-alignment?)
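
The sizing question has a simple worst-case bound: if the allocator only guarantees some base alignment, reserving space for a hyper-aligned object needs up to `align - baseAlign` extra bytes of inter-object padding. A sketch with hypothetical numbers:

```java
public class SizingWorstCase {
    // Worst-case reservation for a `payload`-byte object that must be
    // `align`-aligned, given the allocator only guarantees `baseAlign`.
    static long reservationFor(long payload, long align, long baseAlign) {
        return payload + (align - baseAlign);
    }

    public static void main(String[] args) {
        // 4 KiB payload, 64-byte target, 8-byte base guarantee.
        System.out.println(reservationFor(4096, 64, 8)); // 4152
    }
}
```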
>
> And there is the delicate question of optimization: How do we keep
> hot loops in the GC from acquiring an extra data-dependent test (for
> the "hyper-align bit"). Can we hide the test under another pre-existing
> test? Can we batch things so that normally aligned objects are segregated
> from the hyper-aligned ones, and version our hot loops accordingly?
>
> ("Another pre-existing test" — I’m thinking something like an object
> header test that already exists, where some rare object header bit
> configuration must already be tested for, and is expected to be
> rare. In that case, all hyper-aligned objects, whether arrays or
> not, would be put into that rare header state, and on the rarely
> taken path in the GC loop that handles the pre-existing rare state,
> we’d also handle the case of hyper-alignment. Seems likely that
> would be an option…)
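
A toy rendering of that idea, to make the control flow concrete: the hot loop keeps its single pre-existing test, and hyper-alignment is handled only on the already-rare path. (Bit values and method names here are hypothetical, not HotSpot's.)

```java
public class HeaderBits {
    static final int RARE_MASK   = 0b1000; // pre-existing rare header state
    static final int HYPER_ALIGN = 0b0100; // new bit, only set together with RARE_MASK

    static String visit(int header) {
        if ((header & RARE_MASK) == 0) {
            return "fast";  // common objects: one test, exactly as before
        }
        // Rare path that already exists; hyper-alignment is folded in here,
        // so the hot loop acquires no extra data-dependent branch.
        return (header & HYPER_ALIGN) != 0 ? "rare+hyper" : "rare";
    }

    public static void main(String[] args) {
        System.out.println(visit(0));                       // fast
        System.out.println(visit(RARE_MASK | HYPER_ALIGN)); // rare+hyper
    }
}
```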
>
> On top of all that is portability — we have to do this work several
> times, once for each GC. Or, if a particular GC configuration cannot
> support hyper-alignment, the user model must offer a fallback.
>
> (The fallback might look like, "If you use the vector API you should
> really enable Z or G1". And also, "You get better flattening for
> larger value objects if you run on ARM with Z or G1.")
>
> I haven’t even begun to assign header bits here. The problems are
> deeper than that!
>
> I will say one more thing about arrays: I think it would be very
> reasonable to align all arrays larger than a certain threshold size,
> fully to the platform cache line size, so that &a[0] starts at
> a cache line boundary. This goes for primitives, values, whatever.
>
> Call this particular feature "large array hyper alignment".
>
> It might be a first feature to implement in this space.
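
Until heap arrays get such a guarantee, off-heap storage can already be cache-line aligned today; one present-day sketch uses `ByteBuffer.alignedSlice` (available since JDK 9) to carve a 64-byte-aligned view out of an over-allocated direct buffer:

```java
import java.nio.ByteBuffer;

public class AlignedScratch {
    // Returns a direct buffer view of at least `bytes` bytes whose
    // position 0 lies on an `align`-byte boundary (`align` a power of two).
    static ByteBuffer alignedDirect(int bytes, int align) {
        // Over-allocate by `align`, then slice to an aligned region.
        return ByteBuffer.allocateDirect(bytes + align).alignedSlice(align);
    }

    public static void main(String[] args) {
        ByteBuffer buf = alignedDirect(4096, 64);
        // alignmentOffset(0, 64) == 0 means position 0 is 64-byte aligned.
        System.out.println(buf.alignmentOffset(0, 64)); // 0
    }
}
```

This is a workaround, not a substitute: it gives aligned storage only for manually managed buffers, not for ordinary Java arrays, which is exactly the gap large array hyper alignment would close.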
>
> I have filed an RFE here: https://bugs.openjdk.org/browse/JDK-8374748
>
> Note that many hot GC loops that process arrays are O(a.length).
> This means that doing a little extra work for long lengths is
> almost by definition a negligible overhead.
>
> Large array hyper alignment would neatly solve Peter’s problem.
>
> And it would give us a head start towards Valhalla atomics,
> as long as we didn’t paint ourselves into some corner.
> The RFE mentions some possible follow-on features.
>