<!DOCTYPE html><html><head><title></title></head><body><div>The 25% is real, but it mostly affects simple functions that do little compute, like a memory copy or a quick scan of an input. </div><div><br></div><div><span class="color" style="color:rgb(34, 49, 63);">Daniel Lemire, "Dot product on misaligned data," in </span><i>Daniel Lemire's blog</i><span class="color" style="color:rgb(34, 49, 63);">, July 14, 2025, </span><a href="https://lemire.me/blog/2025/07/14/dot-product-on-misaligned-data/">https://lemire.me/blog/2025/07/14/dot-product-on-misaligned-data/</a><span class="color" style="color:rgb(34, 49, 63);">.</span></div><div><br></div><div><br></div><div><br></div><blockquote type="cite" id="qt" style=""><div><br></div><div>> I am worried about the variability of the performance of Vector&lt;E&gt;. Worse, I am worried about how to explain to users the variability of the performance of Vector&lt;E&gt;.</div><div>…</div><div>> Aligning arrays to avoid occasional 25% performance loss seems like a worthwhile goal. I would like to open a discussion about how that end might be achieved.</div><div><br></div><div>full message: <a href="https://mail.openjdk.org/pipermail/hotspot-gc-dev/2026-January/056951.html">https://mail.openjdk.org/pipermail/hotspot-gc-dev/2026-January/056951.html</a></div><div><br></div><div>Much of this is a question for panama-dev, where we are very aware</div><div>of the difficulties of building a user model for vector programming.</div><div>I appreciate the similar thread from you here on panama-dev:</div><div><a href="https://mail.openjdk.org/pipermail/panama-dev/2025-September/021141.html">https://mail.openjdk.org/pipermail/panama-dev/2025-September/021141.html</a></div><div><br></div><div>But the final question is very important for the GC folks, and is indeed</div><div>worth a discussion. Actually we have been discussing it one way or another</div><div>for years, at a low level of urgency.
Maybe there are enough concurrent</div><div>factors to justify taking the plunge (in 2026?) towards hyper-aligned</div><div>Java heap objects, at least large arrays.</div><div><br></div><div>Predictable vector performance requires aligned arrays, aligned either</div><div>to cache lines or to some hardware vector size.</div><div><br></div><div>For Valhalla, if we decide to work with 128-bit value containers,</div><div>we would need not only arrays but also class instances that are</div><div>aligned to 128 bits. (Why 128-bit containers? Well, they are</div><div>efficiently atomic on ARM and maybe x86, and many Valhalla types</div><div>would use them if they could. But misaligning them spoils</div><div>atomicity. Valhalla is limited to 64-bit flattening as long as</div><div>the existing heap alignment schemes are present.)</div><div><br></div><div>Aligning an array is not exactly the same task as aligning an object,</div><div>since for arrays you should align the address &a[0], while for an</div><div>object o you must align some field &o.f, but you can get away with</div><div>aligning &o._header (and put padding after the object header).</div><div><br></div><div>In this space, there are lots of ways to pick out a set of</div><div>requirements and techniques.</div><div><br></div><div>Fundamentally, though, the GC needs to recognize that some array or</div><div>(maybe) object is subject to hyper-alignment, and perform special-case</div><div>allocation on it. There’s lots of bookkeeping around that fundamental,</div><div>including sizing logic (might we need more space for inter-object padding?)</div><div>and of course the initial contracts. (I.e., how does the user request</div><div>hyper-alignment?)</div><div><br></div><div>And there is the delicate question of optimization: How do we keep</div><div>hot loops in the GC from acquiring an extra data-dependent test (for</div><div>the "hyper-align bit")? Can we hide the test under another pre-existing</div><div>test?
Can we batch things so that normally aligned objects are segregated</div><div>from the hyper-aligned ones, and version our hot loops accordingly?</div><div><br></div><div>("Another pre-existing test" — I’m thinking something like an object</div><div>header test that already exists, where some rare object header bit</div><div>configuration must already be tested for, and is expected to be</div><div>rare. In that case, all hyper-aligned objects, whether arrays or</div><div>not, would be put into that rare header state, and on the rarely</div><div>taken path in the GC loop that handles the pre-existing rare state,</div><div>we’d also handle the case of hyper-alignment. Seems likely that</div><div>would be an option…)</div><div><br></div><div>On top of all that is portability — we have to do this work several</div><div>times, once for each GC. Or, if a particular GC configuration cannot</div><div>support hyper-alignment, the user model must offer a fallback.</div><div><br></div><div>(The fallback might look like, "If you use the vector API you should</div><div>really enable Z or G1". And also, "You get better flattening for</div><div>larger value objects if you run on ARM with Z or G1.")</div><div><br></div><div>I haven’t even begun to assign header bits here. The problems are</div><div>deeper than that!</div><div><br></div><div>I will say one more thing about arrays: I think it would be very</div><div>reasonable to align all arrays larger than a certain threshold size,</div><div>fully to the platform cache line size, so that the &a[0] starts at</div><div>a cache line boundary.
This goes for primitives, values, whatever.</div><div><br></div><div>Call this particular feature "large array hyper alignment".</div><div><br></div><div>It might be a first feature to implement in this space.</div><div><br></div><div>I have filed an RFE here: <a href="https://bugs.openjdk.org/browse/JDK-8374748">https://bugs.openjdk.org/browse/JDK-8374748</a></div><div><br></div><div>Note that many hot GC loops that process arrays are O(a.length).</div><div>This means that doing a little extra work for long lengths is</div><div>almost by definition a negligible overhead.</div><div><br></div><div>Large array hyper alignment would neatly solve Peter’s problem.</div><div><br></div><div>And it would give us a head start towards Valhalla atomics,</div><div>as long as we didn’t paint ourselves into some corner.</div><div>The RFE mentions some possible follow-on features.</div><div><br></div></blockquote><div><br></div></body></html>