RFR: 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup

Johannes Graham duke at openjdk.org
Fri Apr 4 16:47:19 UTC 2025


On Mon, 6 Jan 2025 13:18:50 GMT, Shaojin Wen <swen at openjdk.org> wrote:

> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup also avoids the performance degradation that occurs when the lookup misses the cache.
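
For readers unfamiliar with the technique named in the quoted description, here is a minimal sketch of how `Long.expand` and a SWAR step could be combined to format 32 bits as eight hex digits without a lookup table. It is an illustration only, not the code from the PR: the class and method names (`SwarHexSketch`, `expandNibbles`, `toHexDigits`, `hex8`) are hypothetical, and it assumes JDK 19 or later, where `Long.expand` is available.

```java
import java.nio.charset.StandardCharsets;

final class SwarHexSketch {
    // Hypothetical helper: spread the 8 nibbles of v into the low half of 8 bytes.
    static long expandNibbles(int v) {
        return Long.expand(v & 0xFFFF_FFFFL, 0x0F0F_0F0F_0F0F_0F0FL);
    }

    // Hypothetical helper: turn 8 bytes, each holding 0..15, into 8 ASCII
    // lowercase hex digits, processing all bytes in one long at once.
    static long toHexDigits(long nibbles) {
        // Bit 4 of a byte ends up set exactly when its nibble is >= 10.
        long m = (nibbles + 0x0606_0606_0606_0606L) & 0x1010_1010_1010_1010L;
        // Add '0' to every byte, plus 39 ('a' - '0' - 10) where the nibble is >= 10.
        // No byte can overflow: the largest result is 15 + 48 + 39 = 102 ('f').
        return nibbles + 0x3030_3030_3030_3030L + (m >>> 4) * 39;
    }

    // Format v as 8 hex characters, most significant digit first.
    static String hex8(int v) {
        long digits = toHexDigits(expandNibbles(v));
        byte[] out = new byte[8];
        for (int k = 0; k < 8; k++) {
            out[k] = (byte) (digits >>> (56 - 8 * k));
        }
        return new String(out, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(hex8(0x12abcdef));
    }
}
```

The byte loop in `hex8` is only there to keep the sketch readable; a tuned implementation would presumably write the whole `long` into the output array in one store.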

By stepping through the code of `Long.expand` and substituting in the constants, I came up with this:


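    static long expandNibbles(long i) {
        // Inlined version of Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL)
        // Copy bits 16..31 up by 16 into bits 32..47.
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        // Copy the upper byte of each 16-bit group up by 8.
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        // Copy the upper nibble of each remaining byte up by 4.
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);

        // Clear the stale copies, leaving one nibble in the low half of each byte.
        return i & 0x0F0F_0F0F_0F0F_0F0FL;
    }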
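
One way to gain confidence in the hand-inlined version is to compare it against `Long.expand` on a batch of random inputs. The following is a rough sketch of such a check, assuming JDK 19 or later; the class name `ExpandNibblesCheck` and the helper `matchesLongExpand` are made up for illustration, and the method body is copied from the snippet above.

```java
import java.util.SplittableRandom;

final class ExpandNibblesCheck {
    static final long NIBBLE_MASK = 0x0F0F_0F0F_0F0F_0F0FL;

    // Hand-inlined expansion, copied from the snippet above.
    static long expandNibbles(long i) {
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & NIBBLE_MASK;
    }

    // Hypothetical helper: true if both versions agree on `samples` random inputs.
    static boolean matchesLongExpand(long seed, int samples) {
        SplittableRandom rnd = new SplittableRandom(seed);
        for (int n = 0; n < samples; n++) {
            long v = rnd.nextLong();
            if (expandNibbles(v) != Long.expand(v, NIBBLE_MASK)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesLongExpand(42L, 1_000_000));
        System.out.println(Long.toHexString(expandNibbles(0x12345678L)));
    }
}
```

A random comparison like this only checks correctness; it says nothing about whether C2 compiles the two versions to the same code, which is the question discussed below.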


This looks like it might actually do better than *Method 2*. If inlining and constant folding are happening in the non-intrinsic `Long.expand`, I would imagine it would perform comparably to this.

The non-intrinsified Java code should be able to run as quickly as the hand-inlined version.

I think I've found an issue that prevents the code from being constant-folded as expected: C2 does not seem to constant-fold XOR nodes.

See https://github.com/openjdk/jdk/pull/23089 for an attempt at addressing this.

There are no XOR nodes in `expandNibbles`.
![image](https://github.com/user-attachments/assets/057bc8fc-62a2-4fab-8d56-8e0128dac3cd)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2584577398
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2588342173
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590840422

