[EXT] Follow-up results for SwissTable with Vector API
Gavin Ray
ray.gavin97 at gmail.com
Sun Jan 8 17:36:03 UTC 2023
Thanks Yifan, much appreciated!
It seems like a compelling implementation for read-heavy workloads.
I too would be curious about the reasons for performance difference between
the default HashMap for other operations.
On Sat, Jan 7, 2023 at 2:09 PM Zhu, Yifan <yzhu104 at ur.rochester.edu> wrote:
> Highlights of the dropped message:
>
>
> 1. Using ByteVector instead of Vector<Byte> gives better performance
> 2. ByteVector::lt(byte) is not optimized for loop invariant values
> generating a boardcase on each iteration, so I cache a final value for
> loop, and it improves the performance
> 3. For swisstable on x86, AVX2/AVX512 has a slightly lower peak thrput
> due to higher latency. (thou the paralism of wider register is better)Si
>
>
> --------------------------------------------------------------------------------------
>
> I finall get time to run the full benchmark. Here are current results:
>
> *Update:* As suggested, I added a generic implementation without vector
> API to do comparison.
>
> JVM info
>
> # VM version: JDK 19.0.1, Java HotSpot(TM) 64-Bit Server VM, 19.0.1+10-21
>
> Server Info
>
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 48 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Vendor ID: AuthenticAMD
> Model name: AMD EPYC 7773X 64-Core Processor
> CPU family: 25
> Model: 1
> Thread(s) per core: 2
> Core(s) per socket: 64
> Socket(s): 1
> Stepping: 2
> Frequency boost: enabled
> CPU(s) scaling MHz: 64%
> CPU max MHz: 3527.7339
> CPU min MHz: 1500.0000
> BogoMIPS: 4399.93
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
> aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skini
> t wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx sm
> ap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale
> vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
> Virtualization features:
> Virtualization: AMD-V
> Caches (sum of all):
> L1d: 2 MiB (64 instances)
> L1i: 2 MiB (64 instances)
> L2: 32 MiB (64 instances)
> L3: 768 MiB (8 instances)
> NUMA:
> NUMA node(s): 1
> NUMA node0 CPU(s): 0-127
> Vulnerabilities:
> Itlb multihit: Not affected
> L1tf: Not affected
> Mds: Not affected
> Meltdown: Not affected
> Mmio stale data: Not affected
> Retbleed: Not affected
> Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
> Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
> Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
> Srbds: Not affected
> Tsx async abort: Not affected
>
> Random Operation (Three hashmap implementations execute a same randomly
> generated sequence of insert/find/remove operations. The sequence length is
> 100000):
>
> Benchmark Mode Cnt Score Error Units
> RandomOperations.genericSwissTableRandomOperation thrpt 5 264.657 ± 2.645 ops/s
> RandomOperations.hashMapRandomOperation thrpt 5 280.280 ± 11.416 ops/s
> RandomOperations.swissTableRandomOperation thrpt 5 304.007 ± 6.635 ops/s
>
> Find Operation (Three hashmap implementations find the same generated data
> sequence. The string keys are around 190 bytes. SwissTables are using
> WyHash as hasher. The sequence length is 100000)
>
> Benchmark Mode Cnt Score Error Units
> FindBenchmark.genericSwissTableFindExistingLong thrpt 5 440.841 ± 2.525 ops/s
> FindBenchmark.genericSwissTableFindExistingString thrpt 5 500.632 ± 9.925 ops/s
> FindBenchmark.genericSwissTableFindMissingLong thrpt 5 852.220 ± 3.159 ops/s
> FindBenchmark.genericSwissTableFindMissingString thrpt 5 1024.492 ± 6.017 ops/s
> FindBenchmark.hashTableFindExistingLong thrpt 5 848.031 ± 5.488 ops/s
> FindBenchmark.hashTableFindExistingString thrpt 5 1247.213 ± 11.066 ops/s
> FindBenchmark.hashTableFindMissingLong thrpt 5 884.276 ± 0.983 ops/s
> FindBenchmark.hashTableFindMissingString thrpt 5 1386.718 ± 72.998 ops/s
> FindBenchmark.swissTableFindExistingLong thrpt 5 717.106 ± 1.653 ops/s
> FindBenchmark.swissTableFindExistingString thrpt 5 689.134 ± 11.573 ops/s
> FindBenchmark.swissTableFindMissingLong thrpt 5 1143.098 ± 5.033 ops/s
> FindBenchmark.swissTableFindMissingString thrpt 5 1562.995 ± 893.966 ops/s
>
> *Notice: I am not sure why java.util.HashMap performs better when finding
> existing keys, is there any specialization when JVM sees that the finding
> sequence is the same as the insertion sequence?*
>
> Insert Operation (Three hashmap implementations insert the same generated
> data sequence. The string keys are around 190 bytes. For string, all three
> tables use cached WyHash value (overloaded method). For long integer, the
> identity hash is used. The sequence length is 100000)
>
> Benchmark Mode Cnt Score Error Units
> InsertionBenchmark.genericSwissTableLongInsertion thrpt 5 251.994 ± 6.383 ops/s
> InsertionBenchmark.genericSwissTableStringInsertion thrpt 5 242.455 ± 8.091 ops/s
> InsertionBenchmark.hashMapLongInsertion thrpt 5 157.068 ± 10.528 ops/s
> InsertionBenchmark.hashMapStringInsertion thrpt 5 157.860 ± 5.385 ops/s
> InsertionBenchmark.swissTableLongInsertion thrpt 5 281.914 ± 3.051 ops/s
> InsertionBenchmark.swissTableStringInsertion thrpt 5 265.384 ± 2.884 ops/s
>
> Iteration (Iterate through the whole map using iterator)
>
> Benchmark Mode Cnt Score Error Units
> Iteration.genericSwissTableIteration thrpt 5 1615.441 ± 44.650 ops/s
> Iteration.hashMapIteration thrpt 5 562.280 ± 3.662 ops/s
> Iteration.swissTableIteration thrpt 5 2667.606 ± 1454.997 ops/s
>
>
>
>
>
> Schrodinger ZHU Yifan, Ph.D. Student
> Computer Science Department, University of Rochester
>
> *Personal Email:* i at zhuyi.fan
> *Work Email:* yifanzhu at rochester.edu
> *Website:* https://www.cs.rochester.edu/~yzhu104/Main.html
> *Github:* SchrodingerZhu
> *GPG Fingerprint:* BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
>
>
> ------------------------------
> *发件人:* Gavin Ray <ray.gavin97 at gmail.com>
> *发送时间:* 2023年1月7日 23:28
> *收件人:* Paul Sandoz <paul.sandoz at oracle.com>
> *抄送:* Zhu, Yifan <yzhu104 at UR.Rochester.edu>; panama-dev at openjdk.org <
> panama-dev at openjdk.org>
> *主题:* Re: [EXT] Follow-up results for SwissTable with Vector API
>
> Zhu I'm very interested in this discussion, in the event there were mails
> that were dropped, FWIW
> A SwissTable implementation based on Vector intrinsics + FFM API would be
> super useful for a lot of applications.
>
> This is the history that I see:
>
> [image: image.png]
>
> On Fri, Jan 6, 2023 at 11:32 AM Paul Sandoz <paul.sandoz at oracle.com>
> wrote:
>
> In some further replies I just noticed you dropped the panama-dev email.
> Resend a summary of the discussion?
>
> Paul.
>
> > On Jan 5, 2023, at 2:16 PM, Zhu, Yifan <yzhu104 at UR.Rochester.edu> wrote:
> >
> > I am confused. It seems that my replies are detached from the mailling
> list. It that expected?
> >
> >
> > <Outlook-3cjuahvq.png>
> > Schrodinger ZHU Yifan, Ph.D. Student
> > Computer Science Department, University of Rochester
> >
> > Personal Email: i at zhuyi.fan
> > Work Email: yifanzhu at rochester.edu
> > Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> > Github: SchrodingerZhu
> > GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
> >
> > <Outlook-fjqcbbcv.svg>
> > 发件人: panama-dev <panama-dev-retn at openjdk.org> 代表 Paul Sandoz <
> paul.sandoz at oracle.com>
> > 发送时间: 2023年1月6日 0:29
> > 收件人: Zhu, Yifan <yzhu104 at UR.Rochester.edu>
> > 抄送: panama-dev at openjdk.org <panama-dev at openjdk.org>
> > 主题: [EXT] Re: Follow-up results for SwissTable with Vector API
> >
> > Hi,
> >
> > I saw you sent another email prior to this, but for some reason it got
> lost by the moderation system. (Since you are not a member of the list the
> emails need to be moderated and approved.)
> >
> >
> > > On Jan 5, 2023, at 8:09 AM, Zhu, Yifan <yzhu104 at UR.Rochester.edu>
> wrote:
> > >
> > > This is the following up message for
> https://urldefense.com/v3/__https://mail.openjdk.org/pipermail/jdk-dev/2023-January/007288.html__;!!CGUSO5OYRnA7CQ!c6-WVSHfkvXgbKEtNWhxgdZ9EHDDMbmUz9AbxpvbYN54xt_4LzTwYJd4PdHmueDBCmryWsWBXfjE-Jpw_Cfrf6WyAw$
> .
> > >
> > > > You do:
> > > > converted.intoMemorySegment(MemorySegment.ofArray(control), offset,
> ByteOrder.nativeOrder());
> > > >
> > > > Can you just do:
> > > >
> > > > converted.intoArray(control, offset);
> > >
> > >
> > > I did so because I found that Vector<Byte> actually does not have that
> method.
> >
> > Ah, yes. There could be an perf issue with memory segment access,
> although since you had to wrap the array in a segment there will be some
> cost to that. It’s like if you wrapped the control array in a segment and
> stored in a field it would work better.
> >
> >
> > > After your suggestion, I switched to use ByteVector instead by
> Vector<Byte>. Surprisingly, this time the hashmap delivers a better
> performance. It 2~3 times faster during the insertion procedure.
> >
> > Good!
> >
> >
> > > However, there was still a performance gap behind the standard hashmap
> during finding precedure.
> > >
> > > For the ease of discussion, I attach the relevant code here:
> > >
> > > private int findWithHash(long hash, K key) {
> > > byte h2 = Util.h2(hash); //highest 7 bits
> > > int position = Util.h1(hash) & bucketMask; // h1 is just long to
> int
> > > int stride = 0;
> > > while (true) {
> > > var mask = matchByte(position, h2).toLong(); // match byte is
> to load a vector of byte and do equality comparison
> > > while (MaskIterator.hasNext(mask)) {
> > > var bit = MaskIterator.getNext(mask);
> > > mask = MaskIterator.moveNext(mask);
> > > var index = (position + bit) & bucketMask;
> > > if (key.equals(keys[index])) return index;
> > > }
> > >
> > > if (matchEmpty(position).anyTrue()) {
> > > return -1;
> > > }
> > >
> > > stride += VECTOR_LENGTH;
> > > position = (position + stride) & bucketMask;
> > > }
> > > }
> > > From Intellij IDEA's profiler, it seems that a large portion of time
> is spent on building the vectormask. I see there is an underlying bTest
> operation converting the results to boolean array and then give the mask.
> Will this be internally optimized to a single movemask operation by JVM?
> > >
> >
> > Can you get an inline/compilation trace like you did for insert?
> >
> > The VectorMask.toLong method is an intrinsic method.
> >
> > Try:
> >
> > var vmask = matchByte(position, h2);
> > var mask = mask.toLong();
> >
> > Probably will not make any difference, but if the findIInsertSlot
> performed ok operating on the mask returned from matchEmptyOrDelete it
> points to an issue with VectorMask.toLong.
> >
> > Paul.
> >
> > >
> > > <Outlook-ejiaczyb.png>
> > > Schrodinger ZHU Yifan, Ph.D. Student
> > > Computer Science Department, University of Rochester
> > >
> > > Personal Email: i at zhuyi.fan
> > > Work Email: yifanzhu at rochester.edu
> > > Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> > > Github: SchrodingerZhu
> > > GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
> > >
> > > <Outlook-3nrq0klq.svg>
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 44468 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/image-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-1nxecacu.png
Type: image/png
Size: 19680 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/Outlook-1nxecacu-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-v4r1y0a4
Type: image/svg+xml
Size: 41432 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/Outlook-v4r1y0a4-0001.svg>
More information about the panama-dev
mailing list