回复: [EXT] Follow-up results for SwissTable with Vector API
Zhu, Yifan
yzhu104 at UR.Rochester.edu
Sat Jan 7 19:08:57 UTC 2023
Highlights of the dropped message:
1. Using ByteVector instead of Vector<Byte> gives better performance
2. ByteVector::lt(byte) is not optimized for loop invariant values generating a boardcase on each iteration, so I cache a final value for loop, and it improves the performance
3. For swisstable on x86, AVX2/AVX512 has a slightly lower peak thrput due to higher latency. (thou the paralism of wider register is better)Si
--------------------------------------------------------------------------------------
I finall get time to run the full benchmark. Here are current results:
Update: As suggested, I added a generic implementation without vector API to do comparison.
JVM info
# VM version: JDK 19.0.1, Java HotSpot(TM) 64-Bit Server VM, 19.0.1+10-21
Server Info
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7773X 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 64%
CPU max MHz: 3527.7339
CPU min MHz: 1500.0000
BogoMIPS: 4399.93
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skini
t wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx sm
ap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale
vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 768 MiB (8 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-127
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Random Operation (Three hashmap implementations execute a same randomly generated sequence of insert/find/remove operations. The sequence length is 100000):
Benchmark Mode Cnt Score Error Units
RandomOperations.genericSwissTableRandomOperation thrpt 5 264.657 ± 2.645 ops/s
RandomOperations.hashMapRandomOperation thrpt 5 280.280 ± 11.416 ops/s
RandomOperations.swissTableRandomOperation thrpt 5 304.007 ± 6.635 ops/s
Find Operation (Three hashmap implementations find the same generated data sequence. The string keys are around 190 bytes. SwissTables are using WyHash as hasher. The sequence length is 100000)
Benchmark Mode Cnt Score Error Units
FindBenchmark.genericSwissTableFindExistingLong thrpt 5 440.841 ± 2.525 ops/s
FindBenchmark.genericSwissTableFindExistingString thrpt 5 500.632 ± 9.925 ops/s
FindBenchmark.genericSwissTableFindMissingLong thrpt 5 852.220 ± 3.159 ops/s
FindBenchmark.genericSwissTableFindMissingString thrpt 5 1024.492 ± 6.017 ops/s
FindBenchmark.hashTableFindExistingLong thrpt 5 848.031 ± 5.488 ops/s
FindBenchmark.hashTableFindExistingString thrpt 5 1247.213 ± 11.066 ops/s
FindBenchmark.hashTableFindMissingLong thrpt 5 884.276 ± 0.983 ops/s
FindBenchmark.hashTableFindMissingString thrpt 5 1386.718 ± 72.998 ops/s
FindBenchmark.swissTableFindExistingLong thrpt 5 717.106 ± 1.653 ops/s
FindBenchmark.swissTableFindExistingString thrpt 5 689.134 ± 11.573 ops/s
FindBenchmark.swissTableFindMissingLong thrpt 5 1143.098 ± 5.033 ops/s
FindBenchmark.swissTableFindMissingString thrpt 5 1562.995 ± 893.966 ops/s
Notice: I am not sure why java.util.HashMap performs better when finding existing keys, is there any specialization when JVM sees that the finding sequence is the same as the insertion sequence?
Insert Operation (Three hashmap implementations insert the same generated data sequence. The string keys are around 190 bytes. For string, all three tables use cached WyHash value (overloaded method). For long integer, the identity hash is used. The sequence length is 100000)
Benchmark Mode Cnt Score Error Units
InsertionBenchmark.genericSwissTableLongInsertion thrpt 5 251.994 ± 6.383 ops/s
InsertionBenchmark.genericSwissTableStringInsertion thrpt 5 242.455 ± 8.091 ops/s
InsertionBenchmark.hashMapLongInsertion thrpt 5 157.068 ± 10.528 ops/s
InsertionBenchmark.hashMapStringInsertion thrpt 5 157.860 ± 5.385 ops/s
InsertionBenchmark.swissTableLongInsertion thrpt 5 281.914 ± 3.051 ops/s
InsertionBenchmark.swissTableStringInsertion thrpt 5 265.384 ± 2.884 ops/s
Iteration (Iterate through the whole map using iterator)
Benchmark Mode Cnt Score Error Units
Iteration.genericSwissTableIteration thrpt 5 1615.441 ± 44.650 ops/s
Iteration.hashMapIteration thrpt 5 562.280 ± 3.662 ops/s
Iteration.swissTableIteration thrpt 5 2667.606 ± 1454.997 ops/s
[cid:178fa4da-cb40-4f72-ada8-2546493562f3]
Schrodinger ZHU Yifan, Ph.D. Student
Computer Science Department, University of Rochester
Personal Email: i at zhuyi.fan
Work Email: yifanzhu at rochester.edu
Website: https://www.cs.rochester.edu/~yzhu104/Main.html
Github: SchrodingerZhu
GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
[cid:b8a5c99e-343f-4275-8e2f-1084525d9b41]
________________________________
发件人: Gavin Ray <ray.gavin97 at gmail.com>
发送时间: 2023年1月7日 23:28
收件人: Paul Sandoz <paul.sandoz at oracle.com>
抄送: Zhu, Yifan <yzhu104 at UR.Rochester.edu>; panama-dev at openjdk.org <panama-dev at openjdk.org>
主题: Re: [EXT] Follow-up results for SwissTable with Vector API
Zhu I'm very interested in this discussion, in the event there were mails that were dropped, FWIW
A SwissTable implementation based on Vector intrinsics + FFM API would be super useful for a lot of applications.
This is the history that I see:
[image.png]
On Fri, Jan 6, 2023 at 11:32 AM Paul Sandoz <paul.sandoz at oracle.com<mailto:paul.sandoz at oracle.com>> wrote:
In some further replies I just noticed you dropped the panama-dev email. Resend a summary of the discussion?
Paul.
> On Jan 5, 2023, at 2:16 PM, Zhu, Yifan <yzhu104 at UR.Rochester.edu<mailto:yzhu104 at UR.Rochester.edu>> wrote:
>
> I am confused. It seems that my replies are detached from the mailling list. It that expected?
>
>
> <Outlook-3cjuahvq.png>
> Schrodinger ZHU Yifan, Ph.D. Student
> Computer Science Department, University of Rochester
>
> Personal Email: i at zhuyi.fan
> Work Email: yifanzhu at rochester.edu<mailto:yifanzhu at rochester.edu>
> Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> Github: SchrodingerZhu
> GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
>
> <Outlook-fjqcbbcv.svg>
> 发件人: panama-dev <panama-dev-retn at openjdk.org<mailto:panama-dev-retn at openjdk.org>> 代表 Paul Sandoz <paul.sandoz at oracle.com<mailto:paul.sandoz at oracle.com>>
> 发送时间: 2023年1月6日 0:29
> 收件人: Zhu, Yifan <yzhu104 at UR.Rochester.edu<mailto:yzhu104 at UR.Rochester.edu>>
> 抄送: panama-dev at openjdk.org<mailto:panama-dev at openjdk.org> <panama-dev at openjdk.org<mailto:panama-dev at openjdk.org>>
> 主题: [EXT] Re: Follow-up results for SwissTable with Vector API
>
> Hi,
>
> I saw you sent another email prior to this, but for some reason it got lost by the moderation system. (Since you are not a member of the list the emails need to be moderated and approved.)
>
>
> > On Jan 5, 2023, at 8:09 AM, Zhu, Yifan <yzhu104 at UR.Rochester.edu<mailto:yzhu104 at UR.Rochester.edu>> wrote:
> >
> > This is the following up message for https://urldefense.com/v3/__https://mail.openjdk.org/pipermail/jdk-dev/2023-January/007288.html__;!!CGUSO5OYRnA7CQ!c6-WVSHfkvXgbKEtNWhxgdZ9EHDDMbmUz9AbxpvbYN54xt_4LzTwYJd4PdHmueDBCmryWsWBXfjE-Jpw_Cfrf6WyAw$ .
> >
> > > You do:
> > > converted.intoMemorySegment(MemorySegment.ofArray(control), offset, ByteOrder.nativeOrder());
> > >
> > > Can you just do:
> > >
> > > converted.intoArray(control, offset);
> >
> >
> > I did so because I found that Vector<Byte> actually does not have that method.
>
> Ah, yes. There could be an perf issue with memory segment access, although since you had to wrap the array in a segment there will be some cost to that. It’s like if you wrapped the control array in a segment and stored in a field it would work better.
>
>
> > After your suggestion, I switched to use ByteVector instead by Vector<Byte>. Surprisingly, this time the hashmap delivers a better performance. It 2~3 times faster during the insertion procedure.
>
> Good!
>
>
> > However, there was still a performance gap behind the standard hashmap during finding precedure.
> >
> > For the ease of discussion, I attach the relevant code here:
> >
> > private int findWithHash(long hash, K key) {
> > byte h2 = Util.h2(hash); //highest 7 bits
> > int position = Util.h1(hash) & bucketMask; // h1 is just long to int
> > int stride = 0;
> > while (true) {
> > var mask = matchByte(position, h2).toLong(); // match byte is to load a vector of byte and do equality comparison
> > while (MaskIterator.hasNext(mask)) {
> > var bit = MaskIterator.getNext(mask);
> > mask = MaskIterator.moveNext(mask);
> > var index = (position + bit) & bucketMask;
> > if (key.equals(keys[index])) return index;
> > }
> >
> > if (matchEmpty(position).anyTrue()) {
> > return -1;
> > }
> >
> > stride += VECTOR_LENGTH;
> > position = (position + stride) & bucketMask;
> > }
> > }
> > From Intellij IDEA's profiler, it seems that a large portion of time is spent on building the vectormask. I see there is an underlying bTest operation converting the results to boolean array and then give the mask. Will this be internally optimized to a single movemask operation by JVM?
> >
>
> Can you get an inline/compilation trace like you did for insert?
>
> The VectorMask.toLong method is an intrinsic method.
>
> Try:
>
> var vmask = matchByte(position, h2);
> var mask = mask.toLong();
>
> Probably will not make any difference, but if the findIInsertSlot performed ok operating on the mask returned from matchEmptyOrDelete it points to an issue with VectorMask.toLong.
>
> Paul.
>
> >
> > <Outlook-ejiaczyb.png>
> > Schrodinger ZHU Yifan, Ph.D. Student
> > Computer Science Department, University of Rochester
> >
> > Personal Email: i at zhuyi.fan
> > Work Email: yifanzhu at rochester.edu<mailto:yifanzhu at rochester.edu>
> > Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> > Github: SchrodingerZhu
> > GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
> >
> > <Outlook-3nrq0klq.svg>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 44468 bytes
Desc: image.png
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/image-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-1nxecacu.png
Type: image/png
Size: 19680 bytes
Desc: Outlook-1nxecacu.png
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/Outlook-1nxecacu-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-v4r1y0a4
Type: image/svg+xml
Size: 41432 bytes
Desc: Outlook-v4r1y0a4
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/Outlook-v4r1y0a4-0001.svg>
More information about the panama-dev
mailing list