[EXT] Follow-up results for SwissTable with Vector API

Sun Jan 8 17:36:03 UTC 2023

Thanks Yifan, much appreciated!

It seems like a compelling implementation for read-heavy workloads.
I too would be curious about the reasons for performance difference between
the default HashMap for other operations.

On Sat, Jan 7, 2023 at 2:09 PM Zhu, Yifan <yzhu104 at ur.rochester.edu> wrote:

> Highlights of the dropped message:
>
>
>    1. Using ByteVector instead of Vector<Byte> gives better performance
>    2. ByteVector::lt(byte) is not optimized for loop invariant values
>    generating a boardcase on each iteration, so I cache a final value for
>    loop, and it improves the performance
>    3. For swisstable on x86, AVX2/AVX512 has a slightly lower peak thrput
>    due to higher latency. (thou the paralism of wider register is better)Si
>
>
> --------------------------------------------------------------------------------------
>
> I finall get time to run the full benchmark. Here are current results:
>
> *Update:* As suggested, I added a generic implementation without vector
> API to do comparison.
>
> JVM info
>
> # VM version: JDK 19.0.1, Java HotSpot(TM) 64-Bit Server VM, 19.0.1+10-21
>
> Server Info
>
> Architecture:            x86_64
>   CPU op-mode(s):        32-bit, 64-bit
>   Address sizes:         48 bits physical, 48 bits virtual
>   Byte Order:            Little Endian
> CPU(s):                  128
>   On-line CPU(s) list:   0-127
> Vendor ID:               AuthenticAMD
>   Model name:            AMD EPYC 7773X 64-Core Processor
>     CPU family:          25
>     Model:               1
>     Thread(s) per core:  2
>     Core(s) per socket:  64
>     Socket(s):           1
>     Stepping:            2
>     Frequency boost:     enabled
>     CPU(s) scaling MHz:  64%
>     CPU max MHz:         3527.7339
>     CPU min MHz:         1500.0000
>     BogoMIPS:            4399.93
>     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
>                          aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skini
>                          t wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx sm
>                          ap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale
>                          vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
> Virtualization features:
>   Virtualization:        AMD-V
> Caches (sum of all):
>   L1d:                   2 MiB (64 instances)
>   L1i:                   2 MiB (64 instances)
>   L2:                    32 MiB (64 instances)
>   L3:                    768 MiB (8 instances)
> NUMA:
>   NUMA node(s):          1
>   NUMA node0 CPU(s):     0-127
> Vulnerabilities:
>   Itlb multihit:         Not affected
>   L1tf:                  Not affected
>   Mds:                   Not affected
>   Meltdown:              Not affected
>   Mmio stale data:       Not affected
>   Retbleed:              Not affected
>   Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
>   Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
>   Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
>   Srbds:                 Not affected
>   Tsx async abort:       Not affected
>
> Random Operation (Three hashmap implementations execute a same randomly
> generated sequence of insert/find/remove operations. The sequence length is
> 100000):
>
> Benchmark                                           Mode  Cnt    Score    Error  Units
> RandomOperations.genericSwissTableRandomOperation  thrpt    5  264.657 ±  2.645  ops/s
> RandomOperations.hashMapRandomOperation            thrpt    5  280.280 ± 11.416  ops/s
> RandomOperations.swissTableRandomOperation         thrpt    5  304.007 ±  6.635  ops/s
>
> Find Operation (Three hashmap implementations find the same generated data
> sequence. The string keys are around 190 bytes. SwissTables are using
> WyHash as hasher. The sequence length is 100000)
>
> Benchmark                                           Mode  Cnt     Score     Error  Units
> FindBenchmark.genericSwissTableFindExistingLong    thrpt    5   440.841 ±   2.525  ops/s
> FindBenchmark.genericSwissTableFindExistingString  thrpt    5   500.632 ±   9.925  ops/s
> FindBenchmark.genericSwissTableFindMissingLong     thrpt    5   852.220 ±   3.159  ops/s
> FindBenchmark.genericSwissTableFindMissingString   thrpt    5  1024.492 ±   6.017  ops/s
> FindBenchmark.hashTableFindExistingLong            thrpt    5   848.031 ±   5.488  ops/s
> FindBenchmark.hashTableFindExistingString          thrpt    5  1247.213 ±  11.066  ops/s
> FindBenchmark.hashTableFindMissingLong             thrpt    5   884.276 ±   0.983  ops/s
> FindBenchmark.hashTableFindMissingString           thrpt    5  1386.718 ±  72.998  ops/s
> FindBenchmark.swissTableFindExistingLong           thrpt    5   717.106 ±   1.653  ops/s
> FindBenchmark.swissTableFindExistingString         thrpt    5   689.134 ±  11.573  ops/s
> FindBenchmark.swissTableFindMissingLong            thrpt    5  1143.098 ±   5.033  ops/s
> FindBenchmark.swissTableFindMissingString          thrpt    5  1562.995 ± 893.966  ops/s
>
> *Notice: I am not sure why java.util.HashMap performs better when finding
> existing keys, is there any specialization when JVM sees that the finding
> sequence is the same as the insertion sequence?*
>
> Insert Operation (Three hashmap implementations insert the same generated
> data sequence. The string keys are around 190 bytes. For string, all three
> tables use cached WyHash value (overloaded method). For long integer, the
> identity hash is used. The sequence length is 100000)
>
> Benchmark                                             Mode  Cnt    Score    Error  Units
> InsertionBenchmark.genericSwissTableLongInsertion    thrpt    5  251.994 ±  6.383  ops/s
> InsertionBenchmark.genericSwissTableStringInsertion  thrpt    5  242.455 ±  8.091  ops/s
> InsertionBenchmark.hashMapLongInsertion              thrpt    5  157.068 ± 10.528  ops/s
> InsertionBenchmark.hashMapStringInsertion            thrpt    5  157.860 ±  5.385  ops/s
> InsertionBenchmark.swissTableLongInsertion           thrpt    5  281.914 ±  3.051  ops/s
> InsertionBenchmark.swissTableStringInsertion         thrpt    5  265.384 ±  2.884  ops/s
>
> Iteration (Iterate through the whole map using iterator)
>
> Benchmark                              Mode  Cnt     Score      Error  Units
> Iteration.genericSwissTableIteration  thrpt    5  1615.441 ±   44.650  ops/s
> Iteration.hashMapIteration            thrpt    5   562.280 ±    3.662  ops/s
> Iteration.swissTableIteration         thrpt    5  2667.606 ± 1454.997  ops/s
>
>
>
>
>
> Schrodinger ZHU Yifan, Ph.D. Student
> Computer Science Department, University of Rochester
>
> *Personal Email:* i at zhuyi.fan
> *Work Email:* yifanzhu at rochester.edu
> *Website:* https://www.cs.rochester.edu/~yzhu104/Main.html
> *Github:* SchrodingerZhu
> *GPG Fingerprint:* BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
>
>
> ------------------------------
> *发件人:* Gavin Ray <ray.gavin97 at gmail.com>
> *发送时间:* 2023年1月7日 23:28
> *收件人:* Paul Sandoz <paul.sandoz at oracle.com>
> *抄送:* Zhu, Yifan <yzhu104 at UR.Rochester.edu>; panama-dev at openjdk.org <
> panama-dev at openjdk.org>
> *主题:* Re: [EXT] Follow-up results for SwissTable with Vector API
>
> Zhu I'm very interested in this discussion, in the event there were mails
> that were dropped, FWIW
> A SwissTable implementation based on Vector intrinsics + FFM API would be
> super useful for a lot of applications.
>
> This is the history that I see:
>
> [image: image.png]
>
> On Fri, Jan 6, 2023 at 11:32 AM Paul Sandoz <paul.sandoz at oracle.com>
> wrote:
>
> In some further replies I just noticed you dropped the panama-dev email.
> Resend a summary of the discussion?
>
> Paul.
>
> > On Jan 5, 2023, at 2:16 PM, Zhu, Yifan <yzhu104 at UR.Rochester.edu> wrote:
> >
> > I am confused. It seems that my replies are detached from the mailling
> list. It that expected?
> >
> >
> > <Outlook-3cjuahvq.png>
> > Schrodinger ZHU Yifan, Ph.D. Student
> > Computer Science Department, University of Rochester
> >
> > Personal Email: i at zhuyi.fan
> > Work Email: yifanzhu at rochester.edu
> > Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> > Github: SchrodingerZhu
> > GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
> >
> > <Outlook-fjqcbbcv.svg>
> > 发件人: panama-dev <panama-dev-retn at openjdk.org> 代表 Paul Sandoz <
> paul.sandoz at oracle.com>
> > 发送时间: 2023年1月6日 0:29
> > 收件人: Zhu, Yifan <yzhu104 at UR.Rochester.edu>
> > 抄送: panama-dev at openjdk.org <panama-dev at openjdk.org>
> > 主题: [EXT] Re: Follow-up results for SwissTable with Vector API
> >
> > Hi,
> >
> > I saw you sent another email prior to this, but for some reason it got
> lost by the moderation system. (Since you are not a member of the list the
> emails need to be moderated and approved.)
> >
> >
> > > On Jan 5, 2023, at 8:09 AM, Zhu, Yifan <yzhu104 at UR.Rochester.edu>
> wrote:
> > >
> > > This is the following up message for
> https://urldefense.com/v3/__https://mail.openjdk.org/pipermail/jdk-dev/2023-January/007288.html__;!!CGUSO5OYRnA7CQ!c6-WVSHfkvXgbKEtNWhxgdZ9EHDDMbmUz9AbxpvbYN54xt_4LzTwYJd4PdHmueDBCmryWsWBXfjE-Jpw_Cfrf6WyAw$
> .
> > >
> > > > You do:
> > > >  converted.intoMemorySegment(MemorySegment.ofArray(control), offset,
> ByteOrder.nativeOrder());
> > > >
> > > > Can you just do:
> > > >
> > > >  converted.intoArray(control, offset);
> > >
> > >
> > > I did so because I found that Vector<Byte> actually does not have that
> method.
> >
> > Ah, yes. There could be an perf issue with memory segment access,
> although since you had to wrap the array in a segment there will be some
> cost to that. It’s like if you wrapped the control array in a segment and
> stored in a field it would work better.
> >
> >
> > > After your suggestion, I switched to use ByteVector instead by
> Vector<Byte>. Surprisingly, this time the hashmap delivers a better
> performance. It 2~3 times faster during the insertion procedure.
> >
> > Good!
> >
> >
> > > However, there was still a performance gap behind the standard hashmap
> during finding precedure.
> > >
> > > For the ease of discussion, I attach the relevant code here:
> > >
> > > private int findWithHash(long hash, K key) {
> > >     byte h2 = Util.h2(hash); //highest 7 bits
> > >     int position = Util.h1(hash) & bucketMask; // h1 is just long to
> int
> > >     int stride = 0;
> > >     while (true) {
> > >         var mask = matchByte(position, h2).toLong(); // match byte is
> to load a vector of byte and do equality comparison
> > >         while (MaskIterator.hasNext(mask)) {
> > >             var bit = MaskIterator.getNext(mask);
> > >             mask = MaskIterator.moveNext(mask);
> > >             var index = (position + bit) & bucketMask;
> > >             if (key.equals(keys[index])) return index;
> > >         }
> > >
> > >         if (matchEmpty(position).anyTrue()) {
> > >             return -1;
> > >         }
> > >
> > >         stride += VECTOR_LENGTH;
> > >         position = (position + stride) & bucketMask;
> > >     }
> > > }
> > > From Intellij IDEA's profiler, it seems that a large portion of time
> is spent on building the vectormask. I see there is an underlying bTest
> operation converting the results to boolean array and then give the mask.
> Will this be internally optimized to a single movemask operation by JVM?
> > >
> >
> > Can you get an inline/compilation trace like you did for insert?
> >
> > The VectorMask.toLong method is an intrinsic method.
> >
> > Try:
> >
> >   var vmask = matchByte(position, h2);
> >   var mask = mask.toLong();
> >
> > Probably will not make any difference, but if the findIInsertSlot
> performed ok operating on the mask returned from matchEmptyOrDelete it
> points to an issue with VectorMask.toLong.
> >
> > Paul.
> >
> > >
> > > <Outlook-ejiaczyb.png>
> > > Schrodinger ZHU Yifan, Ph.D. Student
> > > Computer Science Department, University of Rochester
> > >
> > > Personal Email: i at zhuyi.fan
> > > Work Email: yifanzhu at rochester.edu
> > > Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> > > Github: SchrodingerZhu
> > > GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
> > >
> > > <Outlook-3nrq0klq.svg>
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 44468 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/image-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-1nxecacu.png
Type: image/png
Size: 19680 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/Outlook-1nxecacu-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-v4r1y0a4
Type: image/svg+xml
Size: 41432 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230108/a656b40f/Outlook-v4r1y0a4-0001.svg>