回复: [EXT] Follow-up results for SwissTable with Vector API

Zhu, Yifan yzhu104 at UR.Rochester.edu
Sat Jan 7 19:08:57 UTC 2023


Highlights of the dropped message:

  1.  Using ByteVector instead of Vector<Byte> gives better performance
  2.  ByteVector::lt(byte) is not optimized for loop invariant values generating a boardcase on each iteration, so I cache a final value for loop, and it improves the performance
  3.  For swisstable on x86, AVX2/AVX512 has a slightly lower peak thrput due to higher latency. (thou the paralism of wider register is better)Si

--------------------------------------------------------------------------------------

I finall get time to run the full benchmark. Here are current results:

Update: As suggested, I added a generic implementation without vector API to do comparison.

JVM info

# VM version: JDK 19.0.1, Java HotSpot(TM) 64-Bit Server VM, 19.0.1+10-21

Server Info

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  128
  On-line CPU(s) list:   0-127
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7773X 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU(s) scaling MHz:  64%
    CPU max MHz:         3527.7339
    CPU min MHz:         1500.0000
    BogoMIPS:            4399.93
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
                         aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skini
                         t wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx sm
                         ap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale
                         vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    768 MiB (8 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-127
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Random Operation (Three hashmap implementations execute a same randomly generated sequence of insert/find/remove operations. The sequence length is 100000):

Benchmark                                           Mode  Cnt    Score    Error  Units
RandomOperations.genericSwissTableRandomOperation  thrpt    5  264.657 ±  2.645  ops/s
RandomOperations.hashMapRandomOperation            thrpt    5  280.280 ± 11.416  ops/s
RandomOperations.swissTableRandomOperation         thrpt    5  304.007 ±  6.635  ops/s

Find Operation (Three hashmap implementations find the same generated data sequence. The string keys are around 190 bytes. SwissTables are using WyHash as hasher. The sequence length is 100000)

Benchmark                                           Mode  Cnt     Score     Error  Units
FindBenchmark.genericSwissTableFindExistingLong    thrpt    5   440.841 ±   2.525  ops/s
FindBenchmark.genericSwissTableFindExistingString  thrpt    5   500.632 ±   9.925  ops/s
FindBenchmark.genericSwissTableFindMissingLong     thrpt    5   852.220 ±   3.159  ops/s
FindBenchmark.genericSwissTableFindMissingString   thrpt    5  1024.492 ±   6.017  ops/s
FindBenchmark.hashTableFindExistingLong            thrpt    5   848.031 ±   5.488  ops/s
FindBenchmark.hashTableFindExistingString          thrpt    5  1247.213 ±  11.066  ops/s
FindBenchmark.hashTableFindMissingLong             thrpt    5   884.276 ±   0.983  ops/s
FindBenchmark.hashTableFindMissingString           thrpt    5  1386.718 ±  72.998  ops/s
FindBenchmark.swissTableFindExistingLong           thrpt    5   717.106 ±   1.653  ops/s
FindBenchmark.swissTableFindExistingString         thrpt    5   689.134 ±  11.573  ops/s
FindBenchmark.swissTableFindMissingLong            thrpt    5  1143.098 ±   5.033  ops/s
FindBenchmark.swissTableFindMissingString          thrpt    5  1562.995 ± 893.966  ops/s

Notice: I am not sure why java.util.HashMap performs better when finding existing keys, is there any specialization when JVM sees that the finding sequence is the same as the insertion sequence?

Insert Operation (Three hashmap implementations insert the same generated data sequence. The string keys are around 190 bytes. For string, all three tables use cached WyHash value (overloaded method). For long integer, the identity hash is used. The sequence length is 100000)

Benchmark                                             Mode  Cnt    Score    Error  Units
InsertionBenchmark.genericSwissTableLongInsertion    thrpt    5  251.994 ±  6.383  ops/s
InsertionBenchmark.genericSwissTableStringInsertion  thrpt    5  242.455 ±  8.091  ops/s
InsertionBenchmark.hashMapLongInsertion              thrpt    5  157.068 ± 10.528  ops/s
InsertionBenchmark.hashMapStringInsertion            thrpt    5  157.860 ±  5.385  ops/s
InsertionBenchmark.swissTableLongInsertion           thrpt    5  281.914 ±  3.051  ops/s
InsertionBenchmark.swissTableStringInsertion         thrpt    5  265.384 ±  2.884  ops/s

Iteration (Iterate through the whole map using iterator)

Benchmark                              Mode  Cnt     Score      Error  Units
Iteration.genericSwissTableIteration  thrpt    5  1615.441 ±   44.650  ops/s
Iteration.hashMapIteration            thrpt    5   562.280 ±    3.662  ops/s
Iteration.swissTableIteration         thrpt    5  2667.606 ± 1454.997  ops/s



[cid:178fa4da-cb40-4f72-ada8-2546493562f3]
Schrodinger ZHU Yifan, Ph.D. Student
Computer Science Department, University of Rochester

Personal Email: i at zhuyi.fan
Work Email: yifanzhu at rochester.edu
Website: https://www.cs.rochester.edu/~yzhu104/Main.html
Github: SchrodingerZhu
GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3

[cid:b8a5c99e-343f-4275-8e2f-1084525d9b41]
________________________________
发件人: Gavin Ray <ray.gavin97 at gmail.com>
发送时间: 2023年1月7日 23:28
收件人: Paul Sandoz <paul.sandoz at oracle.com>
抄送: Zhu, Yifan <yzhu104 at UR.Rochester.edu>; panama-dev at openjdk.org <panama-dev at openjdk.org>
主题: Re: [EXT] Follow-up results for SwissTable with Vector API

Zhu I'm very interested in this discussion, in the event there were mails that were dropped, FWIW
A SwissTable implementation based on Vector intrinsics + FFM API would be super useful for a lot of applications.

This is the history that I see:

[image.png]

On Fri, Jan 6, 2023 at 11:32 AM Paul Sandoz <paul.sandoz at oracle.com<mailto:paul.sandoz at oracle.com>> wrote:
In some further replies I just noticed you dropped the panama-dev email. Resend a summary of the discussion?

Paul.

> On Jan 5, 2023, at 2:16 PM, Zhu, Yifan <yzhu104 at UR.Rochester.edu<mailto:yzhu104 at UR.Rochester.edu>> wrote:
>
> I am confused. It seems that my replies are detached from the mailling list. It that expected?
>
>
> <Outlook-3cjuahvq.png>
> Schrodinger ZHU Yifan, Ph.D. Student
> Computer Science Department, University of Rochester
>
> Personal Email: i at zhuyi.fan
> Work Email: yifanzhu at rochester.edu<mailto:yifanzhu at rochester.edu>
> Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> Github: SchrodingerZhu
> GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
>
> <Outlook-fjqcbbcv.svg>
> 发件人: panama-dev <panama-dev-retn at openjdk.org<mailto:panama-dev-retn at openjdk.org>> 代表 Paul Sandoz <paul.sandoz at oracle.com<mailto:paul.sandoz at oracle.com>>
> 发送时间: 2023年1月6日 0:29
> 收件人: Zhu, Yifan <yzhu104 at UR.Rochester.edu<mailto:yzhu104 at UR.Rochester.edu>>
> 抄送: panama-dev at openjdk.org<mailto:panama-dev at openjdk.org> <panama-dev at openjdk.org<mailto:panama-dev at openjdk.org>>
> 主题: [EXT] Re: Follow-up results for SwissTable with Vector API
>
> Hi,
>
> I saw you sent another email prior to this, but for some reason it got lost by the moderation system. (Since you are not a member of the list the emails need to be moderated and approved.)
>
>
> > On Jan 5, 2023, at 8:09 AM, Zhu, Yifan <yzhu104 at UR.Rochester.edu<mailto:yzhu104 at UR.Rochester.edu>> wrote:
> >
> > This is the following up message for https://urldefense.com/v3/__https://mail.openjdk.org/pipermail/jdk-dev/2023-January/007288.html__;!!CGUSO5OYRnA7CQ!c6-WVSHfkvXgbKEtNWhxgdZ9EHDDMbmUz9AbxpvbYN54xt_4LzTwYJd4PdHmueDBCmryWsWBXfjE-Jpw_Cfrf6WyAw$ .
> >
> > > You do:
> > >  converted.intoMemorySegment(MemorySegment.ofArray(control), offset, ByteOrder.nativeOrder());
> > >
> > > Can you just do:
> > >
> > >  converted.intoArray(control, offset);
> >
> >
> > I did so because I found that Vector<Byte> actually does not have that method.
>
> Ah, yes. There could be an perf issue with memory segment access, although since you had to wrap the array in a segment there will be some cost to that. It’s like if you wrapped the control array in a segment and stored in a field it would work better.
>
>
> > After your suggestion, I switched to use ByteVector instead by Vector<Byte>. Surprisingly, this time the hashmap delivers a better performance. It 2~3 times faster during the insertion procedure.
>
> Good!
>
>
> > However, there was still a performance gap behind the standard hashmap during finding precedure.
> >
> > For the ease of discussion, I attach the relevant code here:
> >
> > private int findWithHash(long hash, K key) {
> >     byte h2 = Util.h2(hash); //highest 7 bits
> >     int position = Util.h1(hash) & bucketMask; // h1 is just long to int
> >     int stride = 0;
> >     while (true) {
> >         var mask = matchByte(position, h2).toLong(); // match byte is to load a vector of byte and do equality comparison
> >         while (MaskIterator.hasNext(mask)) {
> >             var bit = MaskIterator.getNext(mask);
> >             mask = MaskIterator.moveNext(mask);
> >             var index = (position + bit) & bucketMask;
> >             if (key.equals(keys[index])) return index;
> >         }
> >
> >         if (matchEmpty(position).anyTrue()) {
> >             return -1;
> >         }
> >
> >         stride += VECTOR_LENGTH;
> >         position = (position + stride) & bucketMask;
> >     }
> > }
> > From Intellij IDEA's profiler, it seems that a large portion of time is spent on building the vectormask. I see there is an underlying bTest operation converting the results to boolean array and then give the mask. Will this be internally optimized to a single movemask operation by JVM?
> >
>
> Can you get an inline/compilation trace like you did for insert?
>
> The VectorMask.toLong method is an intrinsic method.
>
> Try:
>
>   var vmask = matchByte(position, h2);
>   var mask = mask.toLong();
>
> Probably will not make any difference, but if the findIInsertSlot performed ok operating on the mask returned from matchEmptyOrDelete it points to an issue with VectorMask.toLong.
>
> Paul.
>
> >
> > <Outlook-ejiaczyb.png>
> > Schrodinger ZHU Yifan, Ph.D. Student
> > Computer Science Department, University of Rochester
> >
> > Personal Email: i at zhuyi.fan
> > Work Email: yifanzhu at rochester.edu<mailto:yifanzhu at rochester.edu>
> > Website: https://www.cs.rochester.edu/~yzhu104/Main.html
> > Github: SchrodingerZhu
> > GPG Fingerprint: BA02CBEB8CB5D8181E9368304D2CC545A78DBCC3
> >
> > <Outlook-3nrq0klq.svg>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 44468 bytes
Desc: image.png
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/image-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-1nxecacu.png
Type: image/png
Size: 19680 bytes
Desc: Outlook-1nxecacu.png
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/Outlook-1nxecacu-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-v4r1y0a4
Type: image/svg+xml
Size: 41432 bytes
Desc: Outlook-v4r1y0a4
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230107/83f9e437/Outlook-v4r1y0a4-0001.svg>


More information about the panama-dev mailing list