Using x86 pause instr in SpinPause
Peter Levart
peter.levart at marand.si
Wed Aug 29 09:43:21 PDT 2012
I played with the PAUSE instruction a little on my i7 machine.
What I found is that on the i7 (4 cores, 2 threads per core) the wait loop spins about 3-4 times faster with NOP than with PAUSE, although all 4 CPU cores run at maximum frequency in both cases.
It also seems that a loop using PAUSE spins at a much more "constant" speed, so the lock becomes fairer than the same lock using NOP in the spin loop.
With a high number of threads, efficiency is also better when using PAUSE in the loop instead of just NOP.
Here's how I tested:
I tried to mimic HotSpot's Thread::SpinAcquire/SpinRelease, so I lifted some code fragments from the HotSpot sources. Here's a simplified SpinLock.c implementation that does not resort to park after 5 yields but continues to yield every 4096 spins indefinitely:
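#include <sched.h>

// Full memory barrier. "static" keeps the helper linkable when the
// compiler decides not to inline it (C99 inline semantics).
static inline void fence() {
    // always use locked addl since mfence is sometimes expensive
#if defined(AMD64) || defined(__x86_64__)
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
}

// Atomic compare-and-swap: if *dest == compare_value, store exchange_value.
// Returns the value that was in *dest before the operation.
static inline int cmpxchg(int exchange_value, volatile int* dest, int compare_value) {
    __asm__ volatile ("lock; cmpxchgl %1,(%3)"
                      : "=a" (exchange_value)
                      : "r" (exchange_value), "a" (compare_value), "r" (dest)
                      : "cc", "memory");
    return exchange_value;
}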

int SpinPause () ;

// Returns the number of sched_yield() calls made while waiting.
int SpinAcquire (volatile int * adr) {
    if (cmpxchg (1, adr, 0) == 0) {
        return 0;   // normal fast-path return
    }

    // Slow-path : We've encountered contention -- Spin/Yield strategy.
    int ctr = 0 ;
    int Yields = 0 ;
    for (;;) {
        while (*adr != 0) {
            ++ctr ;
            if ((ctr & 0xFFF) == 0) {
                sched_yield() ;     // every 4096th iteration
                ++Yields ;
            } else {
                SpinPause() ;
            }
        }
        if (cmpxchg (1, adr, 0) == 0) return Yields;
    }
}

void SpinRelease (volatile int * adr) {
    fence() ;       // guarantee at least release consistency.
    // Roach-motel semantics.
    // It's safe if subsequent LDs and STs float "up" into the critical section,
    // but prior LDs and STs within the critical section can't be allowed
    // to reorder or float past the ST that releases the lock.
    *adr = 0 ;
}

... and SpinPause.s is either the NOP or the PAUSE variant (the PAUSE variant, encoded as rep; nop, is shown):
        .globl SpinPause
        .align 16
        .type  SpinPause,@function
SpinPause:
        rep
        nop
        movq   $1, %rax
        ret
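
For comparison, the same spin/yield lock can be sketched with C11 atomics and a compiler builtin instead of hand-written inline assembly and a separate SpinPause.s. This is not the HotSpot code and not what I measured above. The names spin_pause, spin_acquire and spin_release are made up for illustration, and the sketch assumes a GCC- or Clang-compatible compiler (for __builtin_ia32_pause); elsewhere it falls back to doing nothing:

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical inline stand-in for SpinPause.s (assumption, not HotSpot code). */
static inline int spin_pause(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_ia32_pause();     /* emits PAUSE, i.e. rep; nop */
    return 1;
#else
    return 0;                   /* no pause hint available on this target */
#endif
}

/* Spin/yield acquire: returns the number of sched_yield() calls made. */
static int spin_acquire(atomic_int *lock) {
    int ctr = 0;
    int yields = 0;
    for (;;) {
        int expected = 0;
        /* Try to change 0 -> 1; acquire ordering on success. */
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed)) {
            return yields;
        }
        /* Wait with plain loads until the lock looks free again. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0) {
            if ((++ctr & 0xFFF) == 0) {
                sched_yield();  /* every 4096th iteration */
                ++yields;
            } else {
                spin_pause();
            }
        }
    }
}

/* Release store: earlier loads/stores may not float past it. */
static void spin_release(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

To get the NOP variant for comparison, the body of spin_pause would be replaced with a plain "nop" (or nothing at all), leaving the rest unchanged.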
The benchmark is designed around a relatively small "critical" section of code that is guarded by a single spin/yield lock and executed repeatedly by multiple threads.
The size of the critical section is tuned so that, with PAUSE in the spin loop and 10 concurrent threads, each lock acquire takes on average 3-4 yields, with 4096 spins between yields, before returning.
Each thread executes a constant number of acquires and critical sections, so the amount of "useful work" per thread is constant.
The number of threads is then varied (10, 30, 100), as is the type of spin-lock (NOP / PAUSE).
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include "SpinLock.h"

#define THREADS 10
#define OUTER_LOOP_SIZE 100000
#define WORK_LOOP_SIZE 100

void *runnable( void *ptr );

int main(int argc, char *argv[])
{
    pthread_t thread[THREADS];
    int yields[THREADS];
    int i;

    // start all threads; each one reports its yield count through yields[i]
    for (i = 0; i < THREADS; i++)
    {
        yields[i] = 0;
        pthread_create( &thread[i], NULL, runnable, (void*) &yields[i]);
    }

    for (i = 0; i < THREADS; i++)
        pthread_join( thread[i], NULL);

    printf("Acquires per thread: %d\n", OUTER_LOOP_SIZE);
    for (i = 0; i < THREADS; i++)
        printf("Thread %d total yields: %d\n", i, yields[i]);

    return 0;
}
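
int lock = 0;

void *runnable( void *ptr )
{
    int *thread_yields;
    thread_yields = (int *) ptr;
    int yields = 0;
    int i, j;
    unsigned sum = 0;   // initialized; unsigned so wrap-around is well defined

    for (i = 0; i < OUTER_LOOP_SIZE; i++)
    {
        yields += SpinAcquire(&lock);
        // the "critical" section: a fixed amount of busy work
        for (j = 0; j < WORK_LOOP_SIZE; j++)
        {
            int k;
            for (k = 0; k < 100; k++)
                sum += (unsigned) yields;
        }
        SpinRelease(&lock);
    }

    // sum cancels out; it is only mentioned so the work loop has a use
    *thread_yields = (int) (sum + (unsigned) yields - sum);
    return NULL;
}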
Here are the results:
10 threads
----------
NOP:
[peter at peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 1569102
Thread 1 total yields: 1240318
Thread 2 total yields: 1339807
Thread 3 total yields: 1529634
Thread 4 total yields: 1215736
Thread 5 total yields: 1400302
Thread 6 total yields: 1501579
Thread 7 total yields: 1066652
Thread 8 total yields: 699626
Thread 9 total yields: 1295480
real 0m18.428s
user 2m15.209s
sys 0m4.084s
PAUSE:
[peter at peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 151316
Thread 1 total yields: 285822
Thread 2 total yields: 321386
Thread 3 total yields: 493483
Thread 4 total yields: 306425
Thread 5 total yields: 420804
Thread 6 total yields: 330618
Thread 7 total yields: 176552
Thread 8 total yields: 434332
Thread 9 total yields: 474757
real 0m17.761s
user 1m58.394s
sys 0m0.936s
30 threads
----------
NOP:
[peter at peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 1516779
Thread 1 total yields: 2007878
Thread 2 total yields: 1027554
Thread 3 total yields: 2085868
Thread 4 total yields: 1152372
Thread 5 total yields: 575832
Thread 6 total yields: 1742543
Thread 7 total yields: 1083592
Thread 8 total yields: 1522944
Thread 9 total yields: 1906178
Thread 10 total yields: 1127148
Thread 11 total yields: 1872796
Thread 12 total yields: 951841
Thread 13 total yields: 1997159
Thread 14 total yields: 1347902
Thread 15 total yields: 1490622
Thread 16 total yields: 1836486
Thread 17 total yields: 2037498
Thread 18 total yields: 1648107
Thread 19 total yields: 1415434
Thread 20 total yields: 2044485
Thread 21 total yields: 644851
Thread 22 total yields: 599960
Thread 23 total yields: 1328082
Thread 24 total yields: 612360
Thread 25 total yields: 640560
Thread 26 total yields: 541422
Thread 27 total yields: 2018600
Thread 28 total yields: 605493
Thread 29 total yields: 1994882
real 0m57.296s
user 7m12.540s
sys 0m18.230s
PAUSE:
[peter at peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 470464
Thread 1 total yields: 361301
Thread 2 total yields: 349293
Thread 3 total yields: 364812
Thread 4 total yields: 360606
Thread 5 total yields: 517756
Thread 6 total yields: 421567
Thread 7 total yields: 301124
Thread 8 total yields: 555139
Thread 9 total yields: 481553
Thread 10 total yields: 634827
Thread 11 total yields: 344431
Thread 12 total yields: 308426
Thread 13 total yields: 405738
Thread 14 total yields: 349320
Thread 15 total yields: 563739
Thread 16 total yields: 301356
Thread 17 total yields: 350707
Thread 18 total yields: 329854
Thread 19 total yields: 471155
Thread 20 total yields: 380747
Thread 21 total yields: 533402
Thread 22 total yields: 638196
Thread 23 total yields: 628245
Thread 24 total yields: 342036
Thread 25 total yields: 356593
Thread 26 total yields: 351789
Thread 27 total yields: 300021
Thread 28 total yields: 290741
Thread 29 total yields: 484845
real 0m56.744s
user 7m12.831s
sys 0m5.127s
100 threads
-----------
NOP:
[peter at peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 715692
Thread 1 total yields: 3956169
Thread 2 total yields: 1043740
Thread 3 total yields: 1029961
Thread 4 total yields: 1499409
Thread 5 total yields: 2012256
Thread 6 total yields: 2683384
Thread 7 total yields: 3229792
Thread 8 total yields: 2053604
Thread 9 total yields: 537083
Thread 10 total yields: 3068300
Thread 11 total yields: 485165
Thread 12 total yields: 1294566
Thread 13 total yields: 477668
Thread 14 total yields: 1935177
Thread 15 total yields: 1320217
Thread 16 total yields: 3630811
Thread 17 total yields: 2002443
Thread 18 total yields: 2573688
Thread 19 total yields: 656375
Thread 20 total yields: 723496
Thread 21 total yields: 2867001
Thread 22 total yields: 3462940
Thread 23 total yields: 2638588
Thread 24 total yields: 3591878
Thread 25 total yields: 674437
Thread 26 total yields: 1767265
Thread 27 total yields: 3254028
Thread 28 total yields: 1148442
Thread 29 total yields: 3232056
Thread 30 total yields: 1710429
Thread 31 total yields: 487365
Thread 32 total yields: 475716
Thread 33 total yields: 672629
Thread 34 total yields: 2235400
Thread 35 total yields: 1073231
Thread 36 total yields: 1564212
Thread 37 total yields: 1232321
Thread 38 total yields: 1668370
Thread 39 total yields: 3926584
Thread 40 total yields: 3639128
Thread 41 total yields: 2135553
Thread 42 total yields: 2410193
Thread 43 total yields: 465033
Thread 44 total yields: 2267986
Thread 45 total yields: 2556756
Thread 46 total yields: 1233673
Thread 47 total yields: 2296487
Thread 48 total yields: 1569566
Thread 49 total yields: 2087966
Thread 50 total yields: 1141489
Thread 51 total yields: 2895012
Thread 52 total yields: 1318840
Thread 53 total yields: 463860
Thread 54 total yields: 3063963
Thread 55 total yields: 1602932
Thread 56 total yields: 2998556
Thread 57 total yields: 3052040
Thread 58 total yields: 2936234
Thread 59 total yields: 1608002
Thread 60 total yields: 1947606
Thread 61 total yields: 3650902
Thread 62 total yields: 481011
Thread 63 total yields: 2946211
Thread 64 total yields: 2657741
Thread 65 total yields: 1854059
Thread 66 total yields: 612458
Thread 67 total yields: 3858631
Thread 68 total yields: 3645990
Thread 69 total yields: 2916354
Thread 70 total yields: 1587217
Thread 71 total yields: 625513
Thread 72 total yields: 810526
Thread 73 total yields: 3230899
Thread 74 total yields: 3117595
Thread 75 total yields: 680967
Thread 76 total yields: 1925092
Thread 77 total yields: 2205682
Thread 78 total yields: 2669335
Thread 79 total yields: 699507
Thread 80 total yields: 462614
Thread 81 total yields: 1108081
Thread 82 total yields: 998706
Thread 83 total yields: 1625907
Thread 84 total yields: 1364484
Thread 85 total yields: 2698464
Thread 86 total yields: 1132631
Thread 87 total yields: 1272493
Thread 88 total yields: 544296
Thread 89 total yields: 642514
Thread 90 total yields: 1659716
Thread 91 total yields: 3657423
Thread 92 total yields: 1152010
Thread 93 total yields: 864437
Thread 94 total yields: 1914716
Thread 95 total yields: 665765
Thread 96 total yields: 470625
Thread 97 total yields: 1515056
Thread 98 total yields: 1694343
Thread 99 total yields: 656651
real 4m9.802s
user 31m31.862s
sys 1m30.147s
PAUSE:
[peter at peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 423635
Thread 1 total yields: 432373
Thread 2 total yields: 525277
Thread 3 total yields: 403625
Thread 4 total yields: 435605
Thread 5 total yields: 535598
Thread 6 total yields: 508091
Thread 7 total yields: 548162
Thread 8 total yields: 532789
Thread 9 total yields: 769968
Thread 10 total yields: 442565
Thread 11 total yields: 638749
Thread 12 total yields: 488601
Thread 13 total yields: 579624
Thread 14 total yields: 653364
Thread 15 total yields: 437042
Thread 16 total yields: 515524
Thread 17 total yields: 502716
Thread 18 total yields: 445556
Thread 19 total yields: 569855
Thread 20 total yields: 436675
Thread 21 total yields: 434850
Thread 22 total yields: 409440
Thread 23 total yields: 733155
Thread 24 total yields: 708836
Thread 25 total yields: 504327
Thread 26 total yields: 501003
Thread 27 total yields: 772928
Thread 28 total yields: 404170
Thread 29 total yields: 564273
Thread 30 total yields: 447199
Thread 31 total yields: 518266
Thread 32 total yields: 751953
Thread 33 total yields: 528898
Thread 34 total yields: 469453
Thread 35 total yields: 443822
Thread 36 total yields: 453571
Thread 37 total yields: 483523
Thread 38 total yields: 673307
Thread 39 total yields: 419745
Thread 40 total yields: 420812
Thread 41 total yields: 579195
Thread 42 total yields: 534738
Thread 43 total yields: 558074
Thread 44 total yields: 404649
Thread 45 total yields: 690615
Thread 46 total yields: 457234
Thread 47 total yields: 623036
Thread 48 total yields: 700575
Thread 49 total yields: 608860
Thread 50 total yields: 405334
Thread 51 total yields: 577808
Thread 52 total yields: 449998
Thread 53 total yields: 473125
Thread 54 total yields: 558360
Thread 55 total yields: 406760
Thread 56 total yields: 621827
Thread 57 total yields: 456095
Thread 58 total yields: 700446
Thread 59 total yields: 696581
Thread 60 total yields: 657749
Thread 61 total yields: 771747
Thread 62 total yields: 425028
Thread 63 total yields: 416165
Thread 64 total yields: 416922
Thread 65 total yields: 748436
Thread 66 total yields: 710466
Thread 67 total yields: 431879
Thread 68 total yields: 407904
Thread 69 total yields: 516825
Thread 70 total yields: 404993
Thread 71 total yields: 432439
Thread 72 total yields: 762656
Thread 73 total yields: 512795
Thread 74 total yields: 443227
Thread 75 total yields: 627807
Thread 76 total yields: 506496
Thread 77 total yields: 550415
Thread 78 total yields: 420525
Thread 79 total yields: 547715
Thread 80 total yields: 693714
Thread 81 total yields: 708321
Thread 82 total yields: 438307
Thread 83 total yields: 400358
Thread 84 total yields: 555737
Thread 85 total yields: 555062
Thread 86 total yields: 518418
Thread 87 total yields: 448384
Thread 88 total yields: 776252
Thread 89 total yields: 584543
Thread 90 total yields: 578234
Thread 91 total yields: 426361
Thread 92 total yields: 572331
Thread 93 total yields: 511610
Thread 94 total yields: 573504
Thread 95 total yields: 596209
Thread 96 total yields: 462665
Thread 97 total yields: 455619
Thread 98 total yields: 670371
Thread 99 total yields: 634330
real 3m50.676s
user 30m8.317s
sys 0m24.138s
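With 100 threads the difference starts to show: the PAUSE run finishes in 3m50.7s versus 4m9.8s for NOP, system time drops from 1m30s to 24s, and the per-thread yield counts are spread over a much narrower range.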
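
To compare the listings above without reading them by eye, the per-thread counts can be reduced to a few numbers. The following is only a sketch: struct yield_stats and summarize_yields are hypothetical names, the input is assumed to be the "total yields" values of one run together with the number of acquires per thread (OUTER_LOOP_SIZE, 100000 here), and at least one value is required:

```c
#include <stddef.h>

/* Summary of one benchmark run (hypothetical helper, not part of the test). */
struct yield_stats {
    long   min;               /* smallest per-thread yield count */
    long   max;               /* largest per-thread yield count  */
    double mean_per_acquire;  /* average yields per lock acquire */
};

/* yields: per-thread totals, n: number of threads (n >= 1),
 * acquires: lock acquires performed by each thread. */
static struct yield_stats summarize_yields(const long *yields, size_t n,
                                           long acquires) {
    struct yield_stats s = { yields[0], yields[0], 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (yields[i] < s.min) s.min = yields[i];
        if (yields[i] > s.max) s.max = yields[i];
        total += (double) yields[i];
    }
    s.mean_per_acquire = total / ((double) n * (double) acquires);
    return s;
}
```

The ratio max/min gives a rough idea of how evenly the lock treats the threads (closer to 1 is fairer), and mean_per_acquire can be checked against the 3-4 yields per acquire the critical section was tuned for.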
Regards, Peter
On Wednesday, August 29, 2012 08:31:23 AM Vitaly Davidovich wrote:
By the way, only thing I can think of as being a possible issue is the pause instruction killing (or reducing) out of order/pipelined execution of the rest of the loop body. But then I don't know if this is such an issue as the loop exit will probably have a branch misprediction anyway, killing whatever is in the pipeline.
At any rate, would be interesting to see an explanation.
Cheers
Sent from my phone
On Aug 29, 2012 8:23 AM, "Vitaly Davidovich" <vitalyd at gmail.com> wrote:
I'm actually curious to know if Eric can explain a bit more why pause is an issue here, possibly with some benchmark results.
David's point earlier was that he doesn't think there's benefit to it in the way hotspot spins, but removing pause implies it can actually do harm rather than simply being unhelpful.
I'm also assuming this is not AMD specific but Intel as well?
Thanks
Sent from my phone
On Aug 29, 2012 8:04 AM, "Peter Levart" <peter.levart at marand.si> wrote:
Here's an interesting explanation about the impact of pause instruction in spin-wait loops:
http://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors/
4 years later: It may be that newer CPUs are more clever now.
Regards, Peter
On Tuesday, August 28, 2012 11:36:23 AM Eric Caspole wrote:
> Hi everybody,
> I have made a webrev making a one-line change to remove use of PAUSE
> in linux x64. This will bring linux into sync with windows where
> SpinPause is just "return 0" as Dan indicates below.
>
> http://cr.openjdk.java.net/~ecaspole/nopause/webrev.00/webrev/
>
> We find that it is better not to use PAUSE in this kind of spin
> routine. Apparently someone discovered that on windows x64 years ago.
> Thanks,
> Eric
>
> On Aug 16, 2012, at 5:07 PM, Daniel D. Daugherty wrote:
> > On Win64, SpinPause() has been "return 0" since mid-2005. Way back
> > when Win64 code was in os_win32_amd64.cpp:
> >
> > SCCS/s.os_win32_amd64.cpp:
> >
> > D 1.9.1.1 05/07/04 03:20:45 dice 12 10 00025/00000/00334
> > MRs:
> > COMMENTS:
> > 5030359 -- back-end synchonization improvements - adaptive
> > spinning, etc
> >
> > When the i486 and amd64 cpu dirs were merged back in 2007, the code
> > became like it is below (#ifdef'ed):
> >
> > D 1.32 07/09/17 09:11:33 sgoldman 37 35 00264/00008/00218
> > MRs:
> > COMMENTS:
> > 5108146 Merge i486 and amd64 cpu directories.
> > Macro-ized register names. Inserted amd64 specific code.
> >
> > Looks like on Linux-X64, the code has used the PAUSE instruction
> > since mid-2005:
> >
> > D 1.3 05/07/04 03:14:09 dice 4 3 00031/00000/00353
> > MRs:
> > COMMENTS:
> > 5030359 -- back-end synchonization improvements - adaptive
> > spinning, etc
> >
> > We'll have to see if Dave Dice remembers why he implemented
> > it this way...
> >
> > Dan
> >
> > On 8/16/12 12:01 PM, Eric Caspole wrote:
> >> Hi everybody,
> >> Does anybody know the reason why SpinPause is simply "return 0" on
> >> Win64 but uses PAUSE on Linux in a .s file?
> >> We would like to remove PAUSE from linux too.
> >>
> >> Thanks,
> >> Eric
> >>
> >>
> >> ./src/os_cpu/windows_x86/vm/os_windows_x86.cpp
> >>
> >> extern "C" int SpinPause () {
> >> #ifdef AMD64
> >>   return 0 ;
> >> #else
> >>   // pause == rep:nop
> >>   // On systems that don't support pause a rep:nop
> >>   // is executed as a nop. The rep: prefix is ignored.
> >>   _asm {
> >>     pause ;
> >>   };
> >>   return 1 ;
> >> #endif // AMD64
> >> }
> >>
> >> src/os_cpu/linux_x86/vm/linux_x86_64.s
> >>
> >>         .globl SpinPause
> >>         .align 16
> >>         .type  SpinPause,@function
> >> SpinPause:
> >>         rep
> >>         nop
> >>         movq   $1, %rax
> >>         ret