[aarch64-port-dev ] [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives
Stuart Monteith
stuart.monteith at linaro.org
Mon Aug 14 14:03:52 UTC 2017
Hello,
Links to the JMH results are below. The graphs show performance relative
to the "steam" method as compiled by C2. There is an improvement on all
platforms at every size except 100,000 bytes, where there is none, but
arrays that large are an unlikely circumstance.
http://people.linaro.org/~stuart.monteith/hasneg-last/hasnegA-last.svg
http://people.linaro.org/~stuart.monteith/hasneg-last/hasnegB-last.svg
http://people.linaro.org/~stuart.monteith/hasneg-last/hasnegC-last.svg
BR,
Stuart
On 14 August 2017 at 11:47, Stuart Monteith <stuart.monteith at linaro.org> wrote:
> Thanks Dmitrij,
> I'll look at what you've done and try your patch on my machines.
>
> BR,
> Stuart
>
> On 11 August 2017 at 18:30, Dmitrij Pochepko
> <dmitrij.pochepko at bell-sw.com> wrote:
>> Hi,
>>
>> please review a new version of this RFR [1] which is significantly
>> re-worked.
>>
>>
>> Changes compared to original posting:
>>
>> - The 2 versions of the hasNegatives intrinsic were merged, which results in
>> good performance for both small and large arrays.
>>
>> - The large-array case and the "at-the-end-of-mem-page" case were moved to a
>> stub to save code cache and help the register allocator.
>>
>>
>> Raw performance numbers for the original hasNegativesBench.loopingFastMethod
>> [2] are here[3] and accompanied by updated comparison charts for Raspberry
>> Pi 3 [4] and ThunderX T88 [5]. In short, the intrinsified hasNegatives is 4x
>> faster on T88 and 2.5x faster on R-Pi for a 31-byte array, and up to 8 times
>> faster on large arrays.
>>
>> I've also created a small and simple benchmark [6] which demonstrates the
>> performance difference in the String constructor for strings without
>> negative byte values. Raw results [7] show significantly increased
>> performance on ThunderX T88. Results can also be seen in the comparison
>> charts [8]. Due to the large amount of allocation and GC, this benchmark is
>> not applicable to R-Pi, which has 1GB of system memory and an SD card as its
>> main drive.
>>
>>
>> This patch should be considered a patch with 2 contributors
>> (stuart.monteith at linaro.org and dmitrij.pochepko at bell-sw.com (openjdk login
>> dpochepk)). I'd also like to thank Andrew Haley for early reviews and
>> advice.
>>
>> No regressions were found via jtreg tests.
>>
>> Thanks,
>>
>> Dmitrij
>>
>>
>> [1] Webrev: http://cr.openjdk.java.net/~dpochepk/8184943/webrev.02/
>> [2] http://cr.openjdk.java.net/~aph/HasNegativesBench/
>> [3] http://cr.openjdk.java.net/~dpochepk/8184943/perf_numbers.txt
>> [4] http://cr.openjdk.java.net/~dpochepk/8184943/Cortex_A53_comparison.png
>> [5] http://cr.openjdk.java.net/~dpochepk/8184943/ThunderX_comparison.png
>> [6] http://cr.openjdk.java.net/~dpochepk/8184943/StringConstructorBench.java
>> [7] http://cr.openjdk.java.net/~dpochepk/8184943/StringConstructorBench.txt
>> [8] http://cr.openjdk.java.net/~dpochepk/8184943/ThunderX-StringConstructor.png
>>
>>
>> On 21.07.2017 11:26, Andrew Haley wrote:
>>>
>>> On 20/07/17 19:27, Dmitrij Pochepko wrote:
>>>>
>>>> Probably the best way would be to merge the large data loads from my patch
>>>> with Stuart's lightning-fast small-array handling.
>>>
>>> Yes.
>>>
>>>> I'll be happy to merge these ideas into one intrinsic that works fastest
>>>> on small and large arrays, if Stuart does not mind. I could use some help
>>>> testing the final solution on some of the HW we don't have. I don't mind
>>>> if Stuart wants to merge it; then we'll help him with testing on h/w he
>>>> doesn't have.
>>>
>>> Have fun! The performance to care about is small strings (< 31 bytes) and,
>>> less commonly, very long ones. Super-fast handling of small strings is
>>> very important.
>>>
>>
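
For reference, the operation under discussion reports whether any byte in a
range has its sign bit set. The sketch below is only an illustration in plain
Java, not the AArch64 code in the webrev: the class and method names
(HasNegativesSketch, hasNegativesScalar, hasNegativesWide) are hypothetical,
and the 8-bytes-per-load loop is an assumption chosen to show why testing
several bytes at once helps on larger arrays.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HasNegativesSketch {
    // One bit per byte lane: the sign bit of each of the 8 bytes in a long.
    static final long SIGN_BITS = 0x8080808080808080L;

    // Reference semantics: true if any byte in ba[off, off+len) is negative.
    static boolean hasNegativesScalar(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    // Illustrative wide variant (an assumption, not the patch): test 8 bytes
    // per load with a mask, then finish the tail one byte at a time.
    static boolean hasNegativesWide(byte[] ba, int off, int len) {
        ByteBuffer buf = ByteBuffer.wrap(ba, off, len).order(ByteOrder.nativeOrder());
        while (buf.remaining() >= 8) {
            // Byte order does not matter here: the mask covers every lane.
            if ((buf.getLong() & SIGN_BITS) != 0) {
                return true;
            }
        }
        while (buf.hasRemaining()) {
            if (buf.get() < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] small = new byte[31];          // all zero, so no negatives
        byte[] large = new byte[100_000];
        large[large.length - 1] = (byte) 0x80; // single negative byte at the end

        System.out.println(hasNegativesScalar(small, 0, small.length));
        System.out.println(hasNegativesWide(small, 0, small.length));
        System.out.println(hasNegativesScalar(large, 0, large.length));
        System.out.println(hasNegativesWide(large, 0, large.length));
    }
}
```

The two methods should agree on every input; the wide one simply does fewer
loads per byte, which is the kind of effect the large-array numbers above are
measuring.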