Some questions on intrinsic for UTF8 to UTF16 decoding
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Nov 23 17:01:35 UTC 2020
On 11/23/20 5:20 AM, Ludovic Henry wrote:
> Hi Vladimir,
>
> >> The simplest solution is a new int[] array which holds the local variable values, which the intrinsic can update and return.
>
> Good idea, I'll try that and report back.
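
A minimal sketch of the int[] idea quoted above, written in plain Java rather than as a real HotSpot intrinsic. The class name `StateArrayDecodeSketch`, the method names, and the `SP`/`DP` slot layout are all hypothetical; `fastPath` only stands in for a method that an intrinsic could replace. It copies ASCII bytes until it meets something it does not handle, writes the updated positions back into the array, and the caller falls back to a scalar step from there. The fallback in this sketch is an assumption too: it handles well-formed 2-byte sequences only, far less than the real decodeArrayLoop().

```java
import java.nio.charset.StandardCharsets;

final class StateArrayDecodeSketch {
    // Hypothetical slot layout for the state array shared with the fast path.
    static final int SP = 0; // source position
    static final int DP = 1; // destination position

    // Stand-in for an intrinsic candidate: copies ASCII bytes only and stops at
    // the first byte with the high bit set, or when either array is exhausted.
    // The updated positions are returned through the state array.
    static void fastPath(byte[] sa, int sl, char[] da, int dl, int[] state) {
        int sp = state[SP];
        int dp = state[DP];
        while (sp < sl && dp < dl && sa[sp] >= 0) {
            da[dp++] = (char) sa[sp++];
        }
        state[SP] = sp;
        state[DP] = dp;
    }

    // Runs the fast path, and falls back to a scalar step whenever it backs off.
    // Returns the number of chars written.
    static int decode(byte[] sa, char[] da) {
        int[] state = {0, 0};
        while (state[SP] < sa.length && state[DP] < da.length) {
            fastPath(sa, sa.length, da, da.length, state);
            int sp = state[SP];
            int dp = state[DP];
            if (sp >= sa.length || dp >= da.length) {
                break;
            }
            // Fallback: only well-formed 2-byte sequences are handled here.
            int b1 = sa[sp];
            if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0 && sp + 1 < sa.length
                    && (sa[sp + 1] & 0xc0) == 0x80) {
                da[dp] = (char) (((b1 & 0x1f) << 6) | (sa[sp + 1] & 0x3f));
                state[SP] = sp + 2;
                state[DP] = dp + 1;
            } else {
                throw new IllegalArgumentException("unsupported or malformed input at " + sp);
            }
        }
        return state[DP];
    }

    public static void main(String[] args) {
        byte[] sa = "H\u00e9llo World!".getBytes(StandardCharsets.UTF_8);
        char[] da = new char[sa.length];
        int n = decode(sa, da);
        System.out.println(new String(da, 0, n));
    }
}
```

The design trade-off is one extra array load and store per call in exchange for a method signature that can return several updated values at once.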
>
> >> Another solution (with more complex code in library_call.cpp) is to pass 'src' and 'dst' to the intrinsic and read/update their fields inside:
>
> I was thinking about that, but I'm not sure what it would look like. Could you just please point me to a place where something similar is already done?
There is a method in library_call.cpp to load fields: load_field_from_object(). It can be used to load fields from
'ByteBuffer src' and 'CharBuffer dst' to calculate the start and length, as is done at the beginning of decodeArrayLoop().
Unfortunately, we don't have similar methods for stores. You can look at the code in
LibraryCallKit::inline_unsafe_access() to create one - this is why I said it is more complicated.
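
For reference, a rough Java-level sketch of the values involved, assuming heap-backed buffers (hasArray() is true). It uses only the public java.nio accessors; the class and method names `BufferFieldsSketch`, `bounds` and `updatePositions` are hypothetical. `bounds` covers the loads (what would be read from the buffer fields) and `updatePositions` covers the stores (what would have to be written back). It does not show the library_call.cpp side at all.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class BufferFieldsSketch {
    // Load side: start and end indices into each backing array,
    // returned as {sp, sl, dp, dl}.
    static int[] bounds(ByteBuffer src, CharBuffer dst) {
        int sp = src.arrayOffset() + src.position();
        int sl = src.arrayOffset() + src.limit();
        int dp = dst.arrayOffset() + dst.position();
        int dl = dst.arrayOffset() + dst.limit();
        return new int[] {sp, sl, dp, dl};
    }

    // Store side: write the advanced array indices back as buffer positions.
    static void updatePositions(ByteBuffer src, int sp, CharBuffer dst, int dp) {
        src.position(sp - src.arrayOffset());
        dst.position(dp - dst.arrayOffset());
    }

    public static void main(String[] args) {
        // slice() yields a buffer with a non-zero arrayOffset().
        ByteBuffer src = ByteBuffer.wrap(new byte[16], 4, 8).slice();
        CharBuffer dst = CharBuffer.wrap(new char[10]);
        int[] b = bounds(src, dst);
        System.out.println(b[0] + " " + b[1] + " " + b[2] + " " + b[3]);
    }
}
```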
>
> >> Yes, 2Mb is too much. And the problem is not the size but the effect on startup time - it is calculated dynamically.
>
> I agree, the effect on startup time would be the biggest concern. I was thinking over the weekend about whether I could generate this data statically at compile-time (with something along the lines of template metaprogramming).
>
> More generally, what is the thought process in this kind of optimization, like what metrics do we favor? I am going to spend some time on this intrinsic this week to get some numbers. I'll come back to you on this discussion.
It is all about cost vs. benefit. If we can get, say, > 10% performance improvement across a broad set of applications,
then we can accept an increase in memory size and an effect on startup. But if the improvement is in the range of < 2%, it
is hard to accept. That is why we always first investigate which Java methods can be intrinsified with simple changes to
get big improvements which will help a lot of applications. We also look at whether we can add a new general optimization
to the JIT compiler instead, and get an improvement not only in one particular method but in others too.
That is why we ask to see performance numbers first, before we accept such new code.
>
> >> Another issue I see with such an intrinsic is that the decodeArrayLoop() code has a lot of checks for malformed strings which the intrinsic does not have. Most likely it will not pass JCK testing.
>
> Yes, that is simply because I didn't implement it yet. The plan is to decode as much as possible in the intrinsic, and if it runs into any case that it doesn't know how to handle, it backs off and falls back to the current implementation.
>
> >> It would be interesting to see the performance if you vectorize only the ASCII copy loop, which seems the most common case and does not need a table:
>
> That could definitely be possible. However, I'm not sure whether the ASCII-only loop is really the most common case. Moreover, the way the intrinsic is done here, if at any point a non-ASCII character is encountered, it backs off and falls back to the loop-based algorithm. We could however mix and match the ASCII-only case with the more general loop with something along the lines of:
Did you verify that it is auto-vectorized? Manually unrolled Java loops are not good for JIT compilation. I am not
surprised you don't see any benefit from it.
I was thinking of making it a separate method and writing an intrinsic for it with vector compare instructions. Or use
the Vector API as you suggested.
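
A minimal sketch of what such a separate method could look like on the Java side, assuming the name `copyAscii` and this particular signature (both hypothetical). It is deliberately a plain counted loop with no manual unrolling, so it is a small, self-contained candidate for an intrinsic. No claim is made here about whether C2 auto-vectorizes it as written.

```java
import java.nio.charset.StandardCharsets;

final class AsciiCopySketch {
    // Copies up to len bytes from sa[sp..] to da[dp..] while they are ASCII.
    // Returns the number of elements copied, which is less than len if a
    // byte with the high bit set was found.
    static int copyAscii(byte[] sa, int sp, char[] da, int dp, int len) {
        for (int i = 0; i < len; i++) {
            byte b = sa[sp + i];
            if (b < 0) {
                return i; // non-ASCII: let the caller take the general path
            }
            da[dp + i] = (char) b;
        }
        return len;
    }

    public static void main(String[] args) {
        byte[] sa = "abc\u00e9def".getBytes(StandardCharsets.UTF_8);
        char[] da = new char[sa.length];
        int n = copyAscii(sa, 0, da, 0, sa.length);
        System.out.println(n + " ASCII bytes copied: " + new String(da, 0, n));
    }
}
```

The caller would advance sp and dp by the returned count and continue in the general decoding loop from the first non-ASCII byte.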
>
> ```
> while (sp < sl) {
>     if (sp + 7 < sl && dp + 7 < dl) {
>         long b = unsafe.getLong(sa, Unsafe.ARRAY_BYTE_BASE_OFFSET + sp * Unsafe.ARRAY_BYTE_INDEX_SCALE);
>         // All 8 bytes are ASCII only if none of them has its high bit set.
>         if ((b & 0x8080808080808080L) == 0) {
>             da[dp + 0] = (char)sa[sp+0];
>             da[dp + 1] = (char)sa[sp+1];
>             da[dp + 2] = (char)sa[sp+2];
>             da[dp + 3] = (char)sa[sp+3];
>             da[dp + 4] = (char)sa[sp+4];
>             da[dp + 5] = (char)sa[sp+5];
>             da[dp + 6] = (char)sa[sp+6];
>             da[dp + 7] = (char)sa[sp+7];
>             sp += 8;
>             dp += 8;
>             continue;
>         }
>     }
>     int b1 = sa[sp];
>     if (b1 >= 0) {
>         ...
> ```
>
> From preliminary results, the gain is not substantial on ASCII-only text, and is minimal on `"Héllo World!".repeat(1000 * 1000)`.
I would bet that for this case your intrinsic will not show a benefit either - the string is too short.
Thanks,
Vladimir
>
> I'll check whether we can get some gains with the Vector API on that case.
>
> Thank you for your feedback.
> Ludovic
>
>