Some questions on intrinsic for UTF8 to UTF16 decoding
Ludovic Henry
luhenry at microsoft.com
Mon Nov 23 13:20:57 UTC 2020
Hi Vladimir,
> The simpliest solution is a new int[] array which holds locals var values which intrinsic can update and return.
Good idea, I'll try that and report back.
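To make sure I understand the suggestion, here is a minimal sketch of what I have in mind. Everything in it is an assumption made for illustration: the class name, the `SP`/`DP` slot layout, and `decodeAsciiPrefix`, which is a plain Java stand-in for the intrinsic and not existing JDK code.

```java
final class StateArraySketch {
    // Hypothetical slot layout of the shared state array (not from the JDK).
    static final int SP = 0; // next index to read in the source array
    static final int DP = 1; // next index to write in the destination array

    // Stand-in for the intrinsic: copies leading ASCII bytes and reports how
    // far it got by updating state[SP] and state[DP] in place.
    static void decodeAsciiPrefix(byte[] sa, int sl, char[] da, int dl, int[] state) {
        int sp = state[SP];
        int dp = state[DP];
        // Stop at the first non-ASCII byte (negative when read as a Java byte).
        while (sp < sl && dp < dl && sa[sp] >= 0) {
            da[dp++] = (char) sa[sp++];
        }
        state[SP] = sp;
        state[DP] = dp;
    }
}
```

The caller would read `state[SP]` and `state[DP]` back into its own `sp` and `dp` locals after the call and continue with the existing loop from there.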
> An other solution (and more complex code in library_call.cpp) is to pass `src' and 'dst' to intrinsic and read/update their fields inside:
I had considered that, but I'm not sure what it would look like. Could you please point me to a place where something similar is already done?
> Yes, 2Mb is too much. And the problem is not size but affect on startup time - it is calculated dynamically.
I agree, the cost at startup would be the biggest problem. Over the weekend I was wondering whether I could generate this data statically at compile time (with something along the lines of template metaprogramming).
More generally, what is the thought process for this kind of optimization, and which metrics do we favor? I am going to spend some time on this intrinsic this week to get some numbers, and I'll come back to this discussion with them.
> An other issue with such intrinsic I see is that decodeArrayLoop() code has a lot of checks for malformed strings which intrinsic does not have. Most likely it will not pass JCK testing.
Yes, that is simply because I haven't implemented it yet. The plan is to decode as much as possible in the intrinsic and, if it runs into any case that it doesn't know how to handle, to back off and fall back to the current implementation.
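As a rough sketch of that back-off scheme, assuming a fast path that only understands ASCII: `FallbackSketch`, `fastPath`, and `decode` are hypothetical names, and the standard `CharsetDecoder` stands in here for the current `decodeArrayLoop()` implementation, including its malformed-input checks.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class FallbackSketch {
    // Hypothetical fast path: handles ASCII only and returns the number of
    // bytes it consumed. It never reports errors; it just stops early.
    static int fastPath(byte[] sa, int sp, int sl, char[] da, int dp, int dl) {
        int start = sp;
        while (sp < sl && dp < dl && sa[sp] >= 0) {
            da[dp++] = (char) sa[sp++];
        }
        return sp - start;
    }

    // Returns the number of chars written to da, or -1 if the input is
    // malformed or da is too small.
    static int decode(byte[] sa, char[] da) {
        int n = fastPath(sa, 0, sa.length, da, 0, da.length);
        if (n == sa.length) {
            return n; // the fast path handled everything
        }
        // Back off: let the general decoder handle the rest, including all
        // validation of multi-byte sequences.
        ByteBuffer src = ByteBuffer.wrap(sa, n, sa.length - n);
        CharBuffer dst = CharBuffer.wrap(da, n, da.length - n);
        CoderResult cr = StandardCharsets.UTF_8.newDecoder().decode(src, dst, true);
        return cr.isUnderflow() ? dst.position() : -1;
    }
}
```

With this split, the fast path never has to reject anything itself, so malformed input is still diagnosed by the general decoder.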
> Would be interesting to see performance if you vectorize only ASCII copy loop which seems most common case and you don't need table:
That could definitely be possible. However, I'm not sure whether the ASCII-only loop is really the most common case. Moreover, with the intrinsic as currently written, as soon as a non-ASCII character is encountered it backs off and falls back to the loop-based algorithm. We could, however, mix the ASCII-only fast path into the more general loop, with something along the lines of:
```
while (sp < sl) {
    // Fast path: at least 8 source bytes and 8 destination chars remain.
    if (sp + 7 < sl && dp + 7 < dl) {
        long b = unsafe.getLong(sa, Unsafe.ARRAY_BYTE_BASE_OFFSET + sp * Unsafe.ARRAY_BYTE_INDEX_SCALE);
        // All 8 bytes are ASCII exactly when none of them has its high bit set.
        if ((b & 0x8080808080808080L) == 0) {
            da[dp + 0] = (char)sa[sp + 0];
            da[dp + 1] = (char)sa[sp + 1];
            da[dp + 2] = (char)sa[sp + 2];
            da[dp + 3] = (char)sa[sp + 3];
            da[dp + 4] = (char)sa[sp + 4];
            da[dp + 5] = (char)sa[sp + 5];
            da[dp + 6] = (char)sa[sp + 6];
            da[dp + 7] = (char)sa[sp + 7];
            sp += 8;
            dp += 8;
            continue;
        }
    }
    int b1 = sa[sp];
    if (b1 >= 0) {
...
```
From preliminary results, the gain is not substantial on ASCII-only text, and is minimal on `"Héllo World!".repeat(1000 * 1000)`.
I'll check whether we can get some gains with the Vector API on that case.
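For anyone who wants to try the 8-bytes-at-a-time ASCII test outside the JDK sources, here is a self-contained sketch. It is an illustration under assumptions, not the code I measured: it reads the `long` through a `VarHandle` byte-array view instead of the internal `Unsafe`, it tests the high bit of every byte with the mask `0x8080808080808080L`, and the names `AsciiBlockSketch` and `copyAsciiPrefix` are made up for this example.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class AsciiBlockSketch {
    // Views a byte[] as longs; plain get() tolerates unaligned indices.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // High bit of each of the 8 bytes; the mask is the same in either byte order.
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Copies the leading run of ASCII bytes from sa[sp..sl) into da[dp..dl),
    // 8 bytes at a time where possible. Returns the number of bytes copied.
    static int copyAsciiPrefix(byte[] sa, int sp, int sl, char[] da, int dp, int dl) {
        int start = sp;
        while (sp + 7 < sl && dp + 7 < dl) {
            long b = (long) LONG_VIEW.get(sa, sp);
            if ((b & HIGH_BITS) != 0) {
                break; // at least one non-ASCII byte in this block
            }
            for (int i = 0; i < 8; i++) {
                da[dp + i] = (char) sa[sp + i];
            }
            sp += 8;
            dp += 8;
        }
        // Tail: finish byte by byte up to the first non-ASCII byte.
        while (sp < sl && dp < dl && sa[sp] >= 0) {
            da[dp++] = (char) sa[sp++];
        }
        return sp - start;
    }
}
```

The return value tells the caller where the general multi-byte loop has to take over.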
Thank you for your feedback.
Ludovic