RFR: 8170785: Excessive allocation in ParseUtil.encodePath

Wed Jan 4 15:57:50 UTC 2017

Hi Claes,

ok, I didn't spot any bugs so its fine as is.

On 1/3/2017 5:52 PM, Claes Redestad wrote:
> Hi Roger,
>
> On 2017-01-03 22:22, Roger Riggs wrote:
>> Hi Claes,
>>
>> So Windows didn't suffer from having the '\' separator.
>
> relatively speaking Windows won't see any speed-up when dealing
> with file system paths, but most places where we see encodePath in
> profiles is actually not files: jar/runtime image paths, HTTP etc...
>
> Thus Windows shouldn't be left too far behind for typical workloads. :-)
>
>>
>> ParseUtil:
>>
>> firstEncodeIndex:121:
>>   'a' - 'z' seems more frequent than '/' or '.'; does it improve the
>> stats to move that range to the beginning of the if.
>>   (yes the compiler can re-order).
>>
>> line 125:
>>     Since 127 is known to need encoding it could be >= 0x007f
>>
>> line  136: I suppose the arraycopy intrinsic already optimizes length 
>> == 0;
>>
>> Line 134:  I question the math on  * 2 + 16 -index;  (But this is
>> pre-existing code)
>>    if there were lots of characters that needed encoding it might be
>> possible to overflow the array since 1 char is replaced by at least 3
>> and up to 9.
>>    16 seems like a questionable fudge factor; but perhaps it has not
>> been a problem in practice.
>
> As the main objective here is to get rid of the allocations while not
> regressing throughput when they can't be avoided I resisted *most*
> urges to micro-optimize. :-)
>
> The weird heuristic with len * 2 + 16 is ugly, yes, but I prefer to
> leave it mostly alone and perhaps revisit this for a throughput
> performance enhancement in the future.
>
> '/' can actually be rather frequent in paths, and while '.' is likely
> less common and should have been inserted after a-z, adding the check
> improved performance of encoding to-be-rather-common paths like
> "/java.base/bla/bla/bla/" remarkably while not visibly regressing
> anything else.
>
> What is clear from micros is that having to load and call into the
> BitSet has a relatively large overhead compared to an extra comparison
> (which is likely why there are separate checks for a-z, A-Z and 0-9 in
> the first place).
A good optimizer should be able to bisect and get the result efficiently 
given the constant values.
>
> I should actually see if we can't remove the BitSet entirely: I think
> the java.net.URI approach with final long masks and a simple shift and
> check might be more efficient and relatively easy to duplicate here.
A couple of fixed 64-bit compile time bitmasks computed from the set of 
encoded chars would be pretty efficient and compact too.
I did wonder if there was a way to have a common utility for the encoding;
but that's probably something to save for later.

Roger
>
> Thanks!
>
> /Claes
>
>>
>> $.02, Roger
>>
>>
>>
>> On 1/3/2017 9:46 AM, Claes Redestad wrote:
>>> Hi,
>>>
>>> some users reports high allocation rates in ParseUtil.encodePath,
>>> regardless of whether paths actually need to be encoded or not.
>>> Since this is a commonly used utility it makes sense.
>>>
>>> Webrev: http://cr.openjdk.java.net/~redestad/8170785/webrev.01
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8170785
>>>
>>> This patch provides a semantically neutral fast-path for cases when
>>> the path does not need to be encoded (up to 5x speedup), reduces
>>> allocation when the string has a prefix that does not need to be
>>> encoded (1-2x speedup) and no regression when using a separator
>>> that's not '/' or the first char needs encoding.
>>>
>>> Interpreted performance is not affected much either: small positive
>>> when no encoding is needed, neutral or negligible regression
>>> otherwise.
>>>
>>> Thanks!
>>>
>>> /Claes
>>