RFR(L): 8180032: Unaligned pointer dereference in ClassFileParser

Fri May 26 02:03:28 UTC 2017

On 26/05/2017 8:00 AM, Mikael Vidstedt wrote:
> I’ve been spending the last few days going down a rabbit hole of what 
> turned out to be a totally unrelated performance issue. Long story 
> short: startup time is affected, in some cases significantly, by the 
> length of the path to the JDK in the file system. More on that in a 
> separate thread/at another time.

https://bugs.openjdk.java.net/browse/JDK-7196911 ?

> After having looked at generated code, and having run benchmarks 
> stressing class loading/startup time my conclusion is that this change 
> is performance neutral. For example, the alignment check introduced in 
> bytes_x86.hpp get_native/put_native collapses down to a single 
> unconditional load unless, of course, it’s done in a loop in which case 
> it gets unrolled+vectorized.
> 
> I also ran hs-tier2, which should more than cover the changes in 
> question, and there were no failures.
> 
> With that in mind I would like to push the change in its current form[1] 
> and handle a few things as follow-up work (roughly in order):
> 
> * Introduce typedefs in classFileParser for potentially unaligned 
> pointer types
> * Always using memcpy to do the read - need to investigate how the 
> primitives are used wrt. tearing
> * Unify the Bytes::* impl across platforms - need to investigate/verify 
> the implications on performance
> 
> Reasonable?

Reasonable.

Cheers,
David

> Cheers,
> Mikael
> 
> [1] 
> http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.01/hotspot/webrev/
> 
>> On May 18, 2017, at 5:18 PM, David Holmes <david.holmes at oracle.com 
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>> On 19/05/2017 9:19 AM, Mikael Vidstedt wrote:
>>>
>>>> On May 18, 2017, at 3:50 PM, David Holmes <david.holmes at oracle.com 
>>>> <mailto:david.holmes at oracle.com>
>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>
>>>> Hi Mikael,
>>>>
>>>> On 19/05/2017 8:15 AM, Mikael Vidstedt wrote:
>>>>>
>>>>>> On May 18, 2017, at 2:59 AM, Robbin Ehn <robbin.ehn at oracle.com 
>>>>>> <mailto:robbin.ehn at oracle.com>
>>>>>> <mailto:robbin.ehn at oracle.com>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 05/17/2017 03:46 AM, Kim Barrett wrote:
>>>>>>>> On May 9, 2017, at 6:40 PM, Mikael Vidstedt
>>>>>>>> <mikael.vidstedt at oracle.com <mailto:mikael.vidstedt at oracle.com> 
>>>>>>>> <mailto:mikael.vidstedt at oracle.com>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Warning: It may be wise to stock up on coffee or tea before
>>>>>>>> reading this.
>>>>>>>>
>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8180032
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.00/hotspot/webrev/
>>>>>>>> <http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.00/hotspot/webrev/>
>>>>>>> Not a review, just a question.
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> src/cpu/x86/vm/bytes_x86.hpp
>>>>>>> 40   template <typename T>
>>>>>>> 41   static inline T get_native(const void* p) {
>>>>>>> 42     assert(p != NULL, "null pointer");
>>>>>>> 43
>>>>>>> 44     T x;
>>>>>>> 45
>>>>>>> 46     if (is_ptr_aligned(p, sizeof(T))) {
>>>>>>> 47       x = *(T*)p;
>>>>>>> 48     } else {
>>>>>>> 49       memcpy(&x, p, sizeof(T));
>>>>>>> 50     }
>>>>>>> 51
>>>>>>> 52     return x;
>>>>>>> I'm looking at this and wondering if there's a good reason to not 
>>>>>>> just
>>>>>>> unconditionally use memcpy here.  gcc -O will generate a single move
>>>>>>> instruction for that on x86_64.  I'm not sure what happens on 32bit
>>>>>>> with an 8 byte value, but I suspect it will do something similarly
>>>>>>> sensible, e.g. 2 4 byte memory to memory transfers.
>>>>>>
>>>>>> Unconditionally memcpy would be nice!
>>>>>>
>>>>>> Are going to look into that Mikael?
>>>>>
>>>>> It’s complicated…
>>>>>
>>>>> We may be able to switch, but there is (maybe) a subtle reason why
>>>>> the alignment check is in there: to avoid word tearing..
>>>>>
>>>>> Think of two threads racing:
>>>>>
>>>>> * thread 1 is writing to the memory location X
>>>>> * thread 2 is reading from the same memory location X
>>>>>
>>>>> Will thread 2 always see a consistent value (either the original
>>>>> value or the fully updated value)?
>>>>
>>>> We're talking about internal VM load and stores rights? For those we
>>>> need to use appropriate atomic routine if there are potential races.
>>>> But we should never be mixing these kind of accesses with Java level
>>>> field accesses - that would be very broken.
>>>
>>> That seems reasonable, but for my untrained eye it’s not trivially true
>>> that relaxing the implementation is correct for all the uses of the
>>> get/put primitives. I am therefore a bit reluctant to do so without
>>> understanding the implications.
>>
>> If a Copy routine doesn't have Atomic in its name then I don't expect 
>> atomicity. Even then unaligned accesses are not atomic even in the 
>> Atomic routine!
>>
>> But I'm not clear exactly how all these routines get used.
>>
>>>> For classFileparser we should no concurrency issues.
>>>
>>> That seems reasonable. What degree of certainty does your “should” come
>>> with? :)
>>
>> Pretty high. We're parsing a stream of bytes and writing values into 
>> local structures that will eventually be passed across to a klass 
>> instance, which in turn will eventually be published via the SD as a 
>> loaded class. The actual parsing phase is purely single-threaded.
>>
>> David
>>
>>> Cheers,
>>> Mikael
>>>
>>>>
>>>> David
>>>>
>>>>> In the unaligned/memcpy case I think we can agree that there’s
>>>>> nothing preventing the compiler from doing individual loads/stores of
>>>>> the bytes making up the data. Especially in something like slowdebug
>>>>> that becomes more or less obvious - memcpy most likely isn’t
>>>>> intrinsified and is quite likely just copying a byte at a time. Given
>>>>> that the data is, in fact, unaligned, there is really no simple way
>>>>> to prevent word tearing, so I’m pretty sure that we never depend on
>>>>> it - if needed, we’re likely to already have some higher level
>>>>> synchronization in place guarding the accesses. And the fact that the
>>>>> other, non-x86 platforms already do individual byte loads/stores when
>>>>> the pointer is unaligned indicates is a further indication that
>>>>> that’s the case.
>>>>>
>>>>> However, the aligned case is where stuff gets more interesting. I
>>>>> don’t think the C/C++ spec guarantees that accessing a memory
>>>>> location using a pointer of type T will result in code which does a
>>>>> single load/store of size >= sizeof(T), but for all the compilers we
>>>>> *actually* use that’s likely to be the case. If it’s true that the
>>>>> compilers don’t splits the memory accesses, that means we won’t have
>>>>> word tearing when using the Bytes::get/put methods with *aligned*
>>>>> pointers.
>>>>>
>>>>> If I switch to always using memcpy, there’s a risk that it introduces
>>>>> tearing problems where earlier we had none. Two questions come to mind:
>>>>>
>>>>> * For the cases where the get/put methods get used *today*, is that a
>>>>> problem?
>>>>> * What happens if somebody in the *future* decides that put_Java_u4
>>>>> seems like a great thing to use to write to a Java int field on the
>>>>> Java heap, and a Java thread is racing to read that same data?
>>>>>
>>>>>
>>>>> All that said though, I think this is worth exploring and it may well
>>>>> turn out that word tearing really isn’t a problem. Also, I believe
>>>>> there may be opportunities to further clean up this code and perhaps
>>>>> unify it a bit across the various platforms.
>>>>>
>>>>> And *that* said, I think the change as it stands is still an
>>>>> improvement, so I’m leaning towards pushing it and filing an
>>>>> enhancement and following up on it separately. Let me know if you
>>>>> strongly feel that this should be looked into and addressed now and I
>>>>> may reconsider :)
>>>>>
>>>>> Cheers,
>>>>> Mikael
>>>>>
>>>>>>
>>>>>> /Robbin
>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>
>