RFR(L): 8180032: Unaligned pointer dereference in ClassFileParser

Thu May 25 22:00:38 UTC 2017

I’ve been spending the last few days going down a rabbit hole of what turned out to be a totally unrelated performance issue. Long story short: startup time is affected, in some cases significantly, by the length of the path to the JDK in the file system. More on that in a separate thread/at another time.

After having looked at generated code, and having run benchmarks stressing class loading/startup time my conclusion is that this change is performance neutral. For example, the alignment check introduced in bytes_x86.hpp get_native/put_native collapses down to a single unconditional load unless, of course, it’s done in a loop in which case it gets unrolled+vectorized.

I also ran hs-tier2, which should more than cover the changes in question, and there were no failures.

With that in mind I would like to push the change in its current form[1] and handle a few things as follow-up work (roughly in order):

* Introduce typedefs in classFileParser for potentially unaligned pointer types
* Always using memcpy to do the read - need to investigate how the primitives are used wrt. tearing
* Unify the Bytes::* impl across platforms - need to investigate/verify the implications on performance

Reasonable?

Cheers,
Mikael

[1] http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.01/hotspot/webrev/ <http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.01/hotspot/webrev/>
> On May 18, 2017, at 5:18 PM, David Holmes <david.holmes at oracle.com> wrote:
> 
> On 19/05/2017 9:19 AM, Mikael Vidstedt wrote:
>> 
>>> On May 18, 2017, at 3:50 PM, David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>> 
>>> Hi Mikael,
>>> 
>>> On 19/05/2017 8:15 AM, Mikael Vidstedt wrote:
>>>> 
>>>>> On May 18, 2017, at 2:59 AM, Robbin Ehn <robbin.ehn at oracle.com
>>>>> <mailto:robbin.ehn at oracle.com>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> On 05/17/2017 03:46 AM, Kim Barrett wrote:
>>>>>>> On May 9, 2017, at 6:40 PM, Mikael Vidstedt
>>>>>>> <mikael.vidstedt at oracle.com <mailto:mikael.vidstedt at oracle.com>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Warning: It may be wise to stock up on coffee or tea before
>>>>>>> reading this.
>>>>>>> 
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8180032
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.00/hotspot/webrev/
>>>>>>> <http://cr.openjdk.java.net/~mikael/webrevs/8180032/webrev.00/hotspot/webrev/>
>>>>>> Not a review, just a question.
>>>>>> ------------------------------------------------------------------------------
>>>>>> src/cpu/x86/vm/bytes_x86.hpp
>>>>>> 40   template <typename T>
>>>>>> 41   static inline T get_native(const void* p) {
>>>>>> 42     assert(p != NULL, "null pointer");
>>>>>> 43
>>>>>> 44     T x;
>>>>>> 45
>>>>>> 46     if (is_ptr_aligned(p, sizeof(T))) {
>>>>>> 47       x = *(T*)p;
>>>>>> 48     } else {
>>>>>> 49       memcpy(&x, p, sizeof(T));
>>>>>> 50     }
>>>>>> 51
>>>>>> 52     return x;
>>>>>> I'm looking at this and wondering if there's a good reason to not just
>>>>>> unconditionally use memcpy here.  gcc -O will generate a single move
>>>>>> instruction for that on x86_64.  I'm not sure what happens on 32bit
>>>>>> with an 8 byte value, but I suspect it will do something similarly
>>>>>> sensible, e.g. 2 4 byte memory to memory transfers.
>>>>> 
>>>>> Unconditionally memcpy would be nice!
>>>>> 
>>>>> Are going to look into that Mikael?
>>>> 
>>>> It’s complicated…
>>>> 
>>>> We may be able to switch, but there is (maybe) a subtle reason why
>>>> the alignment check is in there: to avoid word tearing..
>>>> 
>>>> Think of two threads racing:
>>>> 
>>>> * thread 1 is writing to the memory location X
>>>> * thread 2 is reading from the same memory location X
>>>> 
>>>> Will thread 2 always see a consistent value (either the original
>>>> value or the fully updated value)?
>>> 
>>> We're talking about internal VM load and stores rights? For those we
>>> need to use appropriate atomic routine if there are potential races.
>>> But we should never be mixing these kind of accesses with Java level
>>> field accesses - that would be very broken.
>> 
>> That seems reasonable, but for my untrained eye it’s not trivially true
>> that relaxing the implementation is correct for all the uses of the
>> get/put primitives. I am therefore a bit reluctant to do so without
>> understanding the implications.
> 
> If a Copy routine doesn't have Atomic in its name then I don't expect atomicity. Even then unaligned accesses are not atomic even in the Atomic routine!
> 
> But I'm not clear exactly how all these routines get used.
> 
>>> For classFileparser we should no concurrency issues.
>> 
>> That seems reasonable. What degree of certainty does your “should” come
>> with? :)
> 
> Pretty high. We're parsing a stream of bytes and writing values into local structures that will eventually be passed across to a klass instance, which in turn will eventually be published via the SD as a loaded class. The actual parsing phase is purely single-threaded.
> 
> David
> 
>> Cheers,
>> Mikael
>> 
>>> 
>>> David
>>> 
>>>> In the unaligned/memcpy case I think we can agree that there’s
>>>> nothing preventing the compiler from doing individual loads/stores of
>>>> the bytes making up the data. Especially in something like slowdebug
>>>> that becomes more or less obvious - memcpy most likely isn’t
>>>> intrinsified and is quite likely just copying a byte at a time. Given
>>>> that the data is, in fact, unaligned, there is really no simple way
>>>> to prevent word tearing, so I’m pretty sure that we never depend on
>>>> it - if needed, we’re likely to already have some higher level
>>>> synchronization in place guarding the accesses. And the fact that the
>>>> other, non-x86 platforms already do individual byte loads/stores when
>>>> the pointer is unaligned indicates is a further indication that
>>>> that’s the case.
>>>> 
>>>> However, the aligned case is where stuff gets more interesting. I
>>>> don’t think the C/C++ spec guarantees that accessing a memory
>>>> location using a pointer of type T will result in code which does a
>>>> single load/store of size >= sizeof(T), but for all the compilers we
>>>> *actually* use that’s likely to be the case. If it’s true that the
>>>> compilers don’t splits the memory accesses, that means we won’t have
>>>> word tearing when using the Bytes::get/put methods with *aligned*
>>>> pointers.
>>>> 
>>>> If I switch to always using memcpy, there’s a risk that it introduces
>>>> tearing problems where earlier we had none. Two questions come to mind:
>>>> 
>>>> * For the cases where the get/put methods get used *today*, is that a
>>>> problem?
>>>> * What happens if somebody in the *future* decides that put_Java_u4
>>>> seems like a great thing to use to write to a Java int field on the
>>>> Java heap, and a Java thread is racing to read that same data?
>>>> 
>>>> 
>>>> All that said though, I think this is worth exploring and it may well
>>>> turn out that word tearing really isn’t a problem. Also, I believe
>>>> there may be opportunities to further clean up this code and perhaps
>>>> unify it a bit across the various platforms.
>>>> 
>>>> And *that* said, I think the change as it stands is still an
>>>> improvement, so I’m leaning towards pushing it and filing an
>>>> enhancement and following up on it separately. Let me know if you
>>>> strongly feel that this should be looked into and addressed now and I
>>>> may reconsider :)
>>>> 
>>>> Cheers,
>>>> Mikael
>>>> 
>>>>> 
>>>>> /Robbin
>>>>> 
>>>>>> ------------------------------------------------------------------------------
>>