RFR: 8224974: Implement JEP 352

Andrew Dinn adinn at redhat.com
Tue Aug 6 12:12:48 UTC 2019


Hi Aleksey/Boris,

This is a response to both your last review posts. New webrev link is at
the end.

On 01/08/2019 12:16, Aleksey Shipilev wrote:
> On 7/31/19 12:55 PM, Andrew Dinn wrote:
  . . .
> I am more concerned that the writeback call enters the pre sync stub unnecessarily.

The stub? I hope you mean when executing the native call as opposed to
the JITted intrinsic? The stub is only called on a cold path when a
native call proper happens. By contrast, the intrinsic translation for
CacheWBPreSync on AArch64 and x86_64 does not insert any instructions
into the generated code (and most especially not a call to the stub).
Here it is:

  instruct cacheWBPreSync()
  %{
    predicate(VM_Version::supports_data_cache_line_flush());
    match(CacheWBPreSync);

    ins_cost(100);
    format %{"cache wb presync" %}
    ins_encode %{
      __ cache_wbsync(true);
    %}
    ins_pipe(pipe_slow); // XXX
  %}


  void MacroAssembler::cache_wbsync(bool is_pre)
  {
    assert(VM_Version::supports_clflush(), "clflush should be available");
    bool optimized = VM_Version::supports_clflushopt();
    bool no_evict = VM_Version::supports_clwb();

    // pick the correct implementation

    if (!is_pre && (optimized || no_evict)) {
      // need an sfence for post flush when using clflushopt or clwb
      // otherwise no need for any synchronization

      sfence();
    }
  }

> I had the idea to do this more efficiently, and simplify code at the same time: how about emitting
> CacheWBPreSync nodes, but emitting nothing for them in .ad matchers? That would leave generic code
> generic, and architectures would then be able to avoid the stub altogether for pre sync code. This
> would simplify current stub generators too, I think: you don't need to pass arguments to them.

I believe the intrinsic behaviour you are asking for is effectively what
is implemented (as shown above). The .ad match rules for the PreSync and
PostSync nodes both call MacroAssembler::cache_wbsync. For pre-sync it
emits nothing. For post-sync it emits sfence when the writeback is
implemented using clwb or clflushopt and nothing if writeback relies on
clflush.
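
For reference, here is the matching post-sync rule, the mirror image of
the pre-sync rule shown above (paraphrased from the x86_64 .ad file in
the webrev rather than quoted verbatim; the AArch64 rule is analogous):

  instruct cacheWBPostSync()
  %{
    predicate(VM_Version::supports_data_cache_line_flush());
    match(CacheWBPostSync);

    ins_cost(100);
    format %{"cache wb postsync" %}
    ins_encode %{
      __ cache_wbsync(false);
    %}
    ins_pipe(pipe_slow); // XXX
  %}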

> This leaves calling via Unsafe. I believe pulling up the isPre choice to the stub generation time
> would be beneficial. That is, generate *two* stubs: StubRoutines::data_cache_writeback_pre_sync()
> and StubRoutines::data_cache_writeback_post_sync(). If arch does not need the pre_sync, generate nop
> version of pre_sync().

I don't really see any point to doing this. We can generate two stubs,
one executing a nop and one executing a nop/sfence according to need. Or
we can have one stub with a branch on the sync type and branch targets
that execute either a nop or a nop/sfence as needed. The difference in
performance of the stub is minor and irrelevant. The difference in
generation time and memory use is minor and irrelevant. What are you
trying to gain here?
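
For concreteness, the one-stub shape looks roughly like this (an
illustrative sketch, not the verbatim generator from the webrev; it
assumes the sync type arrives as an int in c_rarg0):

  address generate_data_cache_writeback_sync() {
    StubCodeMark mark(this, "StubRoutines", "data_cache_writeback_sync");
    Label skip;
    address start = __ pc();
    __ enter();
    // branch on the sync type: non-zero means pre sync, which needs nothing
    __ cmpl(c_rarg0, 0);
    __ jcc(Assembler::notEqual, skip);
    // post sync: emits an sfence when writeback uses clwb or clflushopt
    __ cache_wbsync(false);
    __ bind(skip);
    __ leave();
    __ ret(0);
    return start;
  }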

> This is not a strong requirement from my side. I do believe it would make code a bit more
> straight-forward.

Am I missing something here? Or did you simply miss that the intrinsic
translation inserts no code for the presync?

>>> === src/hotspot/cpu/x86/assembler_x86.cpp
>>>
>>> It feels like these comments are redundant, especially L8630 and L8646 which mention magic values
>>> "6" and "7", not present in the code:
> 
> ...
> 
>> 8624   // 0x66 is instruction prefix
>>
>> 8627   // 0x0f 0xAE is opcode family
>>
>> 8630   // rdi == 7 is extended opcode byte
>> . . .
>>
>> Given that the code is simply stuffing numbers (whether supplied as
>> literals or as symbolic constants) into a byte stream I think these
>> comments are a help when it comes to cross-checking each specific
>> assembly against the corresponding numbers declared in the Intel
>> manuals. So, I don't really want to remove them. Would you prefer me to
>> reverse the wording as above?
> 
> I was merely commenting on the style: the rest of the file does not have comments like that. The
> positions of prefixes, opcode families, etc. are kinda implied by the code shape.

Yes, I too noticed that the rest of the file does not have any such
comments :-]

Given the highly variable shape of x86 machine code, I don't see any
reason not to start remedying that omission, even if the remedy is only
piecemeal. Comments may not be a great help to maintainers who already
know the code and the ISA really well, but they are not the only
audience. Even for those expert readers the comments provide a sanity
check.
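
To show what I mean, here is roughly how the comments line up against
the Intel SDM encoding of clflushopt, which is 66 0F AE /7 (a
paraphrase of the patch, not a verbatim quote):

  void Assembler::clflushopt(Address adr) {
    assert(VM_Version::supports_clflushopt(), "clflushopt should be available");
    // 0x66 is instruction prefix
    emit_int8(0x66);
    InstructionMark im(this);
    prefix(adr);
    // 0x0f 0xAE is opcode family
    emit_int8(0x0F);
    emit_int8((unsigned char)0xAE);
    // rdi == 7 is extended opcode byte, i.e. the /7 in "66 0F AE /7"
    emit_operand(rdi, adr);
  }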

>>> === src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>   // prefer clwb (potentially parallel writeback without evict)
>>   // otherwise prefer clflushopt (potentially parallel writeback
>>   // with evict)
>>   // otherwise fallback on clflush (serial writeback with evict)
>>
>> In the second case the comment is redundant because the need for an
>> sfence is covered by the existing comment inside the if:
>>
>>     // need an sfence for post flush when using clflushopt or clwb
>>     // otherwise no need for any synchronization
> 
> Yes, this would be good to add.

Ok, done.
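
For the record, the retained preference comments sit in the companion
writeback method, which reads roughly like this (paraphrased from the
webrev):

  void MacroAssembler::cache_wb(Address line) {
    // prefer clwb (potentially parallel writeback without evict)
    if (VM_Version::supports_clwb()) {
      clwb(line);
      return;
    }
    // otherwise prefer clflushopt (potentially parallel writeback
    // with evict)
    if (VM_Version::supports_clflushopt()) {
      clflushopt(line);
      return;
    }
    // otherwise fallback on clflush (serial writeback with evict)
    assert(VM_Version::supports_clflush(), "clflush should be available");
    clflush(line);
  }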

>>> === src/hotspot/share/opto/c2compiler.cpp
>>>
>>> Why inject new cases here, instead of at the end of switch? Saves sudden "break":
>>>
>>>  578     break;
>>>  579   case vmIntrinsics::_writeback0:
>>>  580     if (!Matcher::match_rule_supported(Op_CacheWB)) return false;
>>>  581     break;
>>>  582   case vmIntrinsics::_writebackPreSync0:
>>>  583     if (!Matcher::match_rule_supported(Op_CacheWBPreSync)) return false;
>>>  584     break;
>>>  585   case vmIntrinsics::_writebackPostSync0:
>>>  586     if (!Matcher::match_rule_supported(Op_CacheWBPostSync)) return false;
>>>  587     break;
>>
>> I placed them here so they were close to the other Unsafe intrinsics. In
>> particular they precede _allocateInstance, an ordering which is also the
>> case in the declarations in vmSymbols.hpp.
>>
>> In what sense do you mean that an extra 'break' is saved? That would be
>> true as regards the textual layout. It wouldn't affect the logic of
>> folding different ranges of values into branching range tests (which is
>> only determined by the numeric values of the intrinsics). If you are
>> concerned about the former then I would argue that placing the values in
>> declaration order seems to me to be the more important concern.
> 
> I don't think we have to follow whatever ordering mess in vmSymbols.hpp. New code cuts into the last
> case block in that switch, which is mostly about "we know about these symbols, they are
> falling-through to the break". Adding cases with Matcher::match_rule_supported seems odd there. If
> anything, those new cases should be moved upwards to other cases, e.g. after vmIntrinsics::_minD.

As you wish. I have moved them to immediately preceding the large
unbroken block.

>>> === src/hotspot/share/prims/unsafe.cpp
>>>
>>> Do we really need this function pointer mess?
>>>
>>>  457   void (*wb)(void *);
>>>  458   void *a = addr_from_java(line);
>>>  459   wb = (void (*)(void *)) StubRoutines::data_cache_writeback();
>>>  460   assert(wb != NULL, "generate writeback stub!");
>>>  461   (*wb)(a);
>>>
>>> Seems easier to:
>>>
>>>  assert(StubRoutines::data_cache_writeback() != NULL, "sanity");
>>>  StubRoutines::data_cache_writeback()(addr_from_java(line));
>> Hmm, "that whole brevity thing" again? Well, I guess you must now call
>> me "El Duderino".
> 
> Well, that is, like, your opinion, man. C++ is messy if we allow it to be!
> 
>> Are you sure that all the compilers used to build openJDK will happily
>> eat the second line of your replacement? If you can guarantee that I'll
>> happily remove the type declarations.
> 
> I think they do: there are uses like that in the same file already, for example:
> 
>   if (StubRoutines::unsafe_arraycopy() != NULL) {
>     StubRoutines::UnsafeArrayCopy_stub()(src, dst, sz);
>   } else {
>     Copy::conjoint_memory_atomic(src, dst, sz);
>   }

Hmm, your suggested replacement does not in fact compile. Indeed, the
example you cite is comparing apples with pears -- note that the getter
employed in the call differs from the getter employed in the check. The
following macro magic is provided for that example in stubRoutines.hpp:

  static address unsafe_arraycopy()     { return _unsafe_arraycopy; }

  typedef void (*UnsafeArrayCopyStub)(const void* src, void* dst,
                                      size_t count);
  static UnsafeArrayCopyStub UnsafeArrayCopy_stub() {
    return CAST_TO_FN_PTR(UnsafeArrayCopyStub, _unsafe_arraycopy);
  }

I have provided similar magic to hide the function pointer details for
the writeback and writeback_sync stubs.
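
Concretely, the new accessors follow the same pattern as the arraycopy
ones (sketched here by analogy; see stubRoutines.hpp in the webrev for
the exact declarations):

  typedef void (*DataCacheWritebackStub)(void *);
  static DataCacheWritebackStub DataCacheWriteback_stub() {
    return CAST_TO_FN_PTR(DataCacheWritebackStub, _data_cache_writeback);
  }

  typedef void (*DataCacheWritebackSyncStub)(bool);
  static DataCacheWritebackSyncStub DataCacheWritebackSync_stub() {
    return CAST_TO_FN_PTR(DataCacheWritebackSyncStub, _data_cache_writeback_sync);
  }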

Latest webrev against current jdk/jdk, including:

  the changes agreed in the review discussion above
  fixes for the AArch64/x86_32 build issues reported by Boris Ulasevich
  the new test promised to Boris Ulasevich


  http://cr.openjdk.java.net/~adinn/8224974/webrev.11

Testing:

The x86_32 and AArch64 builds now complete, and the resulting JDKs run ok.

The pmem-specific tests (PmemTest, MapFail) pass with the expected
outcomes on x86_64 and AArch64.

PmemTest is skipped on x86_32 (as expected). MapFail passes on x86_32
(it expects the map to be unsupported).

I have passed the patch on for more thorough testing on simulated and
real NVRAM using our middleware stack.

I am still waiting for confirmation of a submit job.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

