RFR: aarch32: softfp patch for template interpreter

Fri Oct 21 12:30:18 UTC 2016

Hi, Ed, thank you for your kindness. I make some modifications (see details below) according to your suggestions and will send a new patch with more specific description as soon as I finish the test. By the way the patch part of sharedRuntime_aarch32.cpp is totally missing, sorry for my mistake.

Best regards.

-----Original Message-----
>From: Edward Nevill [mailto:edward.nevill at gmail.com] 
>Sent: Wednesday, October 19, 2016 7:42 PM
>To: yangyongyong; aarch32-port-dev at openjdk.java.net
>Subject: Re: RFR: aarch32: softfp patch for template interpreter
>
>Hi yangyongyong,
>
>On Mon, 2016-10-17 at 12:53 +0000, yangyongyong wrote:
>>  
>> Attached is a softfp patch for template interpreter. Two pairs of macros are defined in globalDefinitions_aarch32.hpp to isolate softfp code: HARD_FLOAT_CC/ SOFT_FLOAT_CC and HARD_FLOAT/ SOFT_FLOAT.
>
>Attachments are stripped when sent to the OpenJDK mailing lists. If you want to post patches you should put them on a web site and post a link.
>
>I have put your patch at
>
>http://cr.openjdk.java.net/~enevill/softfp/161014-softfp.hg.diff
>
>so others can review it.
>
>I have looked through the patch and have a few comments.
>
>> HARD_FLOAT_CC/SOFT_FLOAT_CC defines C calling convention. HARD_FLOAT are used to indicate that a platform can take advantage of hardfp instructions regardless of its calling convention, while SOFT_FLOAT means hardfp instructions are not supported or are disallowed to use.
>>  
>> Only three combinations are possible configurations for ARM platforms:
>> 1. HARD_FLOAT/ HARD_FLOAT_CC are defined when __ARM_PCS_VFP is defined.
>> 2. HARD_FLOAT/ SOFT_FLOAT_CC are defined when gcc runs with "-mfloat-abi=softfp" option.
>> 3. SOFT_FLOAT/ SOFT_FLOAT_CC when gcc runs with "-mfloat-abi=soft" option.
>
>Some instructions on how to build the different variants would be useful.
>
>Should I define -mfloat-abi=softfp or -mfloat-abi=soft in CFLAGS, or CXXFLAGS or should I use the --with-extra-cflags=XXX, or --with-extra-cxxflags=XXX. Should I build with --with-jvm-variant=core, or will it work if I build with --with-jvm-variant=client and then use -Xint to run only the template interpreter.
>
>I can probably work this out myself, as can most readers here, but why should I? If you want people to build/test/review your patch give clear instructions on how to build with it.
>
>Some comments on the patch itself
>
>@@ -1733,6 +1742,18 @@
>   if(imm_instr(decode, Rd, Rn, imm, cond, s)) return;
>   if(imm_instr(cpart, Rd, Rn, -imm, cond, s)) return;
>
>+  if (Rd == Rn) {
>+    // TODO: may use 4 adds here, need optimization.
>+    int i;
>+    for (i = 0;i < 4;i++) {
>+      uint32_t imm8 = (imm & (0xFF << (i * 8)));
>+      if (imm8 == 0)
>+        continue;
>+      bool ret = imm_instr(decode, Rd, Rd, imm8, cond, s);
>+      assert(ret, "should be");
>+    }
>+    return;
>+  }
>
>
>This is patching the following in assembler_aarch32.cpp
>
>void Assembler::add_sub_imm(int decode, Register Rd, Register Rn, int imm,
>                            Condition cond, bool s) {
>
>I assume you have added this code because in places you do call add_sub_imm with immediates which cannot be handled in a single instruction and with Rd==Rn.
>
>However, this is not correct if the S bit is set. Eg.
>
>  SUBS	Rd, Rd, #0xcafebabe
>
>does not set the flags in the same way as
>
>  SUBS	Rd, Rd, #0xbe
>  SUBS  Rd, Rd, #0xba00
>  SUBS	Rd, Rd, #0xfe0000
>  SUBS	Rd, Rd, #0xca000000
>
>At the very least this needs an assert that you are not trying to set the flags (ie. s==false). However, I think it would be prefereable for the caller to pass in a temp which can be used in this case as it will generate better code using movw/movt/add. The temp can be option and set to noreg by default.
>

You are right. The reason we made this change is when the test case jdk/test/javax/sql/testng/test/rowset/webrowset/WebRowSetTests.java is run with -Xcomp, the following assert(Rd != Rn) fails with both Rd and Rn are r9. A simplified backtrace is:
#5 Assembler::add_sub_imm                  at assembler_aarch32.cpp
#6 Assembler::add                          at assembler_aarch32.hpp
#7 Address::lea                            at assembler_aarch32.cpp
#8 MacroAssembler::lea                     at macroAssembler_aarch32.hpp
#9 MacroAssembler::lookup_interface_method at macroAssembler_aarch32.cpp

Thank you for your suggestion. I make a fixup and it seems ok for the our local softfp template interpreter and hardfp client version. This fixup is roughly like as follow and is included in the new patch 161021-softfp.hg.diff:
/////////////////////////////// patch begin /////////////////////////////

--- a/hotspot/src/cpu/aarch32/vm/macroAssembler_aarch32.cpp
+++ b/hotspot/src/cpu/aarch32/vm/macroAssembler_aarch32.cpp
@@ -818,7 +818,7 @@ void MacroAssembler::lookup_interface_method(Register recv_klass,
   // lea(recv_klass, Address(recv_klass, itable_index, Address::times_ptr, itentry_off));
   lea(recv_klass, itable_index.is_register() ?
           Address(recv_klass, itable_index, lsl(2)) :
-          Address(recv_klass, itable_index.as_constant() << 2));
+          Address(recv_klass, itable_index.as_constant() << 2), r2);
   if (itentry_off)
     add(recv_klass, recv_klass, itentry_off);


--- a/hotspot/src/cpu/aarch32/vm/macroAssembler_aarch32.hpp
+++ b/hotspot/src/cpu/aarch32/vm/macroAssembler_aarch32.hpp
@@ -123,10 +123,10 @@ class MacroAssembler: public Assembler {
     atomic_inc(tmp1, tmp2);
   }
   // Load Effective Address
-  void lea(Register r, const Address &a) {
+  void lea(Register r, const Address &a, Register temp = rscratch1) {
     InstructionMark im(this);
     code_section()->relocate(inst_mark(), a.rspec());
-    a.lea(this, r);
+    a.lea(this, r, temp);
   }
 
   inline void addm(Address a, Register incr, Register scratch) {
/////////////////////////////// patch end /////////////////////////////

>
>   //TODO Misc instructions
>+  void clz(Register Rd, Register Rm, Condition cond = C_DFLT) {
>+    starti;
>+    f(cond, 31, 28), f(0b000101101111, 27, 16);
>+    rf(Rd, 12), f(0b11110001, 11, 4);
>+    rf(Rm, 0);
>+  }
>+
>
>I can't find any usage of this clz method. Is there meant to be any?

Yes, we optimize integer division, clz() is meant to be invoked there. We will push a patch for this in near future. For now I simply remove this code in the new patch.

>
>+void swap_register(Register r1, Register r2) {
>+  eor(r1, r1, r2);
>+  eor(r2, r1, r2);
>+  eor(r1, r1, r2);
>+}
>
>I dislike this use of eor to swap regs. It is possible to use a temp. mov tmp, r1; mov r1, r2; mov r2, tmp is clearer and may also be more efficient on some implementations.
>

Thanks for your opinion. Our new version of swap_register() would be:
void swap_register(Register r1, Register r2, Register temp = scratch1) {
  if (r1 != r2) {
    mov(temp, r1);
    mov(r1, r2);
    mov(r2, temp);
  }
}

>
>I am in the process of building and testing this which will take a few days as there are quite a few combinations (hardfp/softfp, core/client, release/debug) and I want to be sure that all combinations still work and nothing has been broken.
>
>It would be really helpful if someone else could help review this patch as well as it is quite a big patch.
>
>All the best,
>Ed.
>