<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Ok, the numbers in your benchmark match my expectation.</p>

    <p>So, if that approach doesn't match Unsafe performance in the 1brc
      challenge (or come very close to it), I'm afraid the culprit is
      not so much the bounds checks as the time it takes for the var
      handle machinery to warm up (inline, unroll and drop the checks).</p>

    <p>We're aware of the startup/warmup advantage of Unsafe vs. FFM and

      we will be doing more in order to bridge the gap (a similar

      argument holds for JNI calls vs. FFM linker calls).</p>

    <p>Maurizio<br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 15/01/2024 17:05, Quân Anh Mai

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CAPvyiyK+1QhsS-a6PsB1o+Wo43mVREBr-A=mA5Awz93-cXT_Pw@mail.gmail.com">

      

      <div dir="ltr">

        <div>Sure, I just thought that looking at the instruction count
          would be more helpful, since each machine exhibits different
          performance behaviour. For example, my machine is
          dependency-bound going from [2] to [1] below, which leads to a
          much smaller difference in execution time than the one
          measured on other machines (such as the test machine). The
          third implementation is similar to the first one, except that
          it uses safe accesses in the form of bounded memory segment
          accesses and varhandles.</div>
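        <p>As a rough illustration of the difference between a bounded segment access and a "universe" segment access, here is a minimal sketch. It assumes Java 22 or later (where the FFM API is final) and is not taken from the linked branches; the class name, the method names and the 16-byte buffer are hypothetical.</p>
<pre>
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentAccessSketch {
    // Assumed form of a "universe" segment: base address 0 and maximal size,
    // so a raw native address can be used directly as an offset.
    // reinterpret is a restricted method and may print a warning at run time.
    static final MemorySegment UNIVERSE = MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    // Bounded access: every offset is checked against the segment's real size.
    static long sumBounded(MemorySegment data) {
        long sum = 0;
        for (long i = 0; i < data.byteSize(); i++) {
            sum += data.get(ValueLayout.JAVA_BYTE, i);
        }
        return sum;
    }

    // Universe access: Unsafe-style raw addresses; nothing stops a read
    // outside the intended region, so the caller must pass a valid range.
    static long sumUniverse(long address, long size) {
        long sum = 0;
        for (long i = 0; i < size; i++) {
            sum += UNIVERSE.get(ValueLayout.JAVA_BYTE, address + i);
        }
        return sum;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment data = arena.allocate(16);
            for (int i = 0; i < 16; i++) {
                data.set(ValueLayout.JAVA_BYTE, i, (byte) i);
            }
            System.out.println(sumBounded(data));
            System.out.println(sumUniverse(data.address(), data.byteSize()));
        }
    }
}
```
</pre>
        <p>Both methods read the same bytes; the only difference is which segment the offset is checked against.</p>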

        <div><br>

        </div>

        <div>The JMH numbers for these versions follow. The benchmarked
          execute function is:</div>

        <div><br>

        </div>

        <div>    @Benchmark<br>

              public PoorManMap execute() throws IOException {<br>

                  try (var file = FileChannel.open(Path.of(FILE),

          StandardOpenOption.READ);<br>

                       var arena = Arena.ofShared()) {<br>

                      var data = file.map(MapMode.READ_ONLY, 0,

          file.size(), arena);<br>

                      return processFile(data, 0, data.byteSize());<br>

                  }<br>

              }<br>

        </div>

        <div><br>

        </div>

        <pre>CalculateAverage_merykitty.execute      avgt    5  7.422 ± 0.093  ms/op // unsafe [1]
CalculateAverage_merykitty.execute      avgt    5  7.686 ± 0.181  ms/op // universe segment [2]
CalculateAverage_merykitty.execute      avgt    5  9.009 ± 0.058  ms/op // varhandle [3]</pre>

        <div><br>

        </div>

        <div>[1]: <a href="https://github.com/merykitty/1brc/tree/main" moz-do-not-send="true">https://github.com/merykitty/1brc/tree/main</a></div>

        <div>[2]: <a href="https://github.com/merykitty/1brc/tree/removeunsafe" moz-do-not-send="true">https://github.com/merykitty/1brc/tree/removeunsafe</a></div>

        <div>[3]: <a href="https://github.com/merykitty/1brc/tree/varhandles" moz-do-not-send="true">https://github.com/merykitty/1brc/tree/varhandles</a></div>

        <div><br>

        </div>

        <div>Best regards,</div>

        <div>Quan Anh</div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Tue, 16 Jan 2024 at 00:29,

          Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div>

            <p><br>

            </p>

            <div>On 15/01/2024 15:44, Quân Anh Mai wrote:<br>

            </div>

            <blockquote type="cite">

              <div dir="ltr">Running the same program on 1e6 lines
                results in only 9e9 instructions, so I think the vast
                majority of the instruction count comes from the
                compiled code. Not using the universe segment is roughly
                equivalent to my previous version, which results in
                around 50% more instructions than using one, and almost
                double the instruction count of using Unsafe.</div>

            </blockquote>

            <p>Without looking at the program some more, it's hard for
              me to make sense of these numbers. I'm surprised that
              you don't see any difference when using an unbounded
              segment compared to a regular one. I wonder if the gap you
              are seeing is due to the JVM warming up, rather than peak
              performance being worse. Have you tried measuring peak
              performance with e.g. JMH? I would not expect to see a 20%
              difference there...<br>

            </p>
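            <p>To make the warm-up versus peak distinction concrete, here is a minimal sketch that times repeated invocations of the same method, so that early rounds (interpreter and JIT compilation) can be compared with later ones. It is only an illustration under stated assumptions: the workload is a hypothetical stand-in for the real parsing loop, and a harness such as JMH remains the appropriate tool for actual measurements.</p>
<pre>
```java
public class WarmupSketch {
    // Hypothetical workload standing in for the real parsing loop.
    static long work(byte[] data) {
        long sum = 0;
        for (byte b : data) {
            sum = sum * 31 + b;
        }
        return sum;
    }

    // Times `rounds` consecutive invocations of work(); the early rounds
    // include interpreter and JIT compilation time, the later ones less so.
    static long[] timeRounds(byte[] data, int rounds) {
        long[] nanos = new long[rounds];
        long sink = 0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            sink += work(data);
            nanos[r] = System.nanoTime() - start;
        }
        // Use the accumulated result so the calls cannot be discarded.
        if (sink == 42) {
            System.out.println(sink);
        }
        return nanos;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        long[] nanos = timeRounds(data, 50);
        System.out.println("first round: " + nanos[0] + " ns");
        System.out.println("last round:  " + nanos[nanos.length - 1] + " ns");
    }
}
```
</pre>
            <p>A run that finishes in about a second, like the perf stat runs in this thread, spends a larger share of its time in the early rounds than a JMH measurement taken after warm-up iterations.</p>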

            <p>Maurizio<br>

            </p>

            <blockquote type="cite">

              <div dir="ltr">

                <div><br>

                </div>

                <div>Regards,</div>

                <div>Quan Anh</div>

              </div>

              <br>

              <div class="gmail_quote">

                <div dir="ltr" class="gmail_attr">On Mon, 15 Jan 2024 at

                  23:09, Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;

                  wrote:<br>

                </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div>

                    <p>I think the increased instruction count is

                      normal, as C2 had to do more work to optimize the

                      bound checks away?</p>

                    <p>Is there any difference compared to the version

                      that doesn't use the universe segment?</p>

                    <p>Maurizio<br>

                    </p>

                    <div>On 15/01/2024 13:52, Quân Anh Mai wrote:<br>

                    </div>

                    <blockquote type="cite">

                      <div dir="ltr">

                        <div>Hi,</div>

                        <div><br>

                        </div>

                        <div>I have tried using a universe segment
                          instead of Unsafe, and storing the custom
                          hashmap buffer off-heap instead of in a
                          byte array. The output of perf stat on the
                          program is:</div>

                        <div><br>

                        </div>

<pre> Performance counter stats for 'sh calculate_average_merykittyunsafe.sh':

          13573.70 msec task-clock:u               #   10.942 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
            238460      page-faults:u              #   17.568 K/sec
       61995179870      cycles:u                   #    4.567 GHz
         261830581      stalled-cycles-frontend:u  #    0.42% frontend cycles idle
          93823680      stalled-cycles-backend:u   #    0.15% backend cycles idle
      137976098809      instructions:u             #    2.23  insn per cycle
                                                   #    0.00  stalled cycles per insn
       18373313803      branches:u                 #    1.354 G/sec
          43579782      branch-misses:u            #    0.24% of all branches

       1.240504612 seconds time elapsed

      12.841563000 seconds user
       0.652428000 seconds sys</pre>

                        <div><br>

                        </div>

                        <div>For comparison, this is the unsafe version:<br>

                          <div><br>

                          </div>

                          <div>
<pre> Performance counter stats for 'sh calculate_average_merykittyunsafe.sh':

          13327.46 msec task-clock:u               #   11.202 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
            269896      page-faults:u              #   20.251 K/sec
       61258348752      cycles:u                   #    4.596 GHz
         639839262      stalled-cycles-frontend:u  #    1.04% frontend cycles idle
         108018676      stalled-cycles-backend:u   #    0.18% backend cycles idle
      113476168983      instructions:u             #    1.85  insn per cycle
                                                   #    0.01  stalled cycles per insn
       11442665370      branches:u                 #  858.578 M/sec
          44590172      branch-misses:u            #    0.39% of all branches

       1.189768677 seconds time elapsed

      12.628512000 seconds user
       0.620083000 seconds sys</pre>
                          </div>

                        </div>

                        <div><br>

                        </div>

                        <div>On my machine this program is
                          dependency-bound, so the difference in
                          execution time is not as significant as on
                          the test machine, but it can be seen that
                          removing Unsafe results in an increase of over
                          21% in instruction count.</div>
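                        <p>The 21% figure can be checked from the two perf stat runs quoted above. The following is a small sketch of that arithmetic; the class and method names are hypothetical, and the input values are copied from the instructions:u and elapsed-time lines of the two runs.</p>
<pre>
```java
public class OverheadSketch {
    // Relative increase of `value` over `baseline`, as a percentage.
    static double percentIncrease(double baseline, double value) {
        return (value - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // instructions:u from the Unsafe run and the universe segment run.
        long unsafeInsns = 113_476_168_983L;
        long universeInsns = 137_976_098_809L;
        System.out.printf("instructions: +%.1f%%%n",
                percentIncrease(unsafeInsns, universeInsns));

        // "seconds time elapsed" from the same two runs.
        System.out.printf("elapsed time: +%.1f%%%n",
                percentIncrease(1.189768677, 1.240504612));
    }
}
```
</pre>
                        <p>The instruction count grows by roughly 21.6%, while the elapsed time grows by only about 4.3%, which is consistent with the remark that the run is dependency-bound on this machine.</p>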

                        <div><br>

                        </div>

                        <div>Regards,</div>

                        <div>Quan Anh</div>

                      </div>

                      <br>

                      <div class="gmail_quote">

                        <div dir="ltr" class="gmail_attr">On Sat, 13 Jan

                          2024 at 01:29, Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;

                          wrote:<br>

                        </div>

                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

                          On 12/01/2024 17:26, Quân Anh Mai wrote:<br>

                          &gt; FYI, in my submission to 1brc, using Unsafe decreases the execution<br>
                          &gt; time from 3.25s to 2.57s on the test machine.<br>

                          <br>

                          Just curious - what is the difference compared

                          with the everything <br>

                          segment trick?<br>

                          <br>

                          (While I know it can't do on-heap access,

                          perhaps you can tweak the code <br>

                          to be all off-heap?)<br>

                          <br>

                          Maurizio<br>

                          <br>

                        </blockquote>

                      </div>

                    </blockquote>

                  </div>

                </blockquote>

              </div>

            </blockquote>

          </div>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>