<div dir="ltr"><div>Sure, I just thought that looking at the instruction count would be more helpful, since each machine exhibits different performance behaviour. For example, my machine is dependency-bound when going from [2] to [1] below, which leads to a much smaller difference in execution time than the one measured on other machines (such as the test machine). The third implementation is similar to the first one, except that it uses safe accesses in the form of bounded memory segment accesses and varhandles.</div><div><br></div><div>The JMH numbers for these versions are shown below. The benchmark method is defined as:</div><div><br></div><div>    @Benchmark<br>    public PoorManMap execute() throws IOException {<br>        try (var file = FileChannel.open(Path.of(FILE), StandardOpenOption.READ);<br>             var arena = Arena.ofShared()) {<br>            var data = file.map(MapMode.READ_ONLY, 0, file.size(), arena);<br>            return processFile(data, 0, data.byteSize());<br>        }<br>    }<br></div><div><br></div><div>    CalculateAverage_merykitty.execute      avgt    5  7.422 ± 0.093  ms/op // unsafe [1]<br>    CalculateAverage_merykitty.execute      avgt    5  7.686 ± 0.181  ms/op // universe segment [2]<br>    CalculateAverage_merykitty.execute      avgt    5  9.009 ± 0.058  ms/op // varhandle [3]<br></div><div><br></div><div>[1]: <a href="https://github.com/merykitty/1brc/tree/main">https://github.com/merykitty/1brc/tree/main</a></div><div>[2]: <a href="https://github.com/merykitty/1brc/tree/removeunsafe">https://github.com/merykitty/1brc/tree/removeunsafe</a></div><div>[3]: <a href="https://github.com/merykitty/1brc/tree/varhandles">https://github.com/merykitty/1brc/tree/varhandles</a></div><div><br></div><div>Best regards,</div><div>Quan Anh</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 16 Jan 2024 at 00:29, Maurizio Cimadamore <<a 
href="mailto:maurizio.cimadamore@oracle.com">maurizio.cimadamore@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>

  
  <div>
    <p><br>
    </p>
    <div>On 15/01/2024 15:44, Quân Anh Mai
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">Running the same program on 1e6 lines results in
        only 9e9 instructions, so I think the vast majority of the
        instruction count comes from the compiled code. Not using the
        universe segment is roughly equivalent to my previous version,
        which would result in around 50% more instructions compared to
        using one, and almost double the instruction count of using
        Unsafe.</div>
    </blockquote>
    <p>Without looking at the program some more, it's hard for me to
      make sense of these numbers. I'm surprised that you don't see
      any difference when using an unbounded segment compared to regular
      ones. I wonder if the gap you are seeing is due to the JVM warming
      up, rather than peak performance being worse. Have you tried
      measuring peak performance with e.g. JMH? I would not expect to
      see a 20% difference there...<br>
    </p>
    <p>Maurizio<br>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div><br>
        </div>
        <div>Regards,</div>
        <div>Quan Anh</div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, 15 Jan 2024 at 23:09,
          Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>I think the increased instruction count is normal, as C2
              had to do more work to optimize the bound checks away?</p>
            <p>Is there any difference compared to the version that
              doesn't use the universe segment?</p>
            <p>Maurizio<br>
            </p>
            <div>On 15/01/2024 13:52, Quân Anh Mai wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>Hi,</div>
                <div><br>
                </div>
                <div>I have tried using a universe segment instead of
                  Unsafe, and storing the custom hashmap buffer
                  off-heap instead of in a byte array. The output of
                  perf stat on the program is:</div>
                <div><br>
                </div>
                 Performance counter stats for 'sh
                calculate_average_merykittyunsafe.sh':<br>
                <br>
                          13573.70 msec task-clock:u              #  
                10.942 CPUs utilized<br>
                                 0      context-switches:u        #  
                 0.000 /sec<br>
                                 0      cpu-migrations:u          #  
                 0.000 /sec<br>
                            238460      page-faults:u             #  
                17.568 K/sec<br>
                       61995179870      cycles:u                  #  
                 4.567 GHz<br>
                         261830581      stalled-cycles-frontend:u #  
                 0.42% frontend cycles idle<br>
                          93823680      stalled-cycles-backend:u  #  
                 0.15% backend cycles idle<br>
                      137976098809      instructions:u            #  
                 2.23  insn per cycle<br>
                                                                  #  
                 0.00  stalled cycles per insn<br>
                       18373313803      branches:u                #  
                 1.354 G/sec<br>
                          43579782      branch-misses:u           #  
                 0.24% of all branches<br>
                <br>
                       1.240504612 seconds time elapsed<br>
                <br>
                      12.841563000 seconds user<br>
                       0.652428000 seconds sys
                <div><br>
                </div>
                <div>For comparison, this is the unsafe version:<br>
                  <div><br>
                  </div>
                  <div> Performance counter stats for 'sh
                    calculate_average_merykittyunsafe.sh':<br>
                    <br>
                              13327.46 msec task-clock:u              #
                      11.202 CPUs utilized<br>
                                     0      context-switches:u        #
                       0.000 /sec<br>
                                     0      cpu-migrations:u          #
                       0.000 /sec<br>
                                269896      page-faults:u             #
                      20.251 K/sec<br>
                           61258348752      cycles:u                  #
                       4.596 GHz<br>
                             639839262      stalled-cycles-frontend:u #
                       1.04% frontend cycles idle<br>
                             108018676      stalled-cycles-backend:u  #
                       0.18% backend cycles idle<br>
                          113476168983      instructions:u            #
                       1.85  insn per cycle<br>
                                                                      #
                       0.01  stalled cycles per insn<br>
                           11442665370      branches:u                #
                     858.578 M/sec<br>
                              44590172      branch-misses:u           #
                       0.39% of all branches<br>
                    <br>
                           1.189768677 seconds time elapsed<br>
                    <br>
                          12.628512000 seconds user<br>
                           0.620083000 seconds sys<br>
                  </div>
                </div>
                <div><br>
                </div>
                <div>On my machine this program is dependency-bound, so
                  the difference in execution time is not as significant
                  as on the test machine, but removing Unsafe still
                  results in an increase of over 21% in instruction
                  count.</div>
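                <div><br>
                </div>
                <div>The comparison can be reproduced from the two perf
                  stat listings. The sketch below is only an
                  illustration: the class and method names are
                  hypothetical, and it assumes the first listing is the
                  universe-segment run and the second is the Unsafe run.
                  It prints the relative increase in instructions:u and
                  in cycles:u, which shows the instruction count growing
                  far more than the cycle count.</div>

```java
public class InstructionCountDelta {
    // Relative increase of `other` over `baseline`, in percent.
    static double percentIncrease(long baseline, long other) {
        return 100.0 * (other - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Counter values copied from the two perf stat listings.
        // Assumption: the first listing is the universe-segment run.
        long unsafeInstructions = 113_476_168_983L;
        long segmentInstructions = 137_976_098_809L;
        long unsafeCycles = 61_258_348_752L;
        long segmentCycles = 61_995_179_870L;

        System.out.printf("instructions: +%.1f%%%n",
                percentIncrease(unsafeInstructions, segmentInstructions));
        System.out.printf("cycles:       +%.1f%%%n",
                percentIncrease(unsafeCycles, segmentCycles));
    }
}
```

                <div>With these inputs the instruction count rises by
                  roughly 21.6% while the cycle count rises by roughly
                  1.2%.</div>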
                <div><br>
                </div>
                <div>Regards,</div>
                <div>Quan Anh</div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Sat, 13 Jan 2024 at
                  01:29, Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
                  On 12/01/2024 17:26, Quân Anh Mai wrote:<br>
                  > FYI, in my submission to 1brc, using Unsafe
                  decreases the execution <br>
                  > time from 3.25s to 2.57s on the test machine.<br>
                  <br>
                  Just curious - what is the difference compared with
                  the everything <br>
                  segment trick?<br>
                  <br>
                  (While I know it can't do on-heap access, perhaps you
                  can tweak the code <br>
                  to be all off-heap?)<br>
                  <br>
                  Maurizio<br>
                  <br>
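                  <br>
                  For readers unfamiliar with the "everything segment"
                  idea, here is a minimal sketch. It is an assumption
                  about what the trick looks like, not code taken from
                  any of the submissions: a zero-based native segment is
                  reinterpreted to span the whole address space, so raw
                  addresses can be used as offsets into it. The class
                  and method names are hypothetical, and the sketch
                  assumes a JDK where java.lang.foreign is final
                  (reinterpret is a restricted method and may print a
                  warning).<br>
                  <br>

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class EverythingSegmentSketch {
    // Zero-based native segment covering the whole address space.
    // reinterpret() is a restricted method: the JVM trusts the new size.
    static final MemorySegment EVERYTHING =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    // Reads one byte at a raw native address, Unsafe-style.
    // The bounds check still runs, but only against Long.MAX_VALUE.
    static byte readByte(long address) {
        return EVERYTHING.get(ValueLayout.JAVA_BYTE, address);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Off-heap buffer; on-heap arrays cannot be reached this way.
            MemorySegment buffer = arena.allocate(16);
            buffer.set(ValueLayout.JAVA_BYTE, 3, (byte) 42);

            long base = buffer.address();
            System.out.println(readByte(base + 3));
        }
    }
}
```

                  <br>
                  Because the segment is native, this only works for
                  off-heap memory, which is why the data has to be moved
                  off-heap first.<br>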
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </div>

</blockquote></div>