[foreign-memaccess] Add direct access (DAX) support to MappedMemorySegment
Marcel Käufler
marcel.kaeufler at hhu.de
Tue Apr 6 13:46:20 UTC 2021
Hi Maurizio,
you're right, performance-wise slicing doesn't seem to be a problem, as
it apparently gets optimized quite well:
Benchmark                                               Mode  Cnt         Score        Error   Units
SliceBenchmark.measureIndexedForce                     thrpt    5  10478563.869 ± 115046.765   ops/s
SliceBenchmark.measureIndexedForce:·gc.alloc.rate      thrpt    5        ≈ 10⁻⁵               MB/sec
SliceBenchmark.measureIndexedForce:·gc.alloc.rate.norm thrpt    5        ≈ 10⁻⁶                 B/op
SliceBenchmark.measureIndexedForce:·gc.count           thrpt    5           ≈ 0               counts
SliceBenchmark.measureSlicedForce                      thrpt    5  10670895.207 ±  19753.296   ops/s
SliceBenchmark.measureSlicedForce:·gc.alloc.rate       thrpt    5        ≈ 10⁻⁵               MB/sec
SliceBenchmark.measureSlicedForce:·gc.alloc.rate.norm  thrpt    5        ≈ 10⁻⁶                 B/op
SliceBenchmark.measureSlicedForce:·gc.count            thrpt    5           ≈ 0               counts
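For reference, the two benchmark methods might look roughly like this
(a sketch; measureIndexedForce assumes the force(segment, offset,
length) overload proposed below, and the mapping setup is omitted):

import jdk.incubator.foreign.MappedMemorySegments;
import jdk.incubator.foreign.MemorySegment;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class SliceBenchmark {

    MemorySegment segment; // mapped with READ_WRITE_SYNC in @Setup (omitted)

    @Benchmark
    public void measureSlicedForce() {
        // allocates a throwaway slice on every invocation
        MappedMemorySegments.force(segment.asSlice(0, 64));
    }

    @Benchmark
    public void measureIndexedForce() {
        // assumes the proposed force(segment, offset, length) overload
        MappedMemorySegments.force(segment, 0, 64);
    }
}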
The only API-wise issue I see here is that it might not be obvious that
forcing without slicing first can have serious performance implications,
and at first glance it looks like there is no way to force changes
precisely, as there is with MappedByteBuffers.
This is also less of a problem with traditionally memory-mapped files,
where the file system keeps track of the dirty pages and msync flushes
only those. Only with DAX does force() take time linear in the segment
size, independent of how many cache lines are actually dirty.
But I also see that extending the API might not be necessary if one is
aware of the MemorySegment philosophy: "Want to do something only on a
part of a segment -> slice it!"
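Following that philosophy, a targeted flush is just (a sketch; offset
and length name the dirty range):

MappedMemorySegments.force(segment.asSlice(offset, length));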
Best Regards
Marcel
On 06.04.21 12:04, Maurizio Cimadamore wrote:
> Hi Marcel, replies inline
>
> On 03/04/2021 22:38, Marcel Käufler wrote:
>> Hi all,
>>
>> I'm currently working with the Foreign Memory Access API and
>> (emulated) non-volatile RAM. With JDK 14, support for non-volatile
>> memory was added to MappedByteBuffer by mapping with
>> ExtendedMapMode.READ_ONLY_SYNC or ExtendedMapMode.READ_WRITE_SYNC.
>> Calling force() on such a MappedByteBuffer then just flushes CPU
>> caches instead of invoking msync, and reads bypass the page cache as
>> well.
>>
>> MappedMemorySegment already builds on the same logic and would be
>> NVM-aware, but unfortunately mapping with an ExtendedMapMode is
>> currently not supported. The only way to map a MemorySegment in sync
>> mode is to first map a ByteBuffer and then use
>> MemorySegment.ofByteBuffer(), which of course comes with some
>> limitations.
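>>
>> That workaround looks roughly like this (a sketch; path and size are
>> placeholders):
>>
>> try (FileChannel fc = FileChannel.open(path,
>>         StandardOpenOption.READ, StandardOpenOption.WRITE)) {
>>     MappedByteBuffer bb = fc.map(ExtendedMapMode.READ_WRITE_SYNC, 0, size);
>>     MemorySegment segment = MemorySegment.ofByteBuffer(bb);
>>     // works, but the segment is capped at the int-sized buffer and
>>     // its lifecycle is tied to the buffer
>> }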
>>
>> From my observation, the only issue is the openOptions() method in
>> MappedMemorySegmentImpl, which does not consider the two SYNC modes.
>> After adding the modes to the respective conditions, I was able to
>> call `MemorySegment.mapFile(path, offset, size,
>> ExtendedMapMode.READ_WRITE_SYNC)` and it worked just as expected.
>>
>>
>> private static OpenOption[] openOptions(FileChannel.MapMode mapMode) {
>>     if (mapMode == FileChannel.MapMode.READ_ONLY ||
>>             mapMode == ExtendedMapMode.READ_ONLY_SYNC) {
>>         return new OpenOption[] { StandardOpenOption.READ };
>>     } else if (mapMode == FileChannel.MapMode.READ_WRITE ||
>>             mapMode == FileChannel.MapMode.PRIVATE ||
>>             mapMode == ExtendedMapMode.READ_WRITE_SYNC) {
>>         return new OpenOption[] { StandardOpenOption.READ,
>>                                   StandardOpenOption.WRITE };
>>     } else {
>>         throw new UnsupportedOperationException("Unsupported map mode: " + mapMode);
>>     }
>> }
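>>
>> With that change, a sync mapping works directly (a sketch of the call
>> I tested; path and size are placeholders):
>>
>> try (MemorySegment segment = MemorySegment.mapFile(path, 0L, size,
>>         ExtendedMapMode.READ_WRITE_SYNC)) {
>>     MemoryAccess.setLongAtOffset(segment, 0, 42L); // write to NVM
>>     MappedMemorySegments.force(segment);           // flush cache lines
>> }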
>>
>> Is there anything against adding this?
>
> I agree there seems to be something odd here... this code was meant to
> replicate what was there in FileChannelImpl, but apparently something
> is amiss and the ExtendedMapMode values have been left out.
>
> This should be fixed.
>
>>
>>
>> Additionally, MappedByteBuffer offers a `force(int index, int length)`
>> method, whereas for mapped segments there is only
>> `MappedMemorySegments.force(memorySegment)`.
>> In DAX mode the latter is horribly slow because it iterates over the
>> whole segment in 64-byte steps to evict cache lines. A targeted force
>> can already be accomplished by slicing first and calling force on the
>> slice, but when working on NVM and frequently flushing cache lines,
>> this creates a lot of throwaway MemorySegments for the GC to collect.
>> Admittedly, this overhead is probably negligible compared to the NVM
>> write itself, but a method taking an offset and length would be nice
>> to match the MappedByteBuffer API.
>>
>> Everything needed is also already present and it would be easy to add
>> a `force(MemorySegment segment, long offset, long length)`:
>>
>> In MappedMemorySegments:
>>
>> public static void force(MemorySegment segment, long offset, long length) {
>>     toMappedSegment(segment).force(offset, length);
>> }
>>
>> In MappedMemorySegmentImpl:
>>
>> public void force(long offset, long length) {
>>     // checkBounds from AbstractMemorySegmentImpl could be reused if made
>>     // protected (its out-of-bounds message mentioning "new offset" and
>>     // "new length" doesn't fit exactly here, though)
>>     checkBounds(offset, length);
>>     SCOPED_MEMORY_ACCESS.force(scope, unmapper.fileDescriptor(),
>>             min, unmapper.isSync(), offset, length);
>> }
>>
>> Thoughts on this?
>
> As discussed in other related topics [1], while I've nothing against
> the proposed method, do you have any benchmark showing that there is
> additional GC pressure, or slower throughput when using
>
> force(segment.asSlice(offset, length)) ?
>
> The reason I'm asking is that the API already has a way to create
> slices out of a segment, which supports all the possible overloads
> that a user might want (note that there are _four_ versions of
> asSlice). It would be sad to replicate all of that in
> MappedMemorySegments, because what you are looking for here is,
> essentially, a slicing mechanism. Note also that, when Valhalla
> arrives, the cost of creating slices should go down regardless of C2
> optimizations - so I'm wary of adding what looks like an "interim"
> API.
>
> Of course, if benchmarks show that slice creation is a problem in this
> case, I have no issue adding an escape hatch for the time being.
>
> (I suggest creating a JMH benchmark and then profiling with the JMH
> option "-prof gc", which shows the allocation rate.)
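>
> For example (a typical JMH invocation; the jar name is just a
> placeholder):
>
> java -jar target/benchmarks.jar SliceBenchmark -prof gc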
>
> Maurizio
>
> [1] -
> https://mail.openjdk.java.net/pipermail/panama-dev/2021-April/012897.html
>
>>
>>
>> Best Regards
>> Marcel