[foreign-memaccess] Add direct access (DAX) support to MappedMemorySegment
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Apr 6 13:59:56 UTC 2021
On 06/04/2021 14:46, Marcel Käufler wrote:
> Hi Maurizio,
>
> you're right, performance-wise slicing doesn't seem to be a problem as
> it apparently gets optimized pretty well.
>
>
> Benchmark                                                Mode  Cnt         Score         Error   Units
> SliceBenchmark.measureIndexedForce                      thrpt    5  10478563.869 ±  115046.765   ops/s
> SliceBenchmark.measureIndexedForce:·gc.alloc.rate       thrpt    5        ≈ 10⁻⁵                 MB/sec
> SliceBenchmark.measureIndexedForce:·gc.alloc.rate.norm  thrpt    5        ≈ 10⁻⁶                 B/op
> SliceBenchmark.measureIndexedForce:·gc.count            thrpt    5           ≈ 0                 counts
> SliceBenchmark.measureSlicedForce                       thrpt    5  10670895.207 ±   19753.296   ops/s
> SliceBenchmark.measureSlicedForce:·gc.alloc.rate        thrpt    5        ≈ 10⁻⁵                 MB/sec
> SliceBenchmark.measureSlicedForce:·gc.alloc.rate.norm   thrpt    5        ≈ 10⁻⁶                 B/op
> SliceBenchmark.measureSlicedForce:·gc.count             thrpt    5           ≈ 0                 counts
>
>
> The only API-wise issue I see here is that it might not be obvious
> that forcing without a slice can have serious performance
> implications, and at first glance it looks like there's no way of
> precisely forcing changes like there is with MappedByteBuffers.
> This is also less of a problem with traditionally memory-mapped
> files, as the file system keeps track of the dirty pages and only
> flushes those on msync. Only when using DAX does force() have a
> runtime linear in the segment size, independent of the dirty cache
> lines.
> But I also see that extending the API might not be necessary if one is
> aware of the MemorySegment philosophy: "Want to do something only on a
> part of a segment -> slice it!".
I understand what you're saying. Perhaps we might consider adding
something to the javadoc of MappedMemorySegments::force? E.g. "if
working with big NVM mapped files, please slice" ? :-)
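For instance, a targeted flush via slicing could be shown right in the
javadoc, something like this (just a sketch; segment, offset and length
are placeholders):

    // flush only the region that was actually written, by slicing first,
    // instead of forcing the whole (possibly DAX-mapped) segment
    MappedMemorySegments.force(segment.asSlice(offset, length));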
Cheers
Maurizio
>
>
> Best Regards
> Marcel
>
>
> On 06.04.21 12:04, Maurizio Cimadamore wrote:
>> Hi Marcel, replies inline
>>
>> On 03/04/2021 22:38, Marcel Käufler wrote:
>>> Hi all,
>>>
>>> I'm currently working with the Foreign Memory Access API and
>>> (emulated) non-volatile RAM. With JDK 14, support for non-volatile
>>> memory was added to MappedByteBuffers by mapping with
>>> ExtendedMapMode.READ_ONLY_SYNC or ExtendedMapMode.READ_WRITE_SYNC.
>>> Calling force() on such a MappedByteBuffer just flushes CPU caches
>>> instead of invoking msync, and reads bypass the page cache.
>>>
>>> MappedMemorySegment already builds on the same logic and would be
>>> NVM-aware, but unfortunately mapping with an ExtendedMapMode is
>>> currently not supported. The only way to map a MemorySegment in
>>> sync mode is to first map a ByteBuffer and then use
>>> MemorySegment.ofByteBuffer(), which of course comes with some
>>> limitations.
>>>
>>> From my observation, the only issue is the openOptions() method in
>>> MappedMemorySegmentImpl, which does not consider the two SYNC
>>> modes. After adding the modes to the respective conditions I was
>>> able to call `MemorySegment.mapFile(path, offset, size,
>>> ExtendedMapMode.READ_WRITE_SYNC)` and it worked just as expected.
>>>
>>>
>>>     private static OpenOption[] openOptions(FileChannel.MapMode mapMode) {
>>>         if (mapMode == FileChannel.MapMode.READ_ONLY ||
>>>                 mapMode == ExtendedMapMode.READ_ONLY_SYNC) {
>>>             return new OpenOption[] { StandardOpenOption.READ };
>>>         } else if (mapMode == FileChannel.MapMode.READ_WRITE ||
>>>                 mapMode == FileChannel.MapMode.PRIVATE ||
>>>                 mapMode == ExtendedMapMode.READ_WRITE_SYNC) {
>>>             return new OpenOption[] { StandardOpenOption.READ,
>>>                     StandardOpenOption.WRITE };
>>>         } else {
>>>             throw new UnsupportedOperationException("Unsupported map mode: " + mapMode);
>>>         }
>>>     }
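>>>
>>> For reference, mapping in sync mode then works directly, e.g.
>>> (sketch; the pmem path is made up):
>>>
>>>     import java.nio.file.Path;
>>>     import jdk.incubator.foreign.MemorySegment;
>>>     import jdk.nio.mapmode.ExtendedMapMode;
>>>
>>>     // map one MiB of an (emulated) pmem-backed file in sync mode
>>>     MemorySegment segment = MemorySegment.mapFile(
>>>             Path.of("/mnt/pmem/data"), 0L, 1 << 20,
>>>             ExtendedMapMode.READ_WRITE_SYNC);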
>>>
>>> Is there anything against adding this?
>>
>> I agree there seems to be something odd here... this code was meant
>> to replicate what was there in FileChannelImpl, but apparently
>> something is amiss and the ExtendedMapMode values have been left out.
>>
>> This should be fixed.
>>
>>>
>>>
>>> Additionally, MappedByteBuffer offers a `force(int index, int
>>> length)` method, whereas for MappedMemorySegments there's only
>>> `MappedMemorySegments.force(memorySegment)`.
>>> In DAX mode the latter is horribly slow because it iterates over the
>>> whole segment in 64-byte steps to evict cache lines. A targeted
>>> force can already be accomplished by slicing first and calling force
>>> on the slice. When working on NVM and frequently flushing cache
>>> lines, this creates a lot of throwaway MemorySegments for the GC to
>>> collect. Admittedly, this overhead is probably negligible compared
>>> to the NVM write, but a method with offset and length would be nice
>>> to match the MappedByteBuffer API.
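>>>
>>> For illustration, a fine-grained flush with the current API looks
>>> like this (sketch; first and last are placeholders, and each
>>> iteration allocates a short-lived slice):
>>>
>>>     for (long off = first; off < last; off += 64) {
>>>         // flush one 64-byte cache line; the slice becomes garbage
>>>         MappedMemorySegments.force(segment.asSlice(off, 64));
>>>     }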
>>>
>>> Everything needed is also already present and it would be easy to
>>> add a `force(MemorySegment segment, long offset, long length)`:
>>>
>>> In MappedMemorySegments:
>>>
>>>     public static void force(MemorySegment segment, long offset, long length) {
>>>         toMappedSegment(segment).force(offset, length);
>>>     }
>>>
>>> In MappedMemorySegmentImpl:
>>>
>>>     public void force(long offset, long length) {
>>>         // checkBounds can be used from AbstractMemorySegmentImpl if
>>>         // made protected (the out-of-bounds message with "new offset"
>>>         // and "new length" doesn't fit exactly, though)
>>>         checkBounds(offset, length);
>>>         SCOPED_MEMORY_ACCESS.force(scope, unmapper.fileDescriptor(),
>>>                 min, unmapper.isSync(), offset, length);
>>>     }
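>>>
>>> A targeted flush would then simply be:
>>>
>>>     MappedMemorySegments.force(segment, offset, length);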
>>>
>>> Thoughts on this?
>>
>> As discussed in other related topics [1], while I've nothing against
>> the proposed method, do you have any benchmark showing that there is
>> additional GC pressure, or slower throughput when using
>>
>> force(segment.asSlice(offset, length)) ?
>>
>> The reason I'm asking is that the API already has a way to create
>> slices out of a segment, which supports all the possible overloads
>> that users might want to use (note that there are _four_ versions of
>> asSlice). It would be sad to replicate all that into
>> MappedMemorySegment, because what you are looking for here is,
>> essentially, a slicing mechanism. Note also that, when Valhalla
>> comes, the cost of creating slices should go down regardless of C2
>> optimizations - so I'm wary here of adding what looks like an
>> "interim" API.
>>
>> Of course, if benchmarks show that slice creation is a problem in
>> this case, I have no issue adding an escape hatch for the time being.
>>
>> (I suggest creating a JMH benchmark and then profiling with the JMH
>> option "-prof gc" which shows allocation rate).
>>
>> Maurizio
>>
>> [1] -
>> https://mail.openjdk.java.net/pipermail/panama-dev/2021-April/012897.html
>>
>>>
>>>
>>> Best Regards
>>> Marcel
>