RFR(JDK11/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Paul Sandoz paul.sandoz at oracle.com
Tue May 1 00:13:46 UTC 2018



> On Apr 30, 2018, at 4:47 PM, Joe Wang <huizhe.wang at oracle.com> wrote:
>> 
>>>> 
>> It’s tempting (well, to me at least) to generalize to a mismatch method (like for arrays) returning the mismatching location in bytes; then you can determine if one file is a prefix of another given the file sizes. Bounds-accepting methods would also be useful to mismatch on partial content (including within the same file). If you use memory-mapped files we can use direct byte buffers to efficiently perform the mismatch.
> 
> Are there real-life use cases?  It may be useful for example to check if the files have the same header.
> 

Yes, something like that. I was just searching for a more general abstraction, e.g. mismatch, that can support equality and lexicographical comparison of file contents. Other use cases tend to pop out almost for free because of that :-) However, it's possible to support the more advanced cases directly with mapped byte buffers.
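
As a rough illustration of the mapped byte buffer route, here is a minimal sketch. The class name MappedMismatch and the limit parameter are hypothetical and not part of any proposed API; it relies on ByteBuffer.mismatch (available since JDK 11), and mapping offsets and sizes are kept within int range for simplicity.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not a JDK API.
final class MappedMismatch {
    private MappedMismatch() {}

    /**
     * Compares at most the first 'limit' bytes of two files.
     * Returns the index of the first differing byte, the length of the
     * shorter compared region if one region is a proper prefix of the
     * other, or -1 if the compared regions are identical.
     */
    static int mismatch(Path a, Path b, int limit) throws IOException {
        if (limit < 0) {
            throw new IllegalArgumentException("limit < 0");
        }
        try (FileChannel ca = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel cb = FileChannel.open(b, StandardOpenOption.READ)) {
            // Map only what is needed; a read-only mapping cannot extend past the file size.
            MappedByteBuffer ma =
                ca.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(ca.size(), limit));
            MappedByteBuffer mb =
                cb.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(cb.size(), limit));
            return ma.mismatch(mb);
        }
    }
}
```

With a small limit this covers the "same header" check; with a limit at least as large as both files, it reports whether one file is a prefix of the other.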

The good news is you can add isSameContent now and, if there is demand for mismatch, add that later, deriving the implementation of isSameContent from the new method.
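
To make the derivation concrete, here is a sketch of how isSameContent (and a prefix check) could fall out of a mismatch method. The class FileContents and all signatures are assumptions for illustration, not the proposed API, and the stream-based loop favors clarity over speed.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; names and signatures are assumptions.
final class FileContents {
    private FileContents() {}

    /**
     * Returns the position of the first differing byte, the length of the
     * shorter file if it is a proper prefix of the longer one, or -1 if
     * both files have the same content.
     */
    static long mismatch(Path a, Path b) throws IOException {
        try (InputStream ia = new BufferedInputStream(Files.newInputStream(a));
             InputStream ib = new BufferedInputStream(Files.newInputStream(b))) {
            long pos = 0;
            while (true) {
                int x = ia.read();
                int y = ib.read();
                if (x != y) {
                    return pos; // differing byte, or one stream ended first
                }
                if (x == -1) {
                    return -1;  // both streams ended at the same position
                }
                pos++;
            }
        }
    }

    /** Equality is just "no mismatch". */
    static boolean isSameContent(Path a, Path b) throws IOException {
        return mismatch(a, b) == -1;
    }

    /** True if 'prefix' is a (not necessarily proper) prefix of 'file'. */
    static boolean isPrefix(Path prefix, Path file) throws IOException {
        long m = mismatch(prefix, file);
        return m == -1 || m == Files.size(prefix);
    }
}
```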

Paul.

> We did a bit of use-case study where we compared a bunch of possible options, including reading a string with a bound, or by specifying patterns, and/or reading into a list with a regex/pattern as separator (vs. the default line separator). We concluded that readString is in popular demand, and it's usually a quick read of small files, e.g. a config file, a SQL query file, etc. The methods fulfill the process of String <==> File transformation, a direct and quick way of converting a String to a File and vice versa.
> 
> The demand for isSameContent isn't necessarily as high as for readString, but there were still some real use cases where people asked how to do it quickly. When we have String <==> File, it's natural to at least have a comparison method since String.equals is essential to it. Plus, we already had isSameFile.
> 
> Best,
> Joe
> 
>> 
>> To Remi’s point, this might dissuade/guide developers from using this method when there are other more efficient techniques available when operating at larger scales. However, it is unfortunately harder than it should be in Java to hash the contents of a file, a byte[] or ByteBuffer, according to some chosen algorithm (or a good default).
>> 
>> Paul.
> 
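
For reference on the hashing remark quoted above, this is roughly the boilerplate needed today to hash a file's contents with MessageDigest. The class name FileHash and the choice of SHA-256 are assumptions made for this sketch only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not a JDK API.
final class FileHash {
    private FileHash() {}

    /** Streams the file through a SHA-256 digest and returns the 32-byte hash. */
    static byte[] sha256(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new IllegalStateException(e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // feed only the bytes actually read
            }
        }
        return md.digest();
    }
}
```

Two files can then be compared with Arrays.equals on their digests, which pays off when the same file is checked against many others.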



More information about the nio-dev mailing list