RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Fri Oct 12 19:16:02 UTC 2018

Hi all,

Here's an update based on all of the great reviews and comments (thanks 
all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version:
specdiff: 
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version:
specdiff: 
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/Files.html
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology
Impl: s/mismatchByAttrs and etc/isSameFile;
          Stick with the simple implementation (InputStream/Buffer) for 
now. Further improvement may be done if there's a need;
          The impl is smaller than the previous version, merged the code 
into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files
          more comments on the testcases
          improved the private methods used for generating the test files

Thanks,
Joe

On 9/21/18, 6:49 PM, Joe Wang wrote:
>
> On 9/20/18, 2:46 PM, Stuart Marks wrote:
>>
>>
>> On 9/19/18 11:48 AM, Joe Wang wrote:
>>> After much discussion and 10 iterations of reviews, this proposal 
>>> has evolved from what was the original isSameContent method to a 
>>> mismatch method. API-wise, a compare method was also considered as 
>>> it looked like just a short step forward from mismatch, however, it 
>>> was eventually dropped since there is no convincing use case 
>>> comparing files lexicographically by contents. Impl-wise, extensive 
>>> performance benchmarking has been done to compare a buffered reading 
>>> vs memory mapping, the result was that a simple buffered reading 
>>> performed better among small files, and those with the mismatched 
>>> byte closer to the beginning of files. Since the proposed method's 
>>> targeted files are small ones, the impl currently does a buffered 
>>> reading only.
>>
>> Hi Joe,
>>
>> Thanks for being persistent with this one!
>
> Thanks for the help, and the "lot of stuff" with clear indicators 
> (e.g. line numbers and etc.)!  From JDK 11 to 12, hopefully wont be 
> deferred to 13 :-)
>>
>> A very small spec nitpick:
>>
>>> specdiff: 
>>> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html
>>
>> 1544      * <li> {@link #isSameFile(Path, Path) isSameFile(path, 
>> path2)} returns true,</li>
>>
>> I would add "or" after the trailing comma. This makes it similar to 
>> the two-item list that follows.
>
> Done.
>>
>>> webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/
>>
>> A couple minor comments on FilesMismatch.java:
>>
>> If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan 
>> suggested, this simplifies things considerably. It looks like the 
>> mismatch() implementation will be reduced to around ~40 lines of 
>> code. Does it deserve its own file, or can it be placed directly into 
>> Files.java? That file has over 4,000 lines already though.
>
> I merged the private methods after removing mismatchByAttrs(), that 
> reduced the code to 33 lines, and then with the change you suggested 
> below, it's further reduced to 29 lines. I'll put them in Files for now.
>
> Files is too big,  ~180K, that compares with the next one ~45K among 
> files in the nio/file package. Unfortunately, the great majority of 
> them are specs, probably 90%.
>>
>>  106             if (nRead1 == 0 || nRead2 == 0) {
>>  107                 if (nRead1 == 0 && nRead2 == 0) {
>>  108                     // both files reach EOF
>>  109                     return -1;
>>  110                 } else {
>>  111                     // one of the files reaches EOF
>>  112                     return totalRead;
>>  113                 }
>>  114             } else if (nRead1 != nRead2) {
>>
>> I think it reads slightly better if the nested 'if' at lines 107-113 
>> were flattened into the else-if chain, like so:
>>
>>             if (nRead1 == 0 && nRead2 == 0) {
>>                 // both files reached EOF
>>                 return -1;
>>             } else if (nRead1 == 0 || nRead2 == 0) {
>>                 // one but not both files reached EOF
>>                 return totalRead;
>>             } else if (nRead1 != nRead2) {
>>
>> There are a couple places where lines can be joined. The resulting 
>> lines are longer than 80 characters but are still less than 100 
>> characters, which I think is OK. (I don't think we have a strict 
>> limit of 80 anymore.)
>>
>>   97     private static long mismatch(InputStream in1, InputStream in2)
>>   98         throws IOException {
>>
>>  117                 return totalRead +
>>  118                     Arrays.mismatch(buffer1, 0, len, buffer2, 0, 
>> len);
>
> Done. 95 characters :-)
>
>>
>> Comments on tests in Comparison.java:
>>
>> * This file, and several names within this file, should probably 
>> should be renamed to focus on mismatch instead of comparison.
>
> Will rename to "mismatch". Previously, it covers both mismatch and 
> compare.
>>
>> * I'm finding it quite difficult to read the test cases.
>>
>> The test files are placed into a Path[] and referenced by index. This 
>> makes it quite difficult to determine, for example, if all six cases 
>> within the mismatch loop are covered.
>>
>> Instead of storing the files at specific array indexes, I'd suggest 
>> doing the following.
>>
>> 1) Establish good naming conventions for test files, including size 
>> information, covering duplicate files (e.g., 'a' and 'b'), and a 
>> modification at a particular offset. For example, you might do 
>> something like the following:
>>
>>     test120a    - a test file with size 120
>>     test120b    - a test file with size 120, identical to test120a
>>     test120m110 - a test file with size 120, with an altered byte at 110
>>     etc.
>
> I can do that. I had hoped the Assertion msg, e.g. "Both files are 
> empty, no mismatch" would help. Along with renaming variables, I will 
> also add to the Assertion Msg with test case number, e.g. "Case#1: 
> both files are empty, return no mismatch".
>>
>> 2) Declare fields of the same name containing the file paths.
>>
>> 3) Add some logic in prepareTestFile() to append each file's path to 
>> a list after the file is created. Then have the @AfterClass method 
>> run through this list and delete the files.
>
> May also create the test files in a folder and delete it @AfterClass.
>>
>> 4) The test data generator method can then have lines of test data 
>> that reference the test files using a meaningful name, so it's easy 
>> to understand the names and the expected results. Also, the test data 
>> generator should have a comment that describes the test method 
>> parameters.
>
> Sure, will add comments and etc.
>>
>> * The generation of the test data files seems overly complicated.
>>
>> A StringBuilder is used to accumulate data, then it's turned into a 
>> String, then (optionally) a character is replaced by the insertChar() 
>> method, and then the string is written to the file.
>>
>> I think you can just accumulate the test data into a StringBuilder, 
>> (optionally) call sb.setCharAt() to modify the data, and then write 
>> the SB to the file. After all, Files.writeString() takes a CharSequence.
>>
>> The generateFixedString() method should be changed to create and 
>> return a fresh StringBuilder with the desired contents. (See above.) 
>> It can also be simplified quite a bit.
>
> Wouldn't it be nice if String has an utility method that generates 
> random ASCII strings? ;-)  It can be faster with its internal byte array.
>>
>> I don't think you need to have methods with Charset parameters. It's 
>> reasonable to have the test files be text, which makes things easier 
>> to develop and debug. But the test depends on characters being 
>> written in a single-byte charset, because mismatch() reports *byte* 
>> positions and not *character* positions.
>>
>> Using the method overloads that are implicitly UTF-8 is incorrect. If 
>> a non-single-byte character were to slip into the test data (seems 
>> unlikely, but possible), it'll be expanded to multiple bytes by the 
>> UTF-8 encoder, which may throw off the result. Instead I'd suggest 
>> hardwiring the charset to be US_ASCII to enforce the 
>> one-byte-per-character invariant.
>>
>> This is a lot of stuff. If you can't follow it all, perhaps we can 
>> work together directly and go over the test file.
>
> I'll improve the test, and US_ASCII it is! I'll probably use a string 
> of fixed length to generate the longest string the test needs, and 
> then get substrings for other test cases, that may simplify the process.
>
> Thanks,
> Joe
>
>>
>> Thanks,
>>
>> s'marks