RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents
Joe Wang
huizhe.wang at oracle.com
Fri Oct 12 19:16:02 UTC 2018
Hi all,
Here's an update based on all of the great reviews and comments (thanks
all!):
JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
CSR: https://bugs.openjdk.java.net/browse/JDK-8202302
Current version:
specdiff:
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Files.html
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/
Previous version:
specdiff:
http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/Files.html
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/
It's been a while, so here's a summary of what's changed:
Spec: Alan's patch to fix inconsistencies in wording/terminology
Impl: s/mismatchByAttrs and etc/isSameFile;
Stick with the simple implementation (InputStream/Buffer) for
now. Further improvement may be done if there's a need;
The impl is smaller than the previous version, merged the code
into Files, removed FilesMismatch.java;
Test: more readable variables instead of the array of test files
more comments on the testcases
improved the private methods used for generating the test files
Thanks,
Joe
On 9/21/18, 6:49 PM, Joe Wang wrote:
>
> On 9/20/18, 2:46 PM, Stuart Marks wrote:
>>
>>
>> On 9/19/18 11:48 AM, Joe Wang wrote:
>>> After much discussion and 10 iterations of reviews, this proposal
>>> has evolved from what was the original isSameContent method to a
>>> mismatch method. API-wise, a compare method was also considered as
>>> it looked like just a short step forward from mismatch, however, it
>>> was eventually dropped since there is no convincing use case
>>> comparing files lexicographically by contents. Impl-wise, extensive
>>> performance benchmarking has been done to compare a buffered reading
>>> vs memory mapping, the result was that a simple buffered reading
>>> performed better among small files, and those with the mismatched
>>> byte closer to the beginning of files. Since the proposed method's
>>> targeted files are small ones, the impl currently does a buffered
>>> reading only.
>>
>> Hi Joe,
>>
>> Thanks for being persistent with this one!
>
> Thanks for the help, and the "lot of stuff" with clear indicators
> (e.g. line numbers and etc.)! From JDK 11 to 12, hopefully wont be
> deferred to 13 :-)
>>
>> A very small spec nitpick:
>>
>>> specdiff:
>>> http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files.html
>>
>> 1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path,
>> path2)} returns true,</li>
>>
>> I would add "or" after the trailing comma. This makes it similar to
>> the two-item list that follows.
>
> Done.
>>
>>> webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/
>>
>> A couple minor comments on FilesMismatch.java:
>>
>> If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan
>> suggested, this simplifies things considerably. It looks like the
>> mismatch() implementation will be reduced to around ~40 lines of
>> code. Does it deserve its own file, or can it be placed directly into
>> Files.java? That file has over 4,000 lines already though.
>
> I merged the private methods after removing mismatchByAttrs(), that
> reduced the code to 33 lines, and then with the change you suggested
> below, it's further reduced to 29 lines. I'll put them in Files for now.
>
> Files is too big, ~180K, that compares with the next one ~45K among
> files in the nio/file package. Unfortunately, the great majority of
> them are specs, probably 90%.
>>
>> 106 if (nRead1 == 0 || nRead2 == 0) {
>> 107 if (nRead1 == 0 && nRead2 == 0) {
>> 108 // both files reach EOF
>> 109 return -1;
>> 110 } else {
>> 111 // one of the files reaches EOF
>> 112 return totalRead;
>> 113 }
>> 114 } else if (nRead1 != nRead2) {
>>
>> I think it reads slightly better if the nested 'if' at lines 107-113
>> were flattened into the else-if chain, like so:
>>
>> if (nRead1 == 0 && nRead2 == 0) {
>> // both files reached EOF
>> return -1;
>> } else if (nRead1 == 0 || nRead2 == 0) {
>> // one but not both files reached EOF
>> return totalRead;
>> } else if (nRead1 != nRead2) {
>>
>> There are a couple places where lines can be joined. The resulting
>> lines are longer than 80 characters but are still less than 100
>> characters, which I think is OK. (I don't think we have a strict
>> limit of 80 anymore.)
>>
>> 97 private static long mismatch(InputStream in1, InputStream in2)
>> 98 throws IOException {
>>
>> 117 return totalRead +
>> 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0,
>> len);
>
> Done. 95 characters :-)
>
>>
>> Comments on tests in Comparison.java:
>>
>> * This file, and several names within this file, should probably
>> should be renamed to focus on mismatch instead of comparison.
>
> Will rename to "mismatch". Previously, it covers both mismatch and
> compare.
>>
>> * I'm finding it quite difficult to read the test cases.
>>
>> The test files are placed into a Path[] and referenced by index. This
>> makes it quite difficult to determine, for example, if all six cases
>> within the mismatch loop are covered.
>>
>> Instead of storing the files at specific array indexes, I'd suggest
>> doing the following.
>>
>> 1) Establish good naming conventions for test files, including size
>> information, covering duplicate files (e.g., 'a' and 'b'), and a
>> modification at a particular offset. For example, you might do
>> something like the following:
>>
>> test120a - a test file with size 120
>> test120b - a test file with size 120, identical to test120a
>> test120m110 - a test file with size 120, with an altered byte at 110
>> etc.
>
> I can do that. I had hoped the Assertion msg, e.g. "Both files are
> empty, no mismatch" would help. Along with renaming variables, I will
> also add to the Assertion Msg with test case number, e.g. "Case#1:
> both files are empty, return no mismatch".
>>
>> 2) Declare fields of the same name containing the file paths.
>>
>> 3) Add some logic in prepareTestFile() to append each file's path to
>> a list after the file is created. Then have the @AfterClass method
>> run through this list and delete the files.
>
> May also create the test files in a folder and delete it @AfterClass.
>>
>> 4) The test data generator method can then have lines of test data
>> that reference the test files using a meaningful name, so it's easy
>> to understand the names and the expected results. Also, the test data
>> generator should have a comment that describes the test method
>> parameters.
>
> Sure, will add comments and etc.
>>
>> * The generation of the test data files seems overly complicated.
>>
>> A StringBuilder is used to accumulate data, then it's turned into a
>> String, then (optionally) a character is replaced by the insertChar()
>> method, and then the string is written to the file.
>>
>> I think you can just accumulate the test data into a StringBuilder,
>> (optionally) call sb.setCharAt() to modify the data, and then write
>> the SB to the file. After all, Files.writeString() takes a CharSequence.
>>
>> The generateFixedString() method should be changed to create and
>> return a fresh StringBuilder with the desired contents. (See above.)
>> It can also be simplified quite a bit.
>
> Wouldn't it be nice if String has an utility method that generates
> random ASCII strings? ;-) It can be faster with its internal byte array.
>>
>> I don't think you need to have methods with Charset parameters. It's
>> reasonable to have the test files be text, which makes things easier
>> to develop and debug. But the test depends on characters being
>> written in a single-byte charset, because mismatch() reports *byte*
>> positions and not *character* positions.
>>
>> Using the method overloads that are implicitly UTF-8 is incorrect. If
>> a non-single-byte character were to slip into the test data (seems
>> unlikely, but possible), it'll be expanded to multiple bytes by the
>> UTF-8 encoder, which may throw off the result. Instead I'd suggest
>> hardwiring the charset to be US_ASCII to enforce the
>> one-byte-per-character invariant.
>>
>> This is a lot of stuff. If you can't follow it all, perhaps we can
>> work together directly and go over the test file.
>
> I'll improve the test, and US_ASCII it is! I'll probably use a string
> of fixed length to generate the longest string the test needs, and
> then get substrings for other test cases, that may simplify the process.
>
> Thanks,
> Joe
>
>>
>> Thanks,
>>
>> s'marks
More information about the core-libs-dev
mailing list