RFR(JDK12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

newer
[12] RFR of JDK-8213576: Make test...

Joe Wang

19 Sep 2018 19 Sep '18

6:48 p.m.

Hi, After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only. Please review. JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files... webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/ Thanks, Joe

Show replies by date

Alan Bateman

19 Sep 19 Sep

8:14 p.m.

On 19/09/2018 19:48, Joe Wang wrote:

...

Hi,

After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/ Starting out using buffered I/O is probably okay but I assume we will want to change this in the future to having it use memory mapped I/O beyond a certain threshold.

Can you explain the use of toRealPath and comparing the names? That shouldn't be needed. Also the catching of ProviderMismatchExcepiton seems a bit strange too. Can you replace mismatchByAttrs with isSameFile? You could call this from Files.mismatch and then use the supporting implementation for the case that the files are not the same. -Alan

Brian Burkhalter

8:44 p.m.

On Sep 19, 2018, at 1:14 PM, Alan Bateman <Alan.Bateman@oracle.com> wrote:

...

Starting out using buffered I/O is probably okay but I assume we will want to change this in the future to having it use memory mapped I/O beyond a certain threshold.

For the buffered case, could there be any performance advantage to using direct buffers with FileChannel and comparing running checksums of the two sources? Brian

Brian Burkhalter

8:52 p.m.

On Sep 19, 2018, at 1:44 PM, Brian Burkhalter <brian.burkhalter@oracle.com> wrote:

...

On Sep 19, 2018, at 1:14 PM, Alan Bateman <Alan.Bateman@oracle.com> wrote:

...
Starting out using buffered I/O is probably okay but I assume we will want to change this in the future to having it use memory mapped I/O beyond a certain threshold.

For the buffered case, could there be any performance advantage to using direct buffers with FileChannel and comparing running checksums of the two sources?

Don’t think this would work however as two files with differing content could conceivable have the same checksum however likely that might be. Brian

Brian Burkhalter

8:53 p.m.

On Sep 19, 2018, at 1:52 PM, Brian Burkhalter <brian.burkhalter@oracle.com> wrote:

...

...
For the buffered case, could there be any performance advantage to using direct buffers with FileChannel and comparing running checksums of the two sources?

Don’t think this would work however as two files with differing content could conceivable have the same checksum however likely that might be.

s/conceivable /conceivably/. s/likely/unlikely/. Brian

Joe Wang

20 Sep 20 Sep

12:59 a.m.

On 9/19/18, 1:52 PM, Brian Burkhalter wrote:

...

On Sep 19, 2018, at 1:44 PM, Brian Burkhalter<brian.burkhalter@oracle.com> wrote:

...
On Sep 19, 2018, at 1:14 PM, Alan Bateman<Alan.Bateman@oracle.com> wrote:

...
Starting out using buffered I/O is probably okay but I assume we will want to change this in the future to having it use memory mapped I/O beyond a certain threshold. For the buffered case, could there be any performance advantage to using direct buffers with FileChannel and comparing running checksums of the two sources? Don’t think this would work however as two files with differing content could conceivable have the same checksum however likely that might be.

I'll do some performance testing for both cases, FileChannel/direct buffer and possible checksum comparison, although, as you said, the later might not work since we need to be 100%. The probability of false positives is likely extremely low, but is not zero after all. I'll do the performance check any ways just out of curiosity. But in many cases, checksum comparison may not be faster than direct comparison, esp. in this case the API compares just two files. Checksum values would be nice if there are multiple files to compare. Thanks, Joe

...

Brian

Brian Burkhalter

3:10 p.m.

On Sep 19, 2018, at 5:59 PM, Joe Wang <huizhe.wang@oracle.com> wrote:

...

I'll do some performance testing for both cases, FileChannel/direct buffer and possible checksum comparison, although, as you said, the later might not work since we need to be 100%. The probability of false positives is likely extremely low, but is not zero after all. I'll do the performance check any ways just out of curiosity. But in many cases, checksum comparison may not be faster than direct comparison, esp. in this case the API compares just two files. Checksum values would be nice if there are multiple files to compare.

I doubt the checksum test is worthwhile. It might be useful however to try FileChannel and direct byte buffers combined with ByteBuffer.mismatch() [1]. Thanks, Brian [1] https://download.java.net/java/early_access/jdk12/docs/api/java.base/java/ni...)

Joe Wang

6:08 p.m.

On 9/20/18, 8:10 AM, Brian Burkhalter wrote:

...

On Sep 19, 2018, at 5:59 PM, Joe Wang <huizhe.wang@oracle.com <mailto:huizhe.wang@oracle.com>> wrote:

...
I'll do some performance testing for both cases, FileChannel/direct buffer and possible checksum comparison, although, as you said, the later might not work since we need to be 100%. The probability of false positives is likely extremely low, but is not zero after all. I'll do the performance check any ways just out of curiosity. But in many cases, checksum comparison may not be faster than direct comparison, esp. in this case the API compares just two files. Checksum values would be nice if there are multiple files to compare.

I doubt the checksum test is worthwhile. It might be useful however to try FileChannel and direct byte buffers combined with ByteBuffer.mismatch() [1].

Ok, save the time not to run checksum testing :-) Will do the later, actually, I'm almost sure it will be FileChannel/ByteBuffer (better than the buffered reading). Thanks, Joe

...

Thanks,

Brian

[1] https://download.java.net/java/early_access/jdk12/docs/api/java.base/java/ni...) <https://download.java.net/java/early_access/jdk12/docs/api/java.base/java/nio/ByteBuffer.html#mismatch%28java.nio.ByteBuffer%29>

Roger Riggs

19 Sep 19 Sep

9:02 p.m.

Hi Alan, On 9/19/18 4:14 PM, Alan Bateman wrote:

...

On 19/09/2018 19:48, Joe Wang wrote:

...
Hi,

After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/ Starting out using buffered I/O is probably okay but I assume we will want to change this in the future to having it use memory mapped I/O beyond a certain threshold. This came up in off-line discussions, it seems unlikely that two files will differ only in the last of 100Mb and it will require a separate code path that will very infrequently be exercised. So I'd still to a single technique even if it is slightly slower for very large files to keep the size of the code in check. If it shows up later as a performance problem it can be added.

$.02, Roger

...

Can you explain the use of toRealPath and comparing the names? That shouldn't be needed. Also the catching of ProviderMismatchExcepiton seems a bit strange too. Can you replace mismatchByAttrs with isSameFile? You could call this from Files.mismatch and then use the supporting implementation for the case that the files are not the same.

-Alan

Joe Wang

20 Sep 20 Sep

12:34 a.m.

On 9/19/18, 2:02 PM, Roger Riggs wrote:

...

Hi Alan,

On 9/19/18 4:14 PM, Alan Bateman wrote:

...
On 19/09/2018 19:48, Joe Wang wrote:

...
Hi,

After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/ Starting out using buffered I/O is probably okay but I assume we will want to change this in the future to having it use memory mapped I/O beyond a certain threshold. This came up in off-line discussions, it seems unlikely that two files will differ only in the last of 100Mb and it will require a separate code path that will very infrequently be exercised. So I'd still to a single technique even if it is slightly slower for very large files to keep the size of the code in check. If it shows up later as a performance problem it can be added.

...

$.02, Roger

...
Can you explain the use of toRealPath and comparing the names? That shouldn't be needed. Also the catching of ProviderMismatchExcepiton seems a bit strange too. Can you replace mismatchByAttrs with isSameFile? You could call this from Files.mismatch and then use the supporting implementation for the case that the files are not the same.

The purpose of toRealPath is so that we can do a quick Path comparison by following symlinks if any, I'll add a test to compare a file and its symbolic link. ProviderMismatchExcepiton was added in one of the iterations when we defined UOE, shall be removed in the next update (I'll have an update after some performance work to compare with direct buffer is done). mismatchByAttrs cannot be replaced with isSameFile, it does one edge case check more than isSameFile when the size of one of the files is zero. mismatchByAttrs opens the files once vs twice if isSameFile and size(path) are called. Thanks, Joe

...

...
-Alan

Alan Bateman

7:07 a.m.

On 20/09/2018 01:34, Joe Wang wrote:

...

:

The purpose of toRealPath is so that we can do a quick Path comparison by following symlinks if any, I'll add a test to compare a file and its symbolic link. You don't need it. If two non-equals paths locate the same file then mismatch will do the right thing. By dropping it then you avoid a potentially expensive realpath (x2).

...

ProviderMismatchExcepiton was added in one of the iterations when we defined UOE, shall be removed in the next update (I'll have an update after some performance work to compare with direct buffer is done). Good.

...

mismatchByAttrs cannot be replaced with isSameFile, it does one edge case check more than isSameFile when the size of one of the files is zero. mismatchByAttrs opens the files once vs twice if isSameFile and size(path) are called.

Hmm, I think we make this a lot simpler. -Alan

Joe Wang

5:28 p.m.

On 9/20/18, 12:07 AM, Alan Bateman wrote:

...

On 20/09/2018 01:34, Joe Wang wrote:

...
:

The purpose of toRealPath is so that we can do a quick Path comparison by following symlinks if any, I'll add a test to compare a file and its symbolic link. You don't need it. If two non-equals paths locate the same file then mismatch will do the right thing. By dropping it then you avoid a potentially expensive realpath (x2).

...
ProviderMismatchExcepiton was added in one of the iterations when we defined UOE, shall be removed in the next update (I'll have an update after some performance work to compare with direct buffer is done). Good.

...
mismatchByAttrs cannot be replaced with isSameFile, it does one edge case check more than isSameFile when the size of one of the files is zero. mismatchByAttrs opens the files once vs twice if isSameFile and size(path) are called.

Hmm, I think we make this a lot simpler.

Ok, I'll call isSameFile instead, that would get rid of toRealPath too, to not let these edge case handling penalize the most, normal use cases. The other way around (a bit more cost to the edge cases) shall be fine. Thanks, Joe

...

-Alan

Alan Bateman

7:12 a.m.

On 19/09/2018 22:02, Roger Riggs wrote:

...

This came up in off-line discussions, it seems unlikely that two files will differ only in the last of 100Mb and it will require a separate code path that will very infrequently be exercised. So I'd still to a single technique even if it is slightly slower for very large files to keep the size of the code in check. If it shows up later as a performance problem it can be added.

I think this will eventually have a different implementation for the default file system where it can used memory mapping of file files. If the first 1MB of large files are identical then that it might switch over to mapping by chunk to compare the rest of the file. Starting out with a basic implementation is okay of course. -Alan.

Joe Wang

6:04 p.m.

On 9/20/18, 12:12 AM, Alan Bateman wrote:

...

On 19/09/2018 22:02, Roger Riggs wrote:

...
This came up in off-line discussions, it seems unlikely that two files will differ only in the last of 100Mb and it will require a separate code path that will very infrequently be exercised. So I'd still to a single technique even if it is slightly slower for very large files to keep the size of the code in check. If it shows up later as a performance problem it can be added.

I think this will eventually have a different implementation for the default file system where it can used memory mapping of file files. If the first 1MB of large files are identical then that it might switch over to mapping by chunk to compare the rest of the file. Starting out with a basic implementation is okay of course.

Sounds good. Will keep the basic impl, probably use FileChannel/ByteBuffer as Brian suggested, and look forward to the more complicated version if we hear any complaints. For the majority of text files, the basic impl would be enough. For example, the largest file in java.nio is about 180K. Thanks, Joe

...

-Alan.

Daniel Fuchs

9:36 a.m.

Hi Joe, The spec reads very well to me. On the implementation side: 114 } else if (nRead1 != nRead2) { 115 int len = Math.min(nRead1, nRead2); 116 // there's always a mismatch when nRead1 != nRead2 117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len); I believe line 115 is not needed, and line 118 should be: Arrays.mismatch(buffer1, 0, nRead1, buffer2, 0, nRead2); otherwise - if I'm not mistaken - there's the chance that Arrays.mismatch(buffer1, 0, len, buffer2, 0, len); will return -1. Or am I missing something? If I'm right then your test is probably missing a case to explicitely test for that. Alternatively you could add a whitebox test to test FilesMismatch.mismatch(InputStream, InputStream) directly. best regards, -- daniel On 19/09/2018 19:48, Joe Wang wrote:

...

Hi,

After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

Thanks, Joe

Joe Wang

6:06 p.m.

Hi Daniel, You're right, line 115 is not needed. I'll add a couple of test cases to cover the route. Best, Joe On 9/20/18, 2:36 AM, Daniel Fuchs wrote:

...

Hi Joe,

The spec reads very well to me.

On the implementation side:

114 } else if (nRead1 != nRead2) { 115 int len = Math.min(nRead1, nRead2); 116 // there's always a mismatch when nRead1 != nRead2 117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);

I believe line 115 is not needed, and line 118 should be:

Arrays.mismatch(buffer1, 0, nRead1, buffer2, 0, nRead2);

otherwise - if I'm not mistaken - there's the chance that

Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);

will return -1. Or am I missing something?

If I'm right then your test is probably missing a case to explicitely test for that. Alternatively you could add a whitebox test to test FilesMismatch.mismatch(InputStream, InputStream) directly.

best regards,

-- daniel

On 19/09/2018 19:48, Joe Wang wrote:

...
Hi,

After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285

specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

Thanks, Joe

Stuart Marks

9:46 p.m.

On 9/19/18 11:48 AM, Joe Wang wrote:

...

After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Hi Joe, Thanks for being persistent with this one! A very small spec nitpick:

...

specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} returns true,</li> I would add "or" after the trailing comma. This makes it similar to the two-item list that follows.

...

webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

A couple minor comments on FilesMismatch.java: If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan suggested, this simplifies things considerably. It looks like the mismatch() implementation will be reduced to around ~40 lines of code. Does it deserve its own file, or can it be placed directly into Files.java? That file has over 4,000 lines already though. 106 if (nRead1 == 0 || nRead2 == 0) { 107 if (nRead1 == 0 && nRead2 == 0) { 108 // both files reach EOF 109 return -1; 110 } else { 111 // one of the files reaches EOF 112 return totalRead; 113 } 114 } else if (nRead1 != nRead2) { I think it reads slightly better if the nested 'if' at lines 107-113 were flattened into the else-if chain, like so: if (nRead1 == 0 && nRead2 == 0) { // both files reached EOF return -1; } else if (nRead1 == 0 || nRead2 == 0) { // one but not both files reached EOF return totalRead; } else if (nRead1 != nRead2) { There are a couple places where lines can be joined. The resulting lines are longer than 80 characters but are still less than 100 characters, which I think is OK. (I don't think we have a strict limit of 80 anymore.) 97 private static long mismatch(InputStream in1, InputStream in2) 98 throws IOException { 117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len); Comments on tests in Comparison.java: * This file, and several names within this file, should probably should be renamed to focus on mismatch instead of comparison. * I'm finding it quite difficult to read the test cases. The test files are placed into a Path[] and referenced by index. This makes it quite difficult to determine, for example, if all six cases within the mismatch loop are covered. Instead of storing the files at specific array indexes, I'd suggest doing the following. 1) Establish good naming conventions for test files, including size information, covering duplicate files (e.g., 'a' and 'b'), and a modification at a particular offset. For example, you might do something like the following: test120a - a test file with size 120 test120b - a test file with size 120, identical to test120a test120m110 - a test file with size 120, with an altered byte at 110 etc. 2) Declare fields of the same name containing the file paths. 3) Add some logic in prepareTestFile() to append each file's path to a list after the file is created. Then have the @AfterClass method run through this list and delete the files. 4) The test data generator method can then have lines of test data that reference the test files using a meaningful name, so it's easy to understand the names and the expected results. Also, the test data generator should have a comment that describes the test method parameters. * The generation of the test data files seems overly complicated. A StringBuilder is used to accumulate data, then it's turned into a String, then (optionally) a character is replaced by the insertChar() method, and then the string is written to the file. I think you can just accumulate the test data into a StringBuilder, (optionally) call sb.setCharAt() to modify the data, and then write the SB to the file. After all, Files.writeString() takes a CharSequence. The generateFixedString() method should be changed to create and return a fresh StringBuilder with the desired contents. (See above.) It can also be simplified quite a bit. I don't think you need to have methods with Charset parameters. It's reasonable to have the test files be text, which makes things easier to develop and debug. But the test depends on characters being written in a single-byte charset, because mismatch() reports *byte* positions and not *character* positions. Using the method overloads that are implicitly UTF-8 is incorrect. If a non-single-byte character were to slip into the test data (seems unlikely, but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, which may throw off the result. Instead I'd suggest hardwiring the charset to be US_ASCII to enforce the one-byte-per-character invariant. This is a lot of stuff. If you can't follow it all, perhaps we can work together directly and go over the test file. Thanks, s'marks

Joe Wang

22 Sep 22 Sep

1:49 a.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 9/20/18, 2:46 PM, Stuart Marks wrote:

...

On 9/19/18 11:48 AM, Joe Wang wrote:

...
After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Hi Joe,

Thanks for being persistent with this one!

Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line numbers and etc.)! From JDK 11 to 12, hopefully wont be deferred to 13 :-)

...

A very small spec nitpick:

...
specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} returns true,</li>

I would add "or" after the trailing comma. This makes it similar to the two-item list that follows.

Done.

...

...
webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

A couple minor comments on FilesMismatch.java:

If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan suggested, this simplifies things considerably. It looks like the mismatch() implementation will be reduced to around ~40 lines of code. Does it deserve its own file, or can it be placed directly into Files.java? That file has over 4,000 lines already though.

I merged the private methods after removing mismatchByAttrs(), that reduced the code to 33 lines, and then with the change you suggested below, it's further reduced to 29 lines. I'll put them in Files for now. Files is too big, ~180K, that compares with the next one ~45K among files in the nio/file package. Unfortunately, the great majority of them are specs, probably 90%.

...

106 if (nRead1 == 0 || nRead2 == 0) { 107 if (nRead1 == 0 && nRead2 == 0) { 108 // both files reach EOF 109 return -1; 110 } else { 111 // one of the files reaches EOF 112 return totalRead; 113 } 114 } else if (nRead1 != nRead2) {

I think it reads slightly better if the nested 'if' at lines 107-113 were flattened into the else-if chain, like so:

if (nRead1 == 0 && nRead2 == 0) { // both files reached EOF return -1; } else if (nRead1 == 0 || nRead2 == 0) { // one but not both files reached EOF return totalRead; } else if (nRead1 != nRead2) {

There are a couple places where lines can be joined. The resulting lines are longer than 80 characters but are still less than 100 characters, which I think is OK. (I don't think we have a strict limit of 80 anymore.)

97 private static long mismatch(InputStream in1, InputStream in2) 98 throws IOException {

117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);

Done. 95 characters :-)

...

Comments on tests in Comparison.java:

* This file, and several names within this file, should probably should be renamed to focus on mismatch instead of comparison.

Will rename to "mismatch". Previously, it covers both mismatch and compare.

...

* I'm finding it quite difficult to read the test cases.

The test files are placed into a Path[] and referenced by index. This makes it quite difficult to determine, for example, if all six cases within the mismatch loop are covered.

Instead of storing the files at specific array indexes, I'd suggest doing the following.

1) Establish good naming conventions for test files, including size information, covering duplicate files (e.g., 'a' and 'b'), and a modification at a particular offset. For example, you might do something like the following:

test120a - a test file with size 120 test120b - a test file with size 120, identical to test120a test120m110 - a test file with size 120, with an altered byte at 110 etc.

I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no mismatch" would help. Along with renaming variables, I will also add to the Assertion Msg with test case number, e.g. "Case#1: both files are empty, return no mismatch".

...

2) Declare fields of the same name containing the file paths.

3) Add some logic in prepareTestFile() to append each file's path to a list after the file is created. Then have the @AfterClass method run through this list and delete the files.

May also create the test files in a folder and delete it @AfterClass.

...

4) The test data generator method can then have lines of test data that reference the test files using a meaningful name, so it's easy to understand the names and the expected results. Also, the test data generator should have a comment that describes the test method parameters.

Sure, will add comments and etc.

...

* The generation of the test data files seems overly complicated.

A StringBuilder is used to accumulate data, then it's turned into a String, then (optionally) a character is replaced by the insertChar() method, and then the string is written to the file.

I think you can just accumulate the test data into a StringBuilder, (optionally) call sb.setCharAt() to modify the data, and then write the SB to the file. After all, Files.writeString() takes a CharSequence.

The generateFixedString() method should be changed to create and return a fresh StringBuilder with the desired contents. (See above.) It can also be simplified quite a bit.

Wouldn't it be nice if String has an utility method that generates random ASCII strings? ;-) It can be faster with its internal byte array.

...

I don't think you need to have methods with Charset parameters. It's reasonable to have the test files be text, which makes things easier to develop and debug. But the test depends on characters being written in a single-byte charset, because mismatch() reports *byte* positions and not *character* positions.

Using the method overloads that are implicitly UTF-8 is incorrect. If a non-single-byte character were to slip into the test data (seems unlikely, but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, which may throw off the result. Instead I'd suggest hardwiring the charset to be US_ASCII to enforce the one-byte-per-character invariant.

This is a lot of stuff. If you can't follow it all, perhaps we can work together directly and go over the test file.

I'll improve the test, and US_ASCII it is! I'll probably use a string of fixed length to generate the longest string the test needs, and then get substrings for other test cases, that may simplify the process. Thanks, Joe

...

Thanks,

s'marks

Joe Wang

12 Oct 12 Oct

7:16 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi all, Here's an update based on all of the great reviews and comments (thanks all!): JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302 Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/ Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/ It's been a while, so here's a summary of what's changed: Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java; Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files Thanks, Joe On 9/21/18, 6:49 PM, Joe Wang wrote:

...

On 9/20/18, 2:46 PM, Stuart Marks wrote:

...
On 9/19/18 11:48 AM, Joe Wang wrote:

...
After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Hi Joe,

Thanks for being persistent with this one!

Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line numbers and etc.)! From JDK 11 to 12, hopefully wont be deferred to 13 :-)

...
A very small spec nitpick:

...
specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} returns true,</li>

I would add "or" after the trailing comma. This makes it similar to the two-item list that follows.

Done.

...
...
webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

A couple minor comments on FilesMismatch.java:

If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan suggested, this simplifies things considerably. It looks like the mismatch() implementation will be reduced to around ~40 lines of code. Does it deserve its own file, or can it be placed directly into Files.java? That file has over 4,000 lines already though.

I merged the private methods after removing mismatchByAttrs(), that reduced the code to 33 lines, and then with the change you suggested below, it's further reduced to 29 lines. I'll put them in Files for now.

Files is too big, ~180K, that compares with the next one ~45K among files in the nio/file package. Unfortunately, the great majority of them are specs, probably 90%.

...
106 if (nRead1 == 0 || nRead2 == 0) { 107 if (nRead1 == 0 && nRead2 == 0) { 108 // both files reach EOF 109 return -1; 110 } else { 111 // one of the files reaches EOF 112 return totalRead; 113 } 114 } else if (nRead1 != nRead2) {

I think it reads slightly better if the nested 'if' at lines 107-113 were flattened into the else-if chain, like so:

if (nRead1 == 0 && nRead2 == 0) { // both files reached EOF return -1; } else if (nRead1 == 0 || nRead2 == 0) { // one but not both files reached EOF return totalRead; } else if (nRead1 != nRead2) {

There are a couple places where lines can be joined. The resulting lines are longer than 80 characters but are still less than 100 characters, which I think is OK. (I don't think we have a strict limit of 80 anymore.)

97 private static long mismatch(InputStream in1, InputStream in2) 98 throws IOException {

117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);

Done. 95 characters :-)

...
Comments on tests in Comparison.java:

* This file, and several names within this file, should probably should be renamed to focus on mismatch instead of comparison.

Will rename to "mismatch". Previously, it covers both mismatch and compare.

...
* I'm finding it quite difficult to read the test cases.

The test files are placed into a Path[] and referenced by index. This makes it quite difficult to determine, for example, if all six cases within the mismatch loop are covered.

Instead of storing the files at specific array indexes, I'd suggest doing the following.

1) Establish good naming conventions for test files, including size information, covering duplicate files (e.g., 'a' and 'b'), and a modification at a particular offset. For example, you might do something like the following:

test120a - a test file with size 120 test120b - a test file with size 120, identical to test120a test120m110 - a test file with size 120, with an altered byte at 110 etc.

I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no mismatch" would help. Along with renaming variables, I will also add to the Assertion Msg with test case number, e.g. "Case#1: both files are empty, return no mismatch".

...
2) Declare fields of the same name containing the file paths.

3) Add some logic in prepareTestFile() to append each file's path to a list after the file is created. Then have the @AfterClass method run through this list and delete the files.

May also create the test files in a folder and delete it @AfterClass.

...
4) The test data generator method can then have lines of test data that reference the test files using a meaningful name, so it's easy to understand the names and the expected results. Also, the test data generator should have a comment that describes the test method parameters.

Sure, will add comments and etc.

...
* The generation of the test data files seems overly complicated.

A StringBuilder is used to accumulate data, then it's turned into a String, then (optionally) a character is replaced by the insertChar() method, and then the string is written to the file.

I think you can just accumulate the test data into a StringBuilder, (optionally) call sb.setCharAt() to modify the data, and then write the SB to the file. After all, Files.writeString() takes a CharSequence.

The generateFixedString() method should be changed to create and return a fresh StringBuilder with the desired contents. (See above.) It can also be simplified quite a bit.

Wouldn't it be nice if String has an utility method that generates random ASCII strings? ;-) It can be faster with its internal byte array.

...
I don't think you need to have methods with Charset parameters. It's reasonable to have the test files be text, which makes things easier to develop and debug. But the test depends on characters being written in a single-byte charset, because mismatch() reports *byte* positions and not *character* positions.

Using the method overloads that are implicitly UTF-8 is incorrect. If a non-single-byte character were to slip into the test data (seems unlikely, but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, which may throw off the result. Instead I'd suggest hardwiring the charset to be US_ASCII to enforce the one-byte-per-character invariant.

This is a lot of stuff. If you can't follow it all, perhaps we can work together directly and go over the test file.

I'll improve the test, and US_ASCII it is! I'll probably use a string of fixed length to generate the longest string the test needs, and then get substrings for other test cases, that may simplify the process.

Thanks, Joe

...
Thanks,

s'marks

Alan Bateman

14 Oct 14 Oct

6:08 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 12/10/2018 20:16, Joe Wang wrote:

...

Hi all,

Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/ I think the javadoc looks good now.

I agree that starting with a simple implementation is okay, it can be improved in time. One thing to fix is totalRead that needs to be a long to give the correct result on large files that mismatch at position > 2GB. -Alan

Joe Wang

15 Oct 15 Oct

10 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/14/18, 11:08 AM, Alan Bateman wrote:

...

On 12/10/2018 20:16, Joe Wang wrote:

...
Hi all,

Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/ I think the javadoc looks good now.

I agree that starting with a simple implementation is okay, it can be improved in time. One thing to fix is totalRead that needs to be a long to give the correct result on large files that mismatch at position > 2GB.

Fixed. -Joe

...

-Alan

Weijun Wang

1:48 a.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Joe Two comments: 1. How about (path1, path2)? I take a look at other similar APIs, some use (c1,c2) and some (a,b). 2. Could the method be non-reflexive even if fs is non static? isSameFile(f,f) is always true. + * This method may not be atomic with respect to other file system + * operations. If the file system and files remain static then this method + * is reflexive (for {@code Path} {@code f}, {@code mismatch(f,f)} + * returns {@code -1L}) Thanks Max

...

On Oct 13, 2018, at 3:16 AM, Joe Wang <huizhe.wang@oracle.com> wrote:

Hi all,

Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

On 9/21/18, 6:49 PM, Joe Wang wrote:

...
On 9/20/18, 2:46 PM, Stuart Marks wrote:

...
On 9/19/18 11:48 AM, Joe Wang wrote:

...
After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only.

Hi Joe,

Thanks for being persistent with this one!

Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line numbers and etc.)! From JDK 11 to 12, hopefully wont be deferred to 13 :-)

...
A very small spec nitpick:

...
specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files...

1544 * <li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} returns true,</li>

I would add "or" after the trailing comma. This makes it similar to the two-item list that follows.

Done.

...
...
webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/

A couple minor comments on FilesMismatch.java:

If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan suggested, this simplifies things considerably. It looks like the mismatch() implementation will be reduced to around ~40 lines of code. Does it deserve its own file, or can it be placed directly into Files.java? That file has over 4,000 lines already though.

I merged the private methods after removing mismatchByAttrs(), that reduced the code to 33 lines, and then with the change you suggested below, it's further reduced to 29 lines. I'll put them in Files for now.

Files is too big, ~180K, that compares with the next one ~45K among files in the nio/file package. Unfortunately, the great majority of them are specs, probably 90%.

...
106 if (nRead1 == 0 || nRead2 == 0) { 107 if (nRead1 == 0 && nRead2 == 0) { 108 // both files reach EOF 109 return -1; 110 } else { 111 // one of the files reaches EOF 112 return totalRead; 113 } 114 } else if (nRead1 != nRead2) {

I think it reads slightly better if the nested 'if' at lines 107-113 were flattened into the else-if chain, like so:

if (nRead1 == 0 && nRead2 == 0) { // both files reached EOF return -1; } else if (nRead1 == 0 || nRead2 == 0) { // one but not both files reached EOF return totalRead; } else if (nRead1 != nRead2) {

There are a couple places where lines can be joined. The resulting lines are longer than 80 characters but are still less than 100 characters, which I think is OK. (I don't think we have a strict limit of 80 anymore.)

97 private static long mismatch(InputStream in1, InputStream in2) 98 throws IOException {

117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len);

Done. 95 characters :-)

...
Comments on tests in Comparison.java:

* This file, and several names within this file, should probably should be renamed to focus on mismatch instead of comparison.

Will rename to "mismatch". Previously, it covers both mismatch and compare.

...
* I'm finding it quite difficult to read the test cases.

The test files are placed into a Path[] and referenced by index. This makes it quite difficult to determine, for example, if all six cases within the mismatch loop are covered.

Instead of storing the files at specific array indexes, I'd suggest doing the following.

1) Establish good naming conventions for test files, including size information, covering duplicate files (e.g., 'a' and 'b'), and a modification at a particular offset. For example, you might do something like the following:

test120a - a test file with size 120 test120b - a test file with size 120, identical to test120a test120m110 - a test file with size 120, with an altered byte at 110 etc.

I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no mismatch" would help. Along with renaming variables, I will also add to the Assertion Msg with test case number, e.g. "Case#1: both files are empty, return no mismatch".

...
2) Declare fields of the same name containing the file paths.

3) Add some logic in prepareTestFile() to append each file's path to a list after the file is created. Then have the @AfterClass method run through this list and delete the files.

May also create the test files in a folder and delete it @AfterClass.

...
4) The test data generator method can then have lines of test data that reference the test files using a meaningful name, so it's easy to understand the names and the expected results. Also, the test data generator should have a comment that describes the test method parameters.

Sure, will add comments and etc.

...
* The generation of the test data files seems overly complicated.

A StringBuilder is used to accumulate data, then it's turned into a String, then (optionally) a character is replaced by the insertChar() method, and then the string is written to the file.

I think you can just accumulate the test data into a StringBuilder, (optionally) call sb.setCharAt() to modify the data, and then write the SB to the file. After all, Files.writeString() takes a CharSequence.

The generateFixedString() method should be changed to create and return a fresh StringBuilder with the desired contents. (See above.) It can also be simplified quite a bit.

Wouldn't it be nice if String has an utility method that generates random ASCII strings? ;-) It can be faster with its internal byte array.

...
I don't think you need to have methods with Charset parameters. It's reasonable to have the test files be text, which makes things easier to develop and debug. But the test depends on characters being written in a single-byte charset, because mismatch() reports *byte* positions and not *character* positions.

Using the method overloads that are implicitly UTF-8 is incorrect. If a non-single-byte character were to slip into the test data (seems unlikely, but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, which may throw off the result. Instead I'd suggest hardwiring the charset to be US_ASCII to enforce the one-byte-per-character invariant.

This is a lot of stuff. If you can't follow it all, perhaps we can work together directly and go over the test file.

I'll improve the test, and US_ASCII it is! I'll probably use a string of fixed length to generate the longest string the test needs, and then get substrings for other test cases, that may simplify the process.

Thanks, Joe

...
Thanks,

s'marks

Joe Wang

10:14 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/14/18, 6:48 PM, Weijun Wang wrote:

...

Hi Joe

Two comments:

1. How about (path1, path2)? I take a look at other similar APIs, some use (c1,c2) and some (a,b).

It's true we have some inconsistencies there. Arrays.mismatch for example, named the parameters (a,b). Arguably though, they should have been (a, a2) if they wanted to maintain consistency with then existing "equals" methods. In our case, since Files.isSameFile was (path, path2), Files.mismatch(path, path2) is desirable since it is extending the former's functionality.

...

2. Could the method be non-reflexive even if fs is non static? isSameFile(f,f) is always true.

+ * This method may not be atomic with respect to other file system + * operations. If the file system and files remain static then this method + * isreflexive (for {@code Path} {@code f}, {@code mismatch(f,f)} + * returns {@code -1L})

You're right. The javadoc is updated now with the If statement moved to cover the symmetric case only. Current: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/F... webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/ previous: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi... webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/ Thanks, Joe

...

Thanks Max

...
On Oct 13, 2018, at 3:16 AM, Joe Wang<huizhe.wang@oracle.com> wrote:

Hi all,

Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

On 9/21/18, 6:49 PM, Joe Wang wrote:

...
On 9/20/18, 2:46 PM, Stuart Marks wrote:

...
On 9/19/18 11:48 AM, Joe Wang wrote:

...
After much discussion and 10 iterations of reviews, this proposal has evolved from what was the original isSameContent method to a mismatch method. API-wise, a compare method was also considered as it looked like just a short step forward from mismatch, however, it was eventually dropped since there is no convincing use case comparing files lexicographically by contents. Impl-wise, extensive performance benchmarking has been done to compare a buffered reading vs memory mapping, the result was that a simple buffered reading performed better among small files, and those with the mismatched byte closer to the beginning of files. Since the proposed method's targeted files are small ones, the impl currently does a buffered reading only. Hi Joe,

Thanks for being persistent with this one!

Thanks for the help, and the "lot of stuff" with clear indicators (e.g. line numbers and etc.)! From JDK 11 to 12, hopefully wont be deferred to 13 :-)

...
A very small spec nitpick:

...
specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff/java/nio/file/Files... 1544 *<li> {@link #isSameFile(Path, Path) isSameFile(path, path2)} returns true,</li>

I would add "or" after the trailing comma. This makes it similar to the two-item list that follows. Done.

...
webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev/ A couple minor comments on FilesMismatch.java:

If mismatchByAttrs() is replaced with a call to isSameFile(), as Alan suggested, this simplifies things considerably. It looks like the mismatch() implementation will be reduced to around ~40 lines of code. Does it deserve its own file, or can it be placed directly into Files.java? That file has over 4,000 lines already though. I merged the private methods after removing mismatchByAttrs(), that reduced the code to 33 lines, and then with the change you suggested below, it's further reduced to 29 lines. I'll put them in Files for now.

Files is too big, ~180K, that compares with the next one ~45K among files in the nio/file package. Unfortunately, the great majority of them are specs, probably 90%.

...
106 if (nRead1 == 0 || nRead2 == 0) { 107 if (nRead1 == 0&& nRead2 == 0) { 108 // both files reach EOF 109 return -1; 110 } else { 111 // one of the files reaches EOF 112 return totalRead; 113 } 114 } else if (nRead1 != nRead2) {

I think it reads slightly better if the nested 'if' at lines 107-113 were flattened into the else-if chain, like so:

if (nRead1 == 0&& nRead2 == 0) { // both files reached EOF return -1; } else if (nRead1 == 0 || nRead2 == 0) { // one but not both files reached EOF return totalRead; } else if (nRead1 != nRead2) {

There are a couple places where lines can be joined. The resulting lines are longer than 80 characters but are still less than 100 characters, which I think is OK. (I don't think we have a strict limit of 80 anymore.)

97 private static long mismatch(InputStream in1, InputStream in2) 98 throws IOException {

117 return totalRead + 118 Arrays.mismatch(buffer1, 0, len, buffer2, 0, len); Done. 95 characters :-)

...
Comments on tests in Comparison.java:

* This file, and several names within this file, should probably should be renamed to focus on mismatch instead of comparison. Will rename to "mismatch". Previously, it covers both mismatch and compare. * I'm finding it quite difficult to read the test cases.

The test files are placed into a Path[] and referenced by index. This makes it quite difficult to determine, for example, if all six cases within the mismatch loop are covered.

Instead of storing the files at specific array indexes, I'd suggest doing the following.

1) Establish good naming conventions for test files, including size information, covering duplicate files (e.g., 'a' and 'b'), and a modification at a particular offset. For example, you might do something like the following:

test120a - a test file with size 120 test120b - a test file with size 120, identical to test120a test120m110 - a test file with size 120, with an altered byte at 110 etc. I can do that. I had hoped the Assertion msg, e.g. "Both files are empty, no mismatch" would help. Along with renaming variables, I will also add to the Assertion Msg with test case number, e.g. "Case#1: both files are empty, return no mismatch". 2) Declare fields of the same name containing the file paths.

3) Add some logic in prepareTestFile() to append each file's path to a list after the file is created. Then have the @AfterClass method run through this list and delete the files. May also create the test files in a folder and delete it @AfterClass. 4) The test data generator method can then have lines of test data that reference the test files using a meaningful name, so it's easy to understand the names and the expected results. Also, the test data generator should have a comment that describes the test method parameters. Sure, will add comments and etc. * The generation of the test data files seems overly complicated.

A StringBuilder is used to accumulate data, then it's turned into a String, then (optionally) a character is replaced by the insertChar() method, and then the string is written to the file.

I think you can just accumulate the test data into a StringBuilder, (optionally) call sb.setCharAt() to modify the data, and then write the SB to the file. After all, Files.writeString() takes a CharSequence.

The generateFixedString() method should be changed to create and return a fresh StringBuilder with the desired contents. (See above.) It can also be simplified quite a bit. Wouldn't it be nice if String has an utility method that generates random ASCII strings? ;-) It can be faster with its internal byte array. I don't think you need to have methods with Charset parameters. It's reasonable to have the test files be text, which makes things easier to develop and debug. But the test depends on characters being written in a single-byte charset, because mismatch() reports *byte* positions and not *character* positions.

Using the method overloads that are implicitly UTF-8 is incorrect. If a non-single-byte character were to slip into the test data (seems unlikely, but possible), it'll be expanded to multiple bytes by the UTF-8 encoder, which may throw off the result. Instead I'd suggest hardwiring the charset to be US_ASCII to enforce the one-byte-per-character invariant.

This is a lot of stuff. If you can't follow it all, perhaps we can work together directly and go over the test file. I'll improve the test, and US_ASCII it is! I'll probably use a string of fixed length to generate the longest string the test needs, and then get substrings for other test cases, that may simplify the process.

Thanks, Joe

...
Thanks,

s'marks

Weijun Wang

16 Oct 16 Oct

12:39 a.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

...

On Oct 16, 2018, at 6:14 AM, Joe Wang <huizhe.wang@oracle.com> wrote:

On 10/14/18, 6:48 PM, Weijun Wang wrote:

...
Hi Joe

Two comments:

1. How about (path1, path2)? I take a look at other similar APIs, some use (c1,c2) and some (a,b).

It's true we have some inconsistencies there. Arrays.mismatch for example, named the parameters (a,b). Arguably though, they should have been (a, a2) if they wanted to maintain consistency with then existing "equals" methods. In our case, since Files.isSameFile was (path, path2), Files.mismatch(path, path2) is desirable since it is extending the former's functionality.

Oops, I didn't notice this one. You're right. Thanks Max

Daniel Fuchs

9:11 a.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Joe, There are a few places where {@linkplain } should be used instead of {@link }. For instance: 1545 * <li> The two paths locate the {@link #isSameFile(Path, Path) same file}, This should be {@linkplain } - for the record {@link } will format the text as code, {@linkplain } will format it as regular text. Since the text of the link is "same file" then it should be formatted as regular text - not as code. The link on the next line should probably be {@linkplain } as well. Otherwise looks good to me (except for the declaration of int totalRead that should be long but Alan already mentioned it). best regards, -- daniel On 12/10/2018 20:16, Joe Wang wrote:

...

Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

Joe Wang

6:10 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Daniel, @linkplain it is! Thanks! And yes, int totalRead was fixed in the previous webrevs. Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v04/java/nio/file/F... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v04/ Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/F... webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/ Best regards, Joe On 10/16/18, 2:11 AM, Daniel Fuchs wrote:

...

Hi Joe,

There are a few places where {@linkplain } should be used instead of {@link }. For instance:

1545 * <li> The two paths locate the {@link #isSameFile(Path, Path) same file},

This should be {@linkplain } - for the record {@link } will format the text as code, {@linkplain } will format it as regular text. Since the text of the link is "same file" then it should be formatted as regular text - not as code.

The link on the next line should probably be {@linkplain } as well.

Otherwise looks good to me (except for the declaration of int totalRead that should be long but Alan already mentioned it).

best regards,

-- daniel

On 12/10/2018 20:16, Joe Wang wrote:

...
Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

Alan Bateman

6:31 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 16/10/2018 19:10, Joe Wang wrote:

...

Hi Daniel,

@linkplain it is! Thanks! And yes, int totalRead was fixed in the previous webrevs.

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v04/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v04/ The javadoc and starting implementation looks okay now. I haven't had time to go over the test so hopefully someone else has time to look at that.

-Alan

Roger Riggs

17 Oct 17 Oct

2:16 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Joe, Looks good, straightforward and easy to understand. The spec text in the CSR need to be updated to match. (At least the paragraph about reflexive and atomic). Thanks, Roger On 10/16/2018 02:10 PM, Joe Wang wrote:

...

Hi Daniel,

@linkplain it is! Thanks! And yes, int totalRead was fixed in the previous webrevs.

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v04/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v04/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/

Best regards, Joe

On 10/16/18, 2:11 AM, Daniel Fuchs wrote:

...
Hi Joe,

There are a few places where {@linkplain } should be used instead of {@link }. For instance:

1545 * <li> The two paths locate the {@link #isSameFile(Path, Path) same file},

This should be {@linkplain } - for the record {@link } will format the text as code, {@linkplain } will format it as regular text. Since the text of the link is "same file" then it should be formatted as regular text - not as code.

The link on the next line should probably be {@linkplain } as well.

Otherwise looks good to me (except for the declaration of int totalRead that should be long but Alan already mentioned it).

best regards,

-- daniel

On 12/10/2018 20:16, Joe Wang wrote:

...
Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

-- Thanks, Roger

Joe Wang

7:33 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/17/18, 7:16 AM, Roger Riggs wrote:

...

Hi Joe,

Looks good, straightforward and easy to understand.

Glad to hear that, while it's been back and forth a couple of times, it seems "straightforward and easy" won :-)

...

The spec text in the CSR need to be updated to match. (At least the paragraph about reflexive and atomic).

CSR: update the text and attachment https://bugs.openjdk.java.net/browse/JDK-8202302 Also added coverage to spec case 5 - reflexive (no new test cases), spec case 6 - symmetric with tests selected from case 3. webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/ Thanks, Joe

...

Thanks, Roger

On 10/16/2018 02:10 PM, Joe Wang wrote:

...
Hi Daniel,

@linkplain it is! Thanks! And yes, int totalRead was fixed in the previous webrevs.

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v04/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v04/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/

Best regards, Joe

On 10/16/18, 2:11 AM, Daniel Fuchs wrote:

...
Hi Joe,

There are a few places where {@linkplain } should be used instead of {@link }. For instance:

1545 * <li> The two paths locate the {@link #isSameFile(Path, Path) same file},

This should be {@linkplain } - for the record {@link } will format the text as code, {@linkplain } will format it as regular text. Since the text of the link is "same file" then it should be formatted as regular text - not as code.

The link on the next line should probably be {@linkplain } as well.

Otherwise looks good to me (except for the declaration of int totalRead that should be long but Alan already mentioned it).

best regards,

-- daniel

On 12/10/2018 20:16, Joe Wang wrote:

...
Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

Roger Riggs

8:08 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Joe, A few comments about the test. 45: the summary should mention it is testing the Files.mismatch method I think the @bug line should be empty, this is not a test for a bug. I don't think you need or use the testng group= functionality; if you don't have a need for it omit it. On the test files, I think I would have used testMismatch DataProvider to create the files. It would avoid having to maintain parallel lists of matching parameters. The @beforeSetup might support the dataProvider directly, but might have to iterate over the array itself, creating the files if they do not exist. And keeping the mapping from name to path in a Map instead of statics. The code should use Assert.assertEquals(actual, expected, msg) so that if there is a difference it is printed. (instead of assertTrue ( n==m)...) The file sizes all seem to be multiples of 1024 or the buffer size. There might be a good case for using a more varied sizes. 340: testMismatchNotExists: can the expected exception be more specific: FileNotFoundException instead of IOException. 370: This condition for zero made me wonder if there was a test case for files that differ at 0. 397: fillArray could probably make good use of System.arraycopy(). Regards, Roger On 10/17/2018 03:33 PM, Joe Wang wrote:

...

On 10/17/18, 7:16 AM, Roger Riggs wrote:

...
Hi Joe,

Looks good, straightforward and easy to understand.

Glad to hear that, while it's been back and forth a couple of times, it seems "straightforward and easy" won :-)

...
The spec text in the CSR need to be updated to match. (At least the paragraph about reflexive and atomic).

CSR: update the text and attachment https://bugs.openjdk.java.net/browse/JDK-8202302

Also added coverage to spec case 5 - reflexive (no new test cases), spec case 6 - symmetric with tests selected from case 3. webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/

Thanks, Joe

...
Thanks, Roger

On 10/16/2018 02:10 PM, Joe Wang wrote:

...
Hi Daniel,

@linkplain it is! Thanks! And yes, int totalRead was fixed in the previous webrevs.

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v04/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v04/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/

Best regards, Joe

On 10/16/18, 2:11 AM, Daniel Fuchs wrote:

...
Hi Joe,

There are a few places where {@linkplain } should be used instead of {@link }. For instance:

1545 * <li> The two paths locate the {@link #isSameFile(Path, Path) same file},

This should be {@linkplain } - for the record {@link } will format the text as code, {@linkplain } will format it as regular text. Since the text of the link is "same file" then it should be formatted as regular text - not as code.

The link on the next line should probably be {@linkplain } as well.

Otherwise looks good to me (except for the declaration of int totalRead that should be long but Alan already mentioned it).

best regards,

-- daniel

On 12/10/2018 20:16, Joe Wang wrote:

...
Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

-- Thanks, Roger

Stuart Marks

9:01 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/17/18 1:08 PM, Roger Riggs wrote:

...

I think the @bug line should be empty, this is not a test for a bug. I don't think you need or use the testng group= functionality; if you don't have a need for it omit it.

I think it would be helpful to have the bug ID in the test's @bug line. If in the future this test fails or needs to be modified or something, this points to the bug under which the changeset was pushed. This bug ID can also be used to help find the email review thread. It also links to the CSR should any questions about the specification arise. s'marks

Joe Wang

19 Oct 19 Oct

6:10 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/17/18, 1:08 PM, Roger Riggs wrote:

...

Hi Joe,

A few comments about the test.

45: the summary should mention it is testing the Files.mismatch method

Files.mismatch it is.

...

I think the @bug line should be empty, this is not a test for a bug.

Consider along with Stuart's comment, that it points to the bug under which the changeset was pushed, would it be ok to keep it?

...

I don't think you need or use the testng group= functionality; if you don't have a need for it omit it.

Removed.

...

On the test files, I think I would have used testMismatch DataProvider to create the files. It would avoid having to maintain parallel lists of matching parameters. The @beforeSetup might support the dataProvider directly, but might have to iterate over the array itself, creating the files if they do not exist. And keeping the mapping from name to path in a Map instead of statics.

Removed the class fields. They were converted from an array that was used for @AfterClass cleanup, now that I've added a temporary test directory, cleanup won't need them anymore. Not much sharing among the DataProviders either. So now they are created inside each DataProvider as needed.

...

The code should use Assert.assertEquals(actual, expected, msg) so that if there is a difference it is printed. (instead of assertTrue ( n==m)...)

Done.

...

The file sizes all seem to be multiples of 1024 or the buffer size. There might be a good case for using a more varied sizes.

Yeah, added a couple of more test groups with arbitrary sizes.

...

340: testMismatchNotExists: can the expected exception be more specific: FileNotFoundException instead of IOException.

It appears isSameFile actually throws java.nio.file.NoSuchFileException rather than FileNotFoundException. Since isSameFile, as well as this spec, defined IOException, the expected exception spec-wise and impl-wise (it uses isSameFile) is indeed IOException.

...

370: This condition for zero made me wonder if there was a test case for files that differ at 0.

Added tests for a mismatch at 0, and also at size - 1 (at the end).

...

397: fillArray could probably make good use of System.arraycopy().

Sure. Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Previous: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/ Thanks, Joe

...

Regards, Roger

On 10/17/2018 03:33 PM, Joe Wang wrote:

...
On 10/17/18, 7:16 AM, Roger Riggs wrote:

...
Hi Joe,

Looks good, straightforward and easy to understand.

Glad to hear that, while it's been back and forth a couple of times, it seems "straightforward and easy" won :-)

...
The spec text in the CSR need to be updated to match. (At least the paragraph about reflexive and atomic).

CSR: update the text and attachment https://bugs.openjdk.java.net/browse/JDK-8202302

Also added coverage to spec case 5 - reflexive (no new test cases), spec case 6 - symmetric with tests selected from case 3. webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/

Thanks, Joe

...
Thanks, Roger

On 10/16/2018 02:10 PM, Joe Wang wrote:

...
Hi Daniel,

@linkplain it is! Thanks! And yes, int totalRead was fixed in the previous webrevs.

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v04/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v04/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v03/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v03/

Best regards, Joe

On 10/16/18, 2:11 AM, Daniel Fuchs wrote:

...
Hi Joe,

There are a few places where {@linkplain } should be used instead of {@link }. For instance:

1545 * <li> The two paths locate the {@link #isSameFile(Path, Path) same file},

This should be {@linkplain } - for the record {@link } will format the text as code, {@linkplain } will format it as regular text. Since the text of the link is "same file" then it should be formatted as regular text - not as code.

The link on the next line should probably be {@linkplain } as well.

Otherwise looks good to me (except for the declaration of int totalRead that should be long but Alan already mentioned it).

best regards,

-- daniel

On 12/10/2018 20:16, Joe Wang wrote:

...
Here's an update based on all of the great reviews and comments (thanks all!):

JBS: https://bugs.openjdk.java.net/browse/JDK-8202285 CSR: https://bugs.openjdk.java.net/browse/JDK-8202302

Current version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_02/java/nio/file/Fi...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v02/

Previous version: specdiff: http://cr.openjdk.java.net/~joehw/jdk12/8202285/specdiff_v01/java/nio/file/F...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v01/

It's been a while, so here's a summary of what's changed:

Spec: Alan's patch to fix inconsistencies in wording/terminology Impl: s/mismatchByAttrs and etc/isSameFile; Stick with the simple implementation (InputStream/Buffer) for now. Further improvement may be done if there's a need; The impl is smaller than the previous version, merged the code into Files, removed FilesMismatch.java;

Test: more readable variables instead of the array of test files more comments on the testcases improved the private methods used for generating the test files

Thanks, Joe

-- Thanks, Roger

Stuart Marks

17 Oct 17 Oct

10:03 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Joe,

...

webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/

I concur with Roger's comments on the test file generation. It would be good to find some way to reduce the redundancy. He had some suggestions that might work. You might also want to play around with variations to see how they look. 375 byte[] bytes = "ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 \n".getBytes(); A small thing, but this uses the default charset, which is platform-specific. I think it would be good to use StandardCharsets.US_ASCII here. Once you have this in bytes, it looks like all the other parts of the test deal with bytes (not characters) so I don't think there are any other concerns about charsets and such. 345 /** 346 * Creates a file with alphanumeric content with one character altered 347 * at the specified position. 348 * 349 * @param dir the directory in which the file is to be created 350 * @param purpose the purpose of the file 351 * @param size the size of the file 352 * @param pos the position where the alternative char is to be added. If it 353 * is smaller than zero, no alternation shall be made. 354 * @param c the character 355 * @return path of the created file 356 * @throws IOException 357 */ 358 private static Path createANFile(Path dir, String purpose, int size, int pos, 359 byte c) throws IOException { 360 Path path = Files.createFile(Paths.get(dir.toString(), purpose + ".txt")); 361 if (size > 0) { 362 writeAlphanumericFile(path, size, pos, c); 363 } 364 return path; 365 } The problem with documenting things well is that the doc gets out of date. :-) (Well, actually somewhat serious.) This method creates test data files in terms of bytes (although the bytes happen to be derived from ASCII characters). The altered data is a byte at a particular byte offset, not a character. Several places in the doc need to be updated to reflect this, since they still talk about characters. Even though the altered data is a byte, for the convenience of the caller, the parameter should be a char or an int. All the places where this is called use a char literal casted to a byte: 125 test1024m1014a = createANFile(testDir, "test1024m1014a", size, size - 10, (byte)'@'); If the parameter were a char or a int, the caller could just do: 125 test1024m1014a = createANFile(testDir, "test1024m1014a", size, size - 10, '@'); The byte used to alter the test data could be the low order 7 bits (since we're using ASCII) of the parameter value. Either document it this way and simply mask off the high order bits of the parameter value, or assert that those high order bits are all zeroes. In writeAlphanumericFile(), 370 if (pos > 0) a[pos] = c; This should check should be for pos >= 0, since altering a byte at position 0 is a valid test case. You might want to add this as another test case. s'marks

Joe Wang

19 Oct 19 Oct

6:26 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/17/18, 3:03 PM, Stuart Marks wrote:

...

Hi Joe,

...
webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/

I concur with Roger's comments on the test file generation. It would be good to find some way to reduce the redundancy. He had some suggestions that might work. You might also want to play around with variations to see how they look.

375 byte[] bytes = "ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 \n".getBytes();

A small thing, but this uses the default charset, which is platform-specific. I think it would be good to use StandardCharsets.US_ASCII here.

ASCII it is.

...

Once you have this in bytes, it looks like all the other parts of the test deal with bytes (not characters) so I don't think there are any other concerns about charsets and such.

345 /** 346 * Creates a file with alphanumeric content with one character altered 347 * at the specified position. 348 * 349 * @param dir the directory in which the file is to be created 350 * @param purpose the purpose of the file 351 * @param size the size of the file 352 * @param pos the position where the alternative char is to be added. If it 353 * is smaller than zero, no alternation shall be made. 354 * @param c the character 355 * @return path of the created file 356 * @throws IOException 357 */ 358 private static Path createANFile(Path dir, String purpose, int size, int pos, 359 byte c) throws IOException { 360 Path path = Files.createFile(Paths.get(dir.toString(), purpose + ".txt")); 361 if (size > 0) { 362 writeAlphanumericFile(path, size, pos, c); 363 } 364 return path; 365 }

The problem with documenting things well is that the doc gets out of date. :-)

Very true, now I get to keep the doc as we switch back to char :-)

...

(Well, actually somewhat serious.)

This method creates test data files in terms of bytes (although the bytes happen to be derived from ASCII characters). The altered data is a byte at a particular byte offset, not a character. Several places in the doc need to be updated to reflect this, since they still talk about characters.

Even though the altered data is a byte, for the convenience of the caller, the parameter should be a char or an int. All the places where this is called use a char literal casted to a byte:

125 test1024m1014a = createANFile(testDir, "test1024m1014a", size, size - 10, (byte)'@');

If the parameter were a char or a int, the caller could just do:

125 test1024m1014a = createANFile(testDir, "test1024m1014a", size, size - 10, '@');

The byte used to alter the test data could be the low order 7 bits (since we're using ASCII) of the parameter value. Either document it this way and simply mask off the high order bits of the parameter value, or assert that those high order bits are all zeroes.

Sounds good, the latter looks better with a char. It's a plus as I can keep the statement that says it alters one character. While only the low order byte matters for this test data (ASCII only), it's masked off to be clear about the intention.

...

In writeAlphanumericFile(),

370 if (pos > 0) a[pos] = c;

This should check should be for pos >= 0, since altering a byte at position 0 is a valid test case. You might want to add this as another test case.

Changed to >=0, and added test cases for an altered byte at position 0 (as well as at the end). Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Previous: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v05/ Thanks, Joe

...

s'marks

Stuart Marks

6 Nov 6 Nov

4:27 a.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 10/19/18 11:26 AM, Joe Wang wrote:

...

Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/

Hi Joe, Thanks for updating the tests per my comments. Everything looks good now! s'marks

Roger Riggs

4:55 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

+1 Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary. Regards, Roger On 11/05/2018 11:27 PM, Stuart Marks wrote:

...

On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/

Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

Andrew Luo

6:27 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

I'm not a reviewer, but just wanted to point out that the fully qualified Files.BUFFER_SIZE can be simplified to just BUFFER_SIZE. Searching through the code in Files.java I don't see many other instances of using Files.* (although I do see a few). Thanks Andrew -----Original Message----- From: core-libs-dev <core-libs-dev-bounces@openjdk.java.net> On Behalf Of Roger Riggs Sent: Tuesday, November 6, 2018 8:56 AM To: core-libs-dev@openjdk.java.net Subject: Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents +1 Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary. Regards, Roger On 11/05/2018 11:27 PM, Stuart Marks wrote:

...

On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/

Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

Joe Wang

8:53 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Thanks Andrew for pointing that out. It wasn't intentional, a result of copy n paste after removing the support class. It's fixed now: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v07/ -Joe On 11/6/18, 10:27 AM, Andrew Luo wrote:

...

I'm not a reviewer, but just wanted to point out that the fully qualified Files.BUFFER_SIZE can be simplified to just BUFFER_SIZE. Searching through the code in Files.java I don't see many other instances of using Files.* (although I do see a few).

Thanks Andrew

-----Original Message----- From: core-libs-dev<core-libs-dev-bounces@openjdk.java.net> On Behalf Of Roger Riggs Sent: Tuesday, November 6, 2018 8:56 AM To: core-libs-dev@openjdk.java.net Subject: Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

+1

Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary.

Regards, Roger

On 11/05/2018 11:27 PM, Stuart Marks wrote:

...
On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

Ivan Gerasimov

7 Nov 7 Nov

12:21 a.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Hi Joe! I apologize, if it was already discussed before. Why do you need to have 3 edge cases of nRead1 and nRead2 in lines 1596-1605: Was it to save a call to Arrays.mismatch()? If I'm not mistaken, the code would work equally well, if lines 1596-1615 were replaced with just subrange 1606-1614. With kind regards, Ivan On 11/6/18 12:53 PM, Joe Wang wrote:

...

Thanks Andrew for pointing that out. It wasn't intentional, a result of copy n paste after removing the support class. It's fixed now: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v07/

-Joe

On 11/6/18, 10:27 AM, Andrew Luo wrote:

...
I'm not a reviewer, but just wanted to point out that the fully qualified Files.BUFFER_SIZE can be simplified to just BUFFER_SIZE. Searching through the code in Files.java I don't see many other instances of using Files.* (although I do see a few).

Thanks Andrew

-----Original Message----- From: core-libs-dev<core-libs-dev-bounces@openjdk.java.net> On Behalf Of Roger Riggs Sent: Tuesday, November 6, 2018 8:56 AM To: core-libs-dev@openjdk.java.net Subject: Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

+1

Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary.

Regards, Roger

On 11/05/2018 11:27 PM, Stuart Marks wrote:

...
On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

-- With kind regards, Ivan Gerasimov

Joe Wang

5:59 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Thanks Ivan! I agree that the upfront edge case checks aren't really necessary, after all, the great majority of the use cases won't hit the edge case (with the size being a factor of the buffer). We're therefore better off without the checks. Here's an updated webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v08/ Best, Joe On 11/6/18, 4:21 PM, Ivan Gerasimov wrote:

...

Hi Joe!

I apologize, if it was already discussed before.

Why do you need to have 3 edge cases of nRead1 and nRead2 in lines 1596-1605: Was it to save a call to Arrays.mismatch()?

If I'm not mistaken, the code would work equally well, if lines 1596-1615 were replaced with just subrange 1606-1614.

With kind regards, Ivan

On 11/6/18 12:53 PM, Joe Wang wrote:

...
Thanks Andrew for pointing that out. It wasn't intentional, a result of copy n paste after removing the support class. It's fixed now: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v07/

-Joe

On 11/6/18, 10:27 AM, Andrew Luo wrote:

...
I'm not a reviewer, but just wanted to point out that the fully qualified Files.BUFFER_SIZE can be simplified to just BUFFER_SIZE. Searching through the code in Files.java I don't see many other instances of using Files.* (although I do see a few).

Thanks Andrew

-----Original Message----- From: core-libs-dev<core-libs-dev-bounces@openjdk.java.net> On Behalf Of Roger Riggs Sent: Tuesday, November 6, 2018 8:56 AM To: core-libs-dev@openjdk.java.net Subject: Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

+1

Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary.

Regards, Roger

On 11/05/2018 11:27 PM, Stuart Marks wrote:

...
On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

Ivan Gerasimov

7:39 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Thank you Joe! I like the last variant. With kind regards, Ivan On 11/7/18 9:59 AM, Joe Wang wrote:

...

Thanks Ivan!

I agree that the upfront edge case checks aren't really necessary, after all, the great majority of the use cases won't hit the edge case (with the size being a factor of the buffer). We're therefore better off without the checks.

Here's an updated webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v08/

Best, Joe

On 11/6/18, 4:21 PM, Ivan Gerasimov wrote:

...
Hi Joe!

I apologize, if it was already discussed before.

Why do you need to have 3 edge cases of nRead1 and nRead2 in lines 1596-1605: Was it to save a call to Arrays.mismatch()?

If I'm not mistaken, the code would work equally well, if lines 1596-1615 were replaced with just subrange 1606-1614.

With kind regards, Ivan

On 11/6/18 12:53 PM, Joe Wang wrote:

...
Thanks Andrew for pointing that out. It wasn't intentional, a result of copy n paste after removing the support class. It's fixed now: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v07/

-Joe

On 11/6/18, 10:27 AM, Andrew Luo wrote:

...
I'm not a reviewer, but just wanted to point out that the fully qualified Files.BUFFER_SIZE can be simplified to just BUFFER_SIZE. Searching through the code in Files.java I don't see many other instances of using Files.* (although I do see a few).

Thanks Andrew

-----Original Message----- From: core-libs-dev<core-libs-dev-bounces@openjdk.java.net> On Behalf Of Roger Riggs Sent: Tuesday, November 6, 2018 8:56 AM To: core-libs-dev@openjdk.java.net Subject: Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

+1

Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary.

Regards, Roger

On 11/05/2018 11:27 PM, Stuart Marks wrote:

...
On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

-- With kind regards, Ivan Gerasimov

Joe Wang

8 Nov 8 Nov

6:01 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Checked in. Thanks all for all of the great help! Best regards, Joe On 11/7/18, 11:39 AM, Ivan Gerasimov wrote:

...

Thank you Joe!

I like the last variant.

With kind regards, Ivan

On 11/7/18 9:59 AM, Joe Wang wrote:

...
Thanks Ivan!

I agree that the upfront edge case checks aren't really necessary, after all, the great majority of the use cases won't hit the edge case (with the size being a factor of the buffer). We're therefore better off without the checks.

Here's an updated webrev: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v08/

Best, Joe

On 11/6/18, 4:21 PM, Ivan Gerasimov wrote:

...
Hi Joe!

I apologize, if it was already discussed before.

Why do you need to have 3 edge cases of nRead1 and nRead2 in lines 1596-1605: Was it to save a call to Arrays.mismatch()?

If I'm not mistaken, the code would work equally well, if lines 1596-1615 were replaced with just subrange 1606-1614.

With kind regards, Ivan

On 11/6/18 12:53 PM, Joe Wang wrote:

...
Thanks Andrew for pointing that out. It wasn't intentional, a result of copy n paste after removing the support class. It's fixed now: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v07/

-Joe

On 11/6/18, 10:27 AM, Andrew Luo wrote:

...
I'm not a reviewer, but just wanted to point out that the fully qualified Files.BUFFER_SIZE can be simplified to just BUFFER_SIZE. Searching through the code in Files.java I don't see many other instances of using Files.* (although I do see a few).

Thanks Andrew

-----Original Message----- From: core-libs-dev<core-libs-dev-bounces@openjdk.java.net> On Behalf Of Roger Riggs Sent: Tuesday, November 6, 2018 8:56 AM To: core-libs-dev@openjdk.java.net Subject: Re: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

+1

Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary.

Regards, Roger

On 11/05/2018 11:27 PM, Stuart Marks wrote:

...
On 10/19/18 11:26 AM, Joe Wang wrote: > Current version: > http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/ Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

Joe Wang

6 Nov 6 Nov

8:51 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

On 11/6/18, 8:55 AM, Roger Riggs wrote:

...

+1

Though with 65 tests, I suspect that there are more cases than strictly needed to cover all the code flows. For example, three cases for testing when it is the same file doesn't seem necessary.

Removed one of the three cases, leaving one edge case (non-existent) and one normal. Removed all but two that verifies the edge case where one of the file sizes is zero, see line 173 and 174. Removed most of the large files (1MB). The redundancy is not strictly necessary, the ones with 64kb are sufficient for verifying matching after a few iterations. All together, that reduced the tests from 65 to 46. http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v07/ Thanks, Joe

...

Regards, Roger

On 11/05/2018 11:27 PM, Stuart Marks wrote:

...
On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/

Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

Joe Wang

8:32 p.m.

New subject: RFR(JDK 12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

Thanks Stuart! I'm glad it's finally close to the end :-) On 11/5/18, 8:27 PM, Stuart Marks wrote:

...

On 10/19/18 11:26 AM, Joe Wang wrote:

...
Current version: http://cr.openjdk.java.net/~joehw/jdk12/8202285/webrev_v06/

Hi Joe,

Thanks for updating the tests per my comments. Everything looks good now!

s'marks

2677

Age (days ago)

2727

Last active (days ago)

List overview

Download

43 comments

9 participants

participants (9)

Alan Bateman
Andrew Luo
Brian Burkhalter
Daniel Fuchs
Ivan Gerasimov
Joe Wang
Roger Riggs
Stuart Marks
Weijun Wang

RFR(JDK12/NIO) 8202285: (fs) Add a method to Files for comparing file contents

tags

participants (9)