RFR 8072773 (fs) Files.lines needs a better splitting implementation for stream source
Xueming Shen
xueming.shen at oracle.com
Wed Jun 3 19:19:33 UTC 2015
On 06/03/2015 08:53 AM, Paul Sandoz wrote:
> Hi,
>
> Please review an optimization for Files.lines for certain charsets:
>
> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8072773-File-lines/webrev/
>
> If a charset is, say, US-ASCII or UTF-8, it is possible to implement an efficient splitting Spliterator that scans bytes from a mid-point to search for line feed characters.
>
> Splitting uses a mapped byte buffer. Traversal uses FileChannel reads at an offset. In previous incarnations I tried to use a mapped byte buffer for both, but for some reason the traversal performance was not good (on both Mac and x86). In any case I am happy with the current approach, as there is minimal layering between the FileChannel and the BufferedReader leveraged to read the lines.
>
> Sequential performance is similar to (the same as or better than) the current approach. Parallel performance is much better than the current approach.
>
> Some advice on two aspects would be most appreciated:
>
> 1) Is there an easy way to determine the sub-set of supported charsets that are applicable?
>
It's easy, though a little heavy :-) getLFCR returns a byte[] holding the encoded form of
'\n' and '\r' in a particular encoding, if each of them can be mapped to a single byte.
Then we can use b[0] for '\n' and b[1] for '\r' in trySplit(). This makes the new fast
version work for most charsets.
    private static byte[] getLFCR(Charset cs) {
        try {
            if (cs.canEncode()) {
                // Encode '\n' and '\r'; the fast path applies only if each
                // maps to exactly one byte.
                ByteBuffer bb = cs.newEncoder()
                    .encode(CharBuffer.wrap(new char[] { '\n', '\r' }));
                if (bb.remaining() == 2) {
                    // Round-trip check: the two bytes must decode back to
                    // '\n' and '\r'.
                    CharBuffer cb = cs.newDecoder().decode(bb);
                    if (cb.remaining() == 2 &&
                        cb.get() == '\n' && cb.get() == '\r') {
                        bb.flip();
                        byte[] ba = new byte[2];
                        bb.get(ba);
                        return ba;    // ba[0] is '\n', ba[1] is '\r'
                    }
                }
            }
        } catch (Exception x) {}
        return null;    // no single-byte mapping; fall back to the slow path
    }
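For illustration, here is a minimal sketch (not the code in the webrev) of how trySplit() might use the bytes returned by getLFCR: scan a MappedByteBuffer forward from the midpoint of the remaining range for the single-byte '\n' (b[0]) and split just after it. The class and method names below are hypothetical.

    import java.nio.MappedByteBuffer;

    class LineSplitSketch {
        // Hypothetical helper: find a split point on a line boundary inside
        // [lo, hi), scanning forward from the midpoint for the single-byte
        // '\n' obtained from getLFCR(cs).
        static int findSplitPoint(MappedByteBuffer buf, int lo, int hi, byte lf) {
            int mid = (lo + hi) >>> 1;
            for (int p = mid; p < hi; p++) {
                if (buf.get(p) == lf)
                    return p + 1;    // split just after the line feed
            }
            return hi;               // no line feed past the midpoint: don't split
        }
    }

A real trySplit() would then hand [lo, splitPoint) to the new Spliterator and keep [splitPoint, hi) for itself.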
-sherman