RFR 8072773 (fs) Files.lines needs a better splitting implementation for stream source
Xueming Shen
xueming.shen at oracle.com
Wed Jun 3 19:19:33 UTC 2015
On 06/03/2015 08:53 AM, Paul Sandoz wrote:
> Hi,
>
> Please review an optimization for Files.lines for certain charsets:
>
> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8072773-File-lines/webrev/
>
> If a charset is, say, US-ASCII or UTF-8, it is possible to implement an efficient splitting Spliterator that scans bytes from a mid-point to search for line feed characters.
>
> Splitting uses a mapped byte buffer. Traversal uses FileChannel reads at an offset. In previous incarnations I tried to use a mapped byte buffer for both, but for some reason the traversal performance was not good (on both Mac and x86). In any case I am happy with the current approach, as there is minimal layering between the FileChannel and the BufferedReader leveraged to read the lines.
>
> Sequential performance is similar to (the same as or better than) the current approach. Parallel performance is much better than the current approach.
>
> Some advice on two aspects would be most appreciated:
>
> 1) Is there an easy way to determine the sub-set of supported charsets that are applicable?
>
It's easy, though a little heavy :-) getLFCR returns a byte[] holding the encoded form of
'\n' and '\r' in a particular encoding, if each of them can be mapped to a single byte.
Then we can use b[0] for '\n' and b[1] for '\r' in trySplit(). This makes the new fast
version work for most charsets.
    private static byte[] getLFCR(Charset cs) {
        try {
            if (cs.canEncode()) {
                // Encode '\n' and '\r'; the fast path applies only if each
                // maps to exactly one byte.
                ByteBuffer bb = cs.newEncoder()
                    .encode(CharBuffer.wrap(new char[] { '\n', '\r' }));
                if (bb.remaining() == 2) {
                    // Round-trip check: the two bytes must decode back to
                    // '\n' and '\r'.
                    CharBuffer cb = cs.newDecoder().decode(bb);
                    if (cb.remaining() == 2 &&
                        cb.get() == '\n' && cb.get() == '\r') {
                        bb.flip();
                        byte[] ba = new byte[2];
                        bb.get(ba);
                        return ba;    // ba[0] is '\n', ba[1] is '\r'
                    }
                }
            }
        } catch (Exception x) {}
        return null;    // no single-byte mapping; fall back to the slow path
    }
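For illustration, here is a minimal sketch (not the code in the webrev) of how trySplit() might use the bytes returned by getLFCR: scan a MappedByteBuffer forward from the midpoint of the remaining range for the single-byte '\n' (b[0]) and split just after it. The class and method names below are hypothetical.

    import java.nio.MappedByteBuffer;

    class LineSplitSketch {
        // Hypothetical helper: find a split point on a line boundary inside
        // [lo, hi), scanning forward from the midpoint for the single-byte
        // '\n' obtained from getLFCR(cs).
        static int findSplitPoint(MappedByteBuffer buf, int lo, int hi, byte lf) {
            int mid = (lo + hi) >>> 1;
            for (int p = mid; p < hi; p++) {
                if (buf.get(p) == lf)
                    return p + 1;    // split just after the line feed
            }
            return hi;               // no line feed past the midpoint: don't split
        }
    }

A real trySplit() would then hand [lo, splitPoint) to the new Spliterator and keep [splitPoint, hi) for itself.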
-sherman