RFR 8072773 (fs) Files.lines needs a better splitting implementation for stream source

Wed Jun 3 19:58:58 UTC 2015

On Jun 3, 2015, at 9:19 PM, Xueming Shen <xueming.shen at oracle.com> wrote:

> On 06/03/2015 08:53 AM, Paul Sandoz wrote:
>> Hi,
>> 
>> Please review an optimization for Files.lines for certain charsets:
>> 
>>  http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8072773-File-lines/webrev/
>> 
>> If a charset is, say, US-ASCII or UTF-8, it is possible to implement an efficient splitting Spliterator that scans bytes from a mid-point to search for line feed characters.
>> 
>> Splitting uses a mapped byte buffer. Traversal uses FileChannel reads at an offset. In previous incarnations I tried to use a mapped byte buffer for both, but for some reason the traversal performance was not good (both on Mac and x86). In any case I am happy with the current approach, as there is minimal layering between the FileChannel and the BufferedReader leveraged to read the lines.
>> 
>> Sequential performance is similar to (the same as or better than) the current approach. Parallel performance is much better than the current approach.
>> 
>> Some advice on two aspects would be most appreciated:
>> 
>> 1) Is there an easy way to determine the sub-set of supported charsets that are applicable?
>> 
> 
> It's easy though a little heavy :-)

Thanks, that is a little heavy, but I suppose computed values for charsets could be stashed in a static CHM.

Paul.
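
A minimal sketch of the "stash in a static CHM" idea, for illustration only. The class name LineTerminatorBytes and the method names lookup and encodeLFCR are hypothetical and are not taken from the webrev; encodeLFCR follows the same check as the getLFCR method quoted below. Because ConcurrentHashMap does not accept null values, the sketch records "charset does not qualify" as an empty Optional, so the encoder/decoder round trip runs at most once per charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: caches the single-byte forms of '\n' and '\r' per charset.
final class LineTerminatorBytes {
    // ConcurrentHashMap rejects null values, so "not applicable" is stored
    // as Optional.empty() rather than null.
    private static final ConcurrentHashMap<Charset, Optional<byte[]>> CACHE =
            new ConcurrentHashMap<>();

    // Returns {LF byte, CR byte} if the charset qualifies, otherwise empty.
    // The array is shared between callers and must not be modified.
    static Optional<byte[]> lookup(Charset cs) {
        return CACHE.computeIfAbsent(cs, c -> Optional.ofNullable(encodeLFCR(c)));
    }

    // Same idea as the quoted getLFCR: encode "\n\r", require one byte per
    // character, and verify that the bytes decode back to "\n\r".
    private static byte[] encodeLFCR(Charset cs) {
        if (!cs.canEncode()) {
            return null;
        }
        try {
            ByteBuffer bb = cs.newEncoder().encode(CharBuffer.wrap("\n\r"));
            if (bb.remaining() != 2) {
                return null;
            }
            // Decode a duplicate so that bb's own position is left untouched.
            CharBuffer cb = cs.newDecoder().decode(bb.duplicate());
            if (cb.remaining() != 2 || cb.get() != '\n' || cb.get() != '\r') {
                return null;
            }
            byte[] ba = new byte[2];
            bb.get(ba);
            return ba;
        } catch (CharacterCodingException e) {
            return null; // any coding failure means the charset does not qualify
        }
    }
}
```

With this sketch, LineTerminatorBytes.lookup(StandardCharsets.UTF_8) yields the bytes 0x0A and 0x0D, while UTF-16LE yields an empty Optional because each character encodes to two bytes.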

> getLFCR returns a byte[] holding the "byte" form of
> \n and \r in a particular encoding, if each of them maps to
> a single byte. We can then use b[0] for \n and b[1] for \r in trySplit(). This makes
> the new fast version work for most charsets.
> 
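>   private static byte[] getLFCR(Charset cs) {
>       try {
>           if (cs.canEncode()) {
>               // Encode "\n\r"; expect exactly one byte per character.
>               ByteBuffer bb = cs.newEncoder()
>                                 .encode(CharBuffer.wrap(new char[] { '\n', '\r' }));
>               if (bb.remaining() == 2) {
>                   // Round-trip check: the two bytes must decode back to "\n\r".
>                   CharBuffer cb = cs.newDecoder().decode(bb);
>                   if (cb.remaining() == 2 &&
>                       cb.get() == '\n' && cb.get() == '\r') {
>                       bb.flip();  // decode() consumed bb; rewind to re-read the bytes
>                       byte[] ba = new byte[2];
>                       bb.get(ba);
>                       return ba;
>                   }
>               }
>           }
>       } catch (Exception x) {
>           // Any failure means the charset does not qualify.
>       }
>       return null;
>   }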
> 
> -sherman
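
For illustration, here is a rough sketch of how trySplit() might use the two bytes returned by getLFCR to pick a split point. It is not the code in the webrev: the class name SplitPointSketch, the method findSplit and its parameters are hypothetical. It assumes that the LF and CR byte values never occur inside a multi-byte sequence, which holds for US-ASCII and UTF-8 but is not established by the getLFCR check alone.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: scan forward from the mid-point of [lo, hi) for a line
// terminator and return the index at which the second half would start.
final class SplitPointSketch {
    // Returns the index just past the first terminator found at or after the
    // mid-point, or -1 if no split would leave a non-empty second half.
    static int findSplit(ByteBuffer buf, int lo, int hi, byte lf, byte cr) {
        int mid = (lo + hi) >>> 1;
        for (int i = mid; i < hi; i++) {
            byte b = buf.get(i); // absolute get; does not move buf's position
            int split;
            if (b == lf) {
                split = i + 1;
            } else if (b == cr) {
                // Treat CR LF as a single terminator so it is never split apart.
                split = (i + 1 < hi && buf.get(i + 1) == lf) ? i + 2 : i + 1;
            } else {
                continue;
            }
            return split < hi ? split : -1;
        }
        return -1;
    }
}
```

If the mid-point happens to land on the LF of a CR LF pair, the returned index is still just past the complete terminator, so both halves begin at a line boundary. A result of -1 would correspond to trySplit() returning null.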