number of bytes for each (uni)code point while using utf-8 as encoding ...
Albretch Mueller
lbrtchx at gmail.com
Wed Jul 11 03:39:22 PDT 2012
~
you can iterate through all the (uni)code points in a file encoded as
utf-8 (or any other encoding) like this:
~
...
// __
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
// __
String aOEnc = "UTF-8";
Charset InChrSt = Charset.forName(aOEnc);
CharsetDecoder InDec = InChrSt.newDecoder();
InDec.onMalformedInput(CodingErrorAction.REPORT);
InDec.onUnmappableCharacter(CodingErrorAction.REPORT);
// __
FileInputStream FIS = new FileInputStream(new File(<file path as string>));
FileChannel IFlChnl = FIS.getChannel();
MappedByteBuffer MptBytBfr =
 IFlChnl.map(FileChannel.MapMode.READ_ONLY, 0, IFlChnl.size());
CharBuffer MptChrBfr = InDec.decode(MptBytBfr); // throws CharacterCodingException on bad input
// __
while (MptChrBfr.hasRemaining()) {
 char c = MptChrBfr.get(); // a UTF-16 code unit, not always a whole code point
 ...
}
...
...
~
each time you get() a char from the buffer you are consuming a UTF-16
code unit; the (uni)code point it belongs to took from 1 to 4 bytes in
the UTF-8 file (and the sum of all those byte "lengths" should equal
the file length in bytes, right?)
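For reference, here is a small sketch of how one might compute the UTF-8 byte length of each code point from the decoded chars. The byte ranges come straight from the UTF-8 definition; the class and method names are mine, not from any library:

```java
public class Utf8Len {
    // Number of bytes a code point occupies when encoded as UTF-8
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    public static void main(String[] args) {
        // 'a' (1 byte), e-acute (2), a CJK char (3), an emoji (4)
        String s = "a\u00e9\u4e2d\ud83d\ude00";
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // joins surrogate pairs
            total += utf8Length(cp);
            i += Character.charCount(cp);    // 2 chars for supplementary code points
        }
        System.out.println(total);           // 1 + 2 + 3 + 4 = 10
    }
}
```

Note the iteration step: codePointAt()/charCount() is what lets you walk code points rather than chars, since a supplementary character occupies two chars in the decoded CharBuffer.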
~
I am using the (new) NIO in Java 7 and I wonder if Sun made changes
that make it hard to get the byte lengths of unicode points
~
The reason I need such a thing is that sometimes authors claim HTML
pages are UTF-8 encoded when they aren't, and/or pages somehow get
corrupted in transit. There are quite a few "monkeys" (us technical
people ;-)) out there messing with the metadata headers of HTML pages,
so it is not totally safe to trust the "Content-Type" specification,
and if something goes wrong there should be a way to get as close as
possible to the offset of the problematic char
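On that last point, one way to locate the byte offset of the problem is to use the three-argument decode() and inspect the CoderResult instead of letting decode(ByteBuffer) throw. A sketch under those assumptions (names are mine):

```java
import java.nio.*;
import java.nio.charset.*;

public class BadUtf8Offset {
    // Returns the byte offset of the first malformed or unmappable
    // sequence, or -1 if the whole buffer decodes cleanly as UTF-8.
    static long firstErrorOffset(ByteBuffer in) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult res = dec.decode(in, out, true);
            if (res.isError()) return in.position(); // bad bytes start here
            if (res.isUnderflow()) return -1;        // input fully decoded
            out.clear();                             // OVERFLOW: reuse the buffer
        }
    }

    public static void main(String[] args) {
        // "ab" followed by a lone continuation byte 0x80 at offset 2
        byte[] bad = { 'a', 'b', (byte) 0x80, 'c' };
        System.out.println(firstErrorOffset(ByteBuffer.wrap(bad))); // 2
    }
}
```

When decode() stops with an error result, the input buffer's position() is left at the start of the offending bytes, which is exactly the offset you are after; this works the same on a MappedByteBuffer.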
~
How can you get the number of bytes you "get()"?
~
thank you
lbrtchx