number of bytes for each (uni)code point while using utf-8 as encoding ...
Albretch Mueller
lbrtchx at gmail.com
Wed Jul 11 03:39:22 PDT 2012
~
you can iterate through all the (uni)code points in a file encoded as
utf-8 (or any other encoding) like this:
~
...
// __
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
// __
String aOEnc = "UTF-8";
Charset InChrSt = Charset.forName(aOEnc);
CharsetDecoder InDec = InChrSt.newDecoder();
InDec.onMalformedInput(CodingErrorAction.REPORT);
InDec.onUnmappableCharacter(CodingErrorAction.REPORT);
// __
FileInputStream FIS = new FileInputStream(new File(<file path as string>));
FileChannel IFlChnl = FIS.getChannel();
MappedByteBuffer MptBytBfr =
 IFlChnl.map(FileChannel.MapMode.READ_ONLY, 0, IFlChnl.size());
CharBuffer MptChrBfr = InDec.decode(MptBytBfr); // throws CharacterCodingException on bad input
// __
while (MptChrBfr.hasRemaining()) {
 char c = MptChrBfr.get(); // a UTF-16 code unit, not always a whole code point
 ...
}
...
...
~
each time you get() a char from the buffer you are consuming a UTF-16
code unit; the (uni)code point it belongs to took from 1 to 4 bytes in
the UTF-8 file (and the sum of all those byte "lengths" should equal
the file length in bytes, right?)
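For reference, here is a small sketch of how one might compute the UTF-8 byte length of each code point from the decoded chars. The byte ranges come straight from the UTF-8 definition; the class and method names are mine, not from any library:

```java
public class Utf8Len {
    // Number of bytes a code point occupies when encoded as UTF-8
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    public static void main(String[] args) {
        // 'a' (1 byte), e-acute (2), a CJK char (3), an emoji (4)
        String s = "a\u00e9\u4e2d\ud83d\ude00";
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // joins surrogate pairs
            total += utf8Length(cp);
            i += Character.charCount(cp);    // 2 chars for supplementary code points
        }
        System.out.println(total);           // 1 + 2 + 3 + 4 = 10
    }
}
```

Note the iteration step: codePointAt()/charCount() is what lets you walk code points rather than chars, since a supplementary character occupies two chars in the decoded CharBuffer.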
~
I am using the (new) NIO in Java 7 and I wonder if Sun made changes
that make it hard to get the byte lengths of unicode points
~
The reason I need such a thing is that sometimes authors claim HTML
pages are UTF-8 encoded when they aren't, and/or pages somehow get
corrupted in transit. There are quite a few "monkeys" (us technical
people ;-)) out there messing with the metadata headers of HTML pages,
so it is not totally safe to trust the "Content-Type" specification,
and if something goes wrong there should be a way to get as close as
possible to the offset of the problematic char
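On that last point, one way to locate the byte offset of the problem is to use the three-argument decode() and inspect the CoderResult instead of letting decode(ByteBuffer) throw. A sketch under those assumptions (names are mine):

```java
import java.nio.*;
import java.nio.charset.*;

public class BadUtf8Offset {
    // Returns the byte offset of the first malformed or unmappable
    // sequence, or -1 if the whole buffer decodes cleanly as UTF-8.
    static long firstErrorOffset(ByteBuffer in) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult res = dec.decode(in, out, true);
            if (res.isError()) return in.position(); // bad bytes start here
            if (res.isUnderflow()) return -1;        // input fully decoded
            out.clear();                             // OVERFLOW: reuse the buffer
        }
    }

    public static void main(String[] args) {
        // "ab" followed by a lone continuation byte 0x80 at offset 2
        byte[] bad = { 'a', 'b', (byte) 0x80, 'c' };
        System.out.println(firstErrorOffset(ByteBuffer.wrap(bad))); // 2
    }
}
```

When decode() stops with an error result, the input buffer's position() is left at the start of the offending bytes, which is exactly the offset you are after; this works the same on a MappedByteBuffer.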
~
How can you get the number of bytes you "get()"?
~
thank you
lbrtchx