<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Naoto Sato Naoto.Sato at Sun.COM
Wed Sep 10 14:54:31 PDT 2008


Why ISO-8859-1?  CJK filenames are guaranteed to fail in that case.  I'd 
rather choose UTF-8: the default encoding on recent Unix/Linux systems 
is UTF-8, so the filenames are likely to be in UTF-8.
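
To make the trade-off between the two encodings concrete, here is a
minimal sketch (not taken from any of the programs discussed in this
thread) that decodes raw filename bytes into a String and encodes them
back. The class name RoundTripDemo, the helper roundTrip, and the sample
byte sequences are hypothetical, and it assumes Java 7 or later for
java.nio.charset.StandardCharsets. It checks only whether the bytes
survive the round trip, not whether the intermediate String is readable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Decode raw bytes to a String with the given charset, then encode back.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Hypothetical filename bytes: "caf" followed by 0xE9
        // (e-acute in ISO-8859-1). This is not a valid UTF-8 sequence.
        byte[] latin1Name = {'c', 'a', 'f', (byte) 0xE9};

        // Hypothetical filename bytes: two CJK characters encoded in UTF-8.
        byte[] utf8Name = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5,
                           (byte) 0xE6, (byte) 0x9C, (byte) 0xAC};

        // ISO-8859-1 maps each byte to one char and back, so any byte
        // sequence survives the round trip.
        System.out.println("latin1 bytes via ISO-8859-1: "
            + Arrays.equals(latin1Name,
                            roundTrip(latin1Name, StandardCharsets.ISO_8859_1)));
        System.out.println("utf8 bytes via ISO-8859-1:   "
            + Arrays.equals(utf8Name,
                            roundTrip(utf8Name, StandardCharsets.ISO_8859_1)));

        // UTF-8 round-trips valid UTF-8 input, but malformed input is
        // replaced with U+FFFD during decoding, so the bytes change.
        System.out.println("utf8 bytes via UTF-8:        "
            + Arrays.equals(utf8Name,
                            roundTrip(utf8Name, StandardCharsets.UTF_8)));
        System.out.println("latin1 bytes via UTF-8:      "
            + Arrays.equals(latin1Name,
                            roundTrip(latin1Name, StandardCharsets.UTF_8)));

        // The cost of ISO-8859-1: the CJK name is held as six unrelated
        // chars inside the program instead of two readable ones.
        System.out.println("chars seen via ISO-8859-1:   "
            + new String(utf8Name, StandardCharsets.ISO_8859_1).length());
        System.out.println("chars seen via UTF-8:        "
            + new String(utf8Name, StandardCharsets.UTF_8).length());
    }
}
```

In this sketch ISO-8859-1 preserves the bytes in both cases but leaves
the CJK name unreadable as text, while UTF-8 yields readable text only
when the name really is valid UTF-8.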

Naoto

Martin Buchholz wrote:
> Java made the decision to use String as an abstraction
> for many OS-specific objects, like filenames (or environment variables).
> Most of the time this works fine, but occasionally you can notice
> that the underlying OS (in the case of Unix) actually uses
> arbitrary byte arrays as filenames.
> 
> It would have been much more confusing to provide an interface
> to filenames that is sometimes a sequence of char, sometimes a
> sequence of byte.
> 
> So this is unlikely to change.
> 
> But if all you want is reliable reversible conversion,
> using java -Dfile.encoding=ISO-8859-1
> should do the trick.
> 
> Martin
> 
> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com> wrote:
>> Sorry if this is the wrong list for this question.  I tried asking it
>> on comp.lang.java, but didn't get very far there.
>>
>> I've been wanting to expand my horizons a bit by taking one of my
>> programs and rewriting it into a number of other languages.  It
>> started life in python, and I've recoded it into perl
>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>> Next on my list is java.  After that I'll probably do Haskell and
>> Eiffel/Sather.
>>
>> So the python and perl versions were pretty easy, but I'm finding that
>> the java version has a somewhat solution-resistant problem with
>> non-ASCII filenames.
>>
>> The program just reads filenames from stdin (usually generated with
>> the *ix find command), and then compares those files, dividing them up
>> into equal groups.
>>
>> The problem with the java version, which manifests both with OpenJDK
>> and gcj, is that the filenames being read from disk are 8 bit, and the
>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
>> but as far as the java language is concerned, those filenames are made
>> up of 16 bit characters.  That's fine, but going from 8 to 16 bit and
>> back to 8 bit seems to be non-information-preserving in this case,
>> which isn't so fine - I can clearly see the program, in an strace,
>> reading with one sequence of bytes, but then trying to open
>> another-though-related sequence of bytes.  To be perfectly clear: It's
>> getting file not found errors.
>>
>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>> program to handle files with one encoding, but not another.  I've
>> tried a bunch of values in these variables, including ISO-8859-1, C,
>> POSIX, UTF-8, and so on.
>>
>> Is there such a thing as a filename encoding that will map 8 bit
>> filenames to 16 bit characters, but only using the low 8 bits of those
>> 16, and then map back to 8 bit filenames only using those low 8 bits
>> again?
>>
>> Is there some other way of making a Java program on Linux able to read
>> filenames from stdin and later open those filenames?
>>
>> Thanks!
>>


-- 
Naoto Sato


