<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Martin Buchholz martinrb at google.com
Tue Sep 9 20:58:02 PDT 2008


Java chose to use String as the abstraction
for many OS-specific objects, such as filenames and environment variables.
Most of the time this works fine, but occasionally you notice
that the underlying OS (Unix, in this case) actually treats
filenames as arbitrary byte arrays.

It would have been much more confusing to provide an interface
to filenames that is sometimes a sequence of char, sometimes a
sequence of byte.

So this is unlikely to change.

But if all you want is a reliable, reversible conversion,
running with java -Dfile.encoding=ISO-8859-1
should do the trick: ISO-8859-1 maps each of the 256 byte values
to its own char, and back again.
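
As a rough illustration of why ISO-8859-1 is the encoding to pick here,
the sketch below checks that every byte value survives a trip through a
String and back, while the same bytes decoded as UTF-8 do not. It only
exercises the charset conversion and never touches the filesystem. The
class and method names (Latin1RoundTrip, roundTrips) are made up for
this example, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Hypothetical helper: decode bytes to a String with the given
    // charset, encode back, and report whether the bytes are unchanged.
    static boolean roundTrips(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return Arrays.equals(raw, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }

        // ISO-8859-1 maps byte N to char N, so nothing is lost.
        System.out.println("ISO-8859-1: " + roundTrips(all, StandardCharsets.ISO_8859_1));

        // Bytes such as a lone 0x80 are not valid UTF-8; they decode to
        // U+FFFD and encode back as three different bytes.
        System.out.println("UTF-8:      " + roundTrips(all, StandardCharsets.UTF_8));
    }
}
```

The first line printed should report true and the second false, which is
the same kind of mismatch that shows up in strace as a read of one byte
sequence followed by an open of another.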

Martin

On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com> wrote:
> Sorry if this is the wrong list for this question.  I tried asking it
> on comp.lang.java, but didn't get very far there.
>
> I've been wanting to expand my horizons a bit by taking one of my
> programs and rewriting it into a number of other languages.  It
> started life in python, and I've recoded it into perl
> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
> Next on my list is java.  After that I'll probably do Haskell and
> Eiffel/Sather.
>
> So the python and perl versions were pretty easy, but I'm finding that
> the java version has a somewhat solution-resistant problem with
> non-ASCII filenames.
>
> The program just reads filenames from stdin (usually generated with
> the *ix find command), and then compares those files, dividing them up
> into equal groups.
>
> The problem with the java version, which manifests both with OpenJDK
> and gcj, is that the filenames being read from disk are 8 bit, and the
> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
> but as far as the java language is concerned, those filenames are made
> up of 16 bit characters.  That's fine, but going from 8 to 16 bit and
> back to 8 bit seems to be non-information-preserving in this case,
> which isn't so fine - I can clearly see the program, in an strace,
> reading with one sequence of bytes, but then trying to open
> another-though-related sequence of bytes.  To be perfectly clear: It's
> getting file not found errors.
>
> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
> program to handle files with one encoding, but not another.  I've
> tried a bunch of values in these variables, including ISO-8859-1, C,
> POSIX, UTF-8, and so on.
>
> Is there such a thing as a filename encoding that will map 8 bit
> filenames to 16 bit characters, but only using the low 8 bits of those
> 16, and then map back to 8 bit filenames only using those low 8 bits
> again?
>
> Is there some other way of making a Java program on Linux able to read
> filenames from stdin and later open those filenames?
>
> Thanks!
>