<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Martin Buchholz martinrb at google.com
Wed Sep 10 15:14:16 PDT 2008


ISO-8859-1 guarantees round-trip conversion between bytes and chars,
so no data is lost and you avoid apparently impossible situations
where the JDK gives you a list of files in a directory, but you get
File not found when you try to open them.

If you want to show the file names to users, you can always take
your ISO-8859-1 decoded strings, turn them back into byte[],
and decode them using UTF-8 later, if you so desire.
(The basic OS interfaces in the JDK are not so flexible;
they are hard-coded to use the one charset specified by file.encoding.)
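
For example, here is a minimal sketch of that re-decoding trick (the
class and method names are mine, just for illustration):

    import java.io.UnsupportedEncodingException;

    public class RoundTrip {
        // Decode OS bytes as ISO-8859-1: every byte value maps to exactly
        // one char, so the original byte sequence is always recoverable.
        static String decode(byte[] osBytes) throws UnsupportedEncodingException {
            return new String(osBytes, "ISO-8859-1");
        }

        // Recover the exact original bytes, e.g. to hand back to the OS.
        static byte[] encode(String s) throws UnsupportedEncodingException {
            return s.getBytes("ISO-8859-1");
        }

        // For display only: reinterpret those same bytes as UTF-8.
        static String display(String s) throws UnsupportedEncodingException {
            return new String(encode(s), "UTF-8");
        }
    }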

Martin

On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
> Why ISO-8859-1?  CJK filenames are guaranteed to fail in that case.  I'd
> rather choose UTF-8, as the default encoding on recent Unix/Linux systems
> is UTF-8, so the filenames are likely in UTF-8.
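
[To make the failure mode concrete: with ISO-8859-1 a CJK name still
round-trips byte-for-byte, but what the user sees is garbled.  A small
demonstration, the class name made up for illustration:

    public class Mojibake {
        public static void main(String[] args) throws Exception {
            // UTF-8 bytes of a CJK string ("Japanese" written in Japanese).
            byte[] utf8 = "\u65E5\u672C\u8A9E".getBytes("UTF-8");
            // Decoded as ISO-8859-1 the bytes round-trip exactly, but the
            // string prints as Latin-1 gibberish, not the CJK characters.
            String latin1 = new String(utf8, "ISO-8859-1");
            System.out.println(latin1);
            System.out.println(java.util.Arrays.equals(
                latin1.getBytes("ISO-8859-1"), utf8));  // true
        }
    }
]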
>
> Naoto
>
> Martin Buchholz wrote:
>>
>> Java made the decision to use String as an abstraction
>> for many OS-specific objects, like filenames (or environment variables).
>> Most of the time this works fine, but occasionally you can notice
>> that the underlying OS (in the case of Unix) actually uses
>> arbitrary byte arrays as filenames.
>>
>> It would have been much more confusing to provide an interface
>> to filenames that is sometimes a sequence of char, sometimes a
>> sequence of byte.
>>
>> So this is unlikely to change.
>>
>> But if all you want is reliable reversible conversion,
>> using java -Dfile.encoding=ISO-8859-1
>> should do the trick.
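
[A quick way to check that, assuming file.encoding drives the filename
decoding as described above; the class name is made up:

    // ListAndOpen.java: list a directory, then try to reopen each entry.
    // Run as:  java -Dfile.encoding=ISO-8859-1 ListAndOpen /some/dir
    import java.io.File;
    import java.io.FileInputStream;

    public class ListAndOpen {
        public static void main(String[] args) throws Exception {
            for (File f : new File(args[0]).listFiles()) {
                // With -Dfile.encoding=ISO-8859-1 each listed name should
                // round-trip byte-for-byte, so this open should not throw.
                new FileInputStream(f).close();
                System.out.println("ok: " + f);
            }
        }
    }
]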
>>
>> Martin
>>
>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com>
>> wrote:
>>>
>>> Sorry if this is the wrong list for this question.  I tried asking it
>>> on comp.lang.java, but didn't get very far there.
>>>
>>> I've been wanting to expand my horizons a bit by taking one of my
>>> programs and rewriting it into a number of other languages.  It
>>> started life in python, and I've recoded it into perl
>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>> Next on my list is java.  After that I'll probably do Haskell and
>>> Eiffel/Sather.
>>>
>>> So the python and perl versions were pretty easy, but I'm finding that
>>> the java version has a somewhat solution-resistant problem with
>>> non-ASCII filenames.
>>>
>>> The program just reads filenames from stdin (usually generated with
>>> the *ix find command), and then compares those files, dividing them up
>>> into equal groups.
>>>
>>> The problem with the java version, which manifests both with OpenJDK
>>> and gcj, is that the filenames read from disk are 8-bit, and the
>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8-bit,
>>> but as far as the java language is concerned, those filenames are
>>> made up of 16-bit characters.  That's fine, but going from 8 bits to
>>> 16 and back to 8 seems to be non-information-preserving in this case,
>>> which isn't so fine: in an strace I can clearly see the program
>>> reading one sequence of bytes, but then trying to open
>>> another-though-related sequence of bytes.  To be perfectly clear:
>>> it's getting file not found errors.
>>>
>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>>> program to handle files with one encoding, but not another.  I've
>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>> POSIX, UTF-8, and so on.
>>>
>>> Is there such a thing as a filename encoding that will map 8-bit
>>> filenames to 16-bit characters using only the low 8 bits of those
>>> 16, and then map back to 8-bit filenames using only those low 8 bits
>>> again?
>>>
>>> Is there some other way of making a Java program on Linux able to read
>>> filenames from stdin and later open those filenames?
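
[One way to do that is to read stdin with an explicit ISO-8859-1 decoder,
so each input byte becomes exactly one char.  Note the JVM still
re-encodes the name via file.encoding when opening the file, so this
wants to be combined with -Dfile.encoding=ISO-8859-1 as above.  A sketch,
class name mine:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.InputStreamReader;

    public class OpenFromStdin {
        public static void main(String[] args) throws Exception {
            // Force ISO-8859-1 instead of the locale-dependent default.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, "ISO-8859-1"));
            String name;
            while ((name = in.readLine()) != null) {
                System.out.println(
                    (new File(name).exists() ? "found" : "MISSING")
                    + "\t" + name);
            }
        }
    }
]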
>>>
>>> Thanks!
>>>
>
>
> --
> Naoto Sato
>


