<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Wed Sep 10 17:50:49 PDT 2008

Would you believe that I'm getting file not found errors even with 
ISO-8859-1?

(Naoto: My program doesn't know what encoding to expect.  I'm afraid I
probably have different applications writing filenames in different
encodings on my Ubuntu system.  I'd been thinking I wanted to treat
filenames as just a sequence of bytes, and let the terminal emulator
interpret the encoding (hopefully) correctly on output.)

This gives two file not found tracebacks:

export LC_ALL='ISO-8859-1'
export LC_CTYPE="$LC_ALL"
export LANG="$LC_ALL"

find 'test-files' -type f -print | java -Xmx512M \
    -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main

find ~/Sound/Music -type f -print | java -Xmx512M \
    -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
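
It may be worth confirming what the JVM actually picked up from the
settings above.  The sketch below just prints the encoding-related
values a running JVM can see.  The class name ShowEncodings is made up
for illustration, and sun.jnu.encoding is an implementation-specific
property (assumed here to exist on Sun-derived JVMs; it may well be
absent, and print as null, on gcj or elsewhere).

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic helper, not part of equivs.jar.
class ShowEncodings {
    // Collects the encoding-related settings visible to the running JVM.
    static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("file.encoding    = ")
          .append(System.getProperty("file.encoding")).append('\n');
        // Implementation-specific property; null if the JVM does not define it.
        sb.append("sun.jnu.encoding = ")
          .append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("default charset  = ")
          .append(Charset.defaultCharset().name()).append('\n');
        // Environment variables print as null when unset.
        sb.append("LC_ALL           = ").append(System.getenv("LC_ALL")).append('\n');
        sb.append("LC_CTYPE         = ").append(System.getenv("LC_CTYPE")).append('\n');
        sb.append("LANG             = ").append(System.getenv("LANG")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it with the same exports and the same -Dfile.encoding flag as
the commands above would show whether the values agree with each other.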

I'm reading the filenames like this:

    try
       {
       while ((line = stdin.readLine()) != null)
          {
          // System.out.println(line);
          // System.out.flush();
          lst.add(new Sortable_file(line));
          }
       }
    catch (java.io.IOException e)
       {
       System.err.println("**** exception " + e);
       e.printStackTrace();
       }
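
The snippet above does not show how stdin is constructed.  One way to
take the default charset out of the picture on the reading side is to
name the charset explicitly when wrapping System.in.  This is only a
sketch: the class and method names (ReadNames, readNames) are
hypothetical, and it controls only how bytes become chars on input, not
how the JVM later turns a String back into bytes when opening the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code from equivs.jar.
class ReadNames {
    // Reads one name per line, decoding each byte as the char with the
    // same value (ISO-8859-1), regardless of file.encoding.
    static List<String> readNames(InputStream in) {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        List<String> names = new ArrayList<>();
        try {
            String line;
            // readLine splits on \n, \r, or \r\n, so a name containing
            // one of those bytes would be split into several entries.
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : readNames(System.in)) {
            System.out.println(name.length() + " chars read");
        }
    }
}
```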

Where Sortable_file's constructor just looks like:

    public Sortable_file(String filename)
       {
       this.filename = filename;
       /*
       Java doesn't have a stat function without doing some fancy stuff,
       so we skip this optimization.  It really only helps with hard
       links anyway.
       this.device = -1
       this.inode = -1
       */
       File file = new File(this.filename);
       this.size = file.length();
       // It bothers a little that we can't close this, but perhaps it's
       // unnecessary.  That'll be determined in large tests.
       // file.close();
       this.have_prefix = false;
       this.have_hash = false;
       }

..and the part that actually blows up looks like:

    private void get_prefix()
       {
       byte[] buffer = new byte[128];
       try
          {
          // The next line is the one that gives file not found
          FileInputStream file = new FileInputStream(this.filename);
          file.read(buffer);
          // System.out.println("this.prefix.length " + this.prefix.length);
          file.close();
          }
       catch (IOException ioe)
          {
          // System.out.println( "IO error: " + ioe );
          ioe.printStackTrace();
          System.exit(1);
          }
       this.prefix = new String(buffer);
       this.have_prefix = true;
       }

Interestingly, by the time it fails to read data from the file, it has
already fetched the file's length without an error.
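
A possible explanation for the length call not complaining: as far as I
know, java.io.File.length() is documented to return 0L when the file
does not exist instead of throwing, so a successful call does not show
that the name resolved.  A minimal sketch, using a made-up path that is
assumed not to exist (the class name MissingFileLength is hypothetical
too):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical demo class; the path used in main is assumed not to exist.
class MissingFileLength {
    // File.length() reports 0L for a missing file rather than throwing.
    static long lengthOf(String path) {
        return new File(path).length();
    }

    // True if opening the path for reading fails with FileNotFoundException.
    static boolean openFails(String path) {
        try (FileInputStream in = new FileInputStream(path)) {
            return false;
        } catch (FileNotFoundException e) {
            return true;
        } catch (IOException e) {
            // Only reachable if close() fails after a successful open.
            return false;
        }
    }

    public static void main(String[] args) {
        String missing = "no-such-dir-for-illustration/no-such-file";
        System.out.println("length     = " + lengthOf(missing));
        System.out.println("open fails = " + openFails(missing));
    }
}
```

If that holds, checking file.exists() in the constructor would surface
a mangled name earlier than the first read does.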

I don't -think- I'm doing anything screwy in there - could it be that 
ISO-8859-1 isn't giving good round-trip conversions in practice?  Would 
this be an attribute of the java runtime in question, or could it be a 
matter of the locale files on my Ubuntu system being a little off?  It 
would seem the locale files would be a better explanation (or a bug in 
my program I'm not seeing!), since I get the same errors with both 
OpenJDK and gcj.
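
The round-trip question can at least be checked at the charset level,
independent of the locale and of how the JVM encodes file names.  The
sketch below (class and method names are made up for illustration)
pushes all 256 byte values through a decode/encode cycle with
ISO-8859-1 and, for contrast, with UTF-8.  It also shows the
re-decoding for display that Martin describes in the reply quoted
below.  It says nothing about what the file-opening path does with the
resulting Strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for charset-level behaviour only.
class Latin1RoundTrip {
    // Every possible byte value, 0x00 through 0xFF, once each.
    static byte[] allByteValues() {
        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++) {
            b[i] = (byte) i;
        }
        return b;
    }

    // True if decoding then re-encoding with cs gives back the same bytes.
    static boolean roundTrips(byte[] raw, Charset cs) {
        String s = new String(raw, cs);
        return Arrays.equals(raw, s.getBytes(cs));
    }

    // Takes a String that was decoded as ISO-8859-1, recovers the original
    // bytes, and decodes them as UTF-8 for display.  Only meaningful when
    // every char in the input is <= 0xFF.
    static String forDisplay(String latin1Name) {
        byte[] raw = latin1Name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("ISO-8859-1 round-trips: "
            + roundTrips(all, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 round-trips:      "
            + roundTrips(all, StandardCharsets.UTF_8));
    }
}
```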

Martin Buchholz wrote:
> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
> guaranteeing no loss of data, and avoiding apparently impossible situations
> where the JDK gives you a list of files in a directory, but you get
> File not found when you try to open them.
> 
> If you want to show the file names to users, you can always take
> your ISO-8859-1 decoded strings, turn them back into byte[],
> and decode using UTF-8 later, if you so desired.
> (The basic OS interfaces in the JDK are not so flexible.
> They are hard-coded to use the one charset specified by file.encoding)
> 
> Martin
> 
> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
>> Why ISO-8859-1?  CJK filenames are guaranteed to fail in that case.  I'd
>> rather choose UTF-8, since the default encoding on recent Unix/Linux
>> systems is UTF-8, so the filenames are likely in UTF-8.
>>
>> Naoto
>>
>> Martin Buchholz wrote:
>>> Java made the decision to use String as an abstraction
>>> for many OS-specific objects, like filenames (or environment variables).
>>> Most of the time this works fine, but occasionally you can notice
>>> that the underlying OS (in the case of Unix) actually uses
>>> arbitrary byte arrays as filenames.
>>>
>>> It would have been much more confusing to provide an interface
>>> to filenames that is sometimes a sequence of char, sometimes a
>>> sequence of byte.
>>>
>>> So this is unlikely to change.
>>>
>>> But if all you want is reliable reversible conversion,
>>> using java -Dfile.encoding=ISO-8859-1
>>> should do the trick.
>>>
>>> Martin
>>>
>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com>
>>> wrote:
>>>> Sorry if this is the wrong list for this question.  I tried asking it
>>>> on comp.lang.java, but didn't get very far there.
>>>>
>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>> programs and rewriting it into a number of other languages.  It
>>>> started life in python, and I've recoded it into perl
>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>> Next on my list is java.  After that I'll probably do Haskell and
>>>> Eiffel/Sather.
>>>>
>>>> So the python and perl versions were pretty easy, but I'm finding that
>>>> the java version has a somewhat solution-resistant problem with
>>>> non-ASCII filenames.
>>>>
>>>> The program just reads filenames from stdin (usually generated with
>>>> the *ix find command), and then compares those files, dividing them up
>>>> into equal groups.
>>>>
>>>> The problem with the java version, which manifests both with OpenJDK
>>>> and gcj, is that the filenames being read from disk are 8 bit, and the
>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
>>>> but as far as the java language is concerned, those filenames are made
>>>> up of 16 bit characters.  That's fine, but going from 8 to 16 bit and
>>>> back to 8 bit seems to be non-information-preserving in this case,
>>>> which isn't so fine - I can clearly see the program, in an strace,
>>>> reading with one sequence of bytes, but then trying to open
>>>> another-though-related sequence of bytes.  To be perfectly clear: It's
>>>> getting file not found errors.
>>>>
>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>>>> program to handle files with one encoding, but not another.  I've
>>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>>> POSIX, UTF-8, and so on.
>>>>
>>>> Is there such a thing as a filename encoding that will map 8 bit
>>>> filenames to 16 bit characters, but only using the low 8 bits of those
>>>> 16, and then map back to 8 bit filenames only using those low 8 bits
>>>> again?
>>>>
>>>> Is there some other way of making a Java program on Linux able to read
>>>> filenames from stdin and later open those filenames?
>>>>
>>>> Thanks!
>>>>
>>
>> --
>> Naoto Sato
>>