<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Sat Sep 13 09:32:52 PDT 2008

Sadly, I'm still getting ghost files with C and ISO-8859-1:

./wrapper
+ case 3 in
+ export LC_ALL=C
+ LC_ALL=C
+ export LC_CTYPE=C
+ LC_CTYPE=C
+ export LANG=C
+ LANG=C
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various 
Artists/Dreamland/11 - CanciÃ³n Para Dormir a un NiÃ±o (Argentina).flac 
(No such file or directory)
         at java.io.FileInputStream.open(Native Method)
         at java.io.FileInputStream.<init>(FileInputStream.java:106)
         at java.io.FileInputStream.<init>(FileInputStream.java:66)
         at Sortable_file.get_prefix(Sortable_file.java:56)
         at Sortable_file.compareTo(Sortable_file.java:159)
         at Sortable_file.compareTo(Sortable_file.java:1)
         at java.util.Arrays.mergeSort(Arrays.java:1167)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.mergeSort(Arrays.java:1156)
         at java.util.Arrays.mergeSort(Arrays.java:1156)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.mergeSort(Arrays.java:1155)
         at java.util.Arrays.sort(Arrays.java:1079)
         at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1
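
For reference, here is a minimal sketch of the round-trip property in question, separate from equivs itself. The class name RoundTripSketch is hypothetical, and StandardCharsets assumes Java 7 or later (on older runtimes Charset.forName("ISO-8859-1") would be the equivalent). It only checks the String<->byte[] conversion in isolation; it says nothing about which charset the JDK actually uses internally when it opens a file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of equivs.jar.
class RoundTripSketch {
    // Every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if decoding the bytes to a String and encoding that String
    // back with the same charset yields the original bytes.
    static boolean roundTrips(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return Arrays.equals(raw, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("ISO-8859-1: " + roundTrips(all, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII:   " + roundTrips(all, StandardCharsets.US_ASCII));
        System.out.println("UTF-8:      " + roundTrips(all, StandardCharsets.UTF_8));
    }
}
```

Run as written, the ISO-8859-1 line should report true and the other two false: bytes above 0x7F have no US-ASCII mapping, and lone high bytes are not valid UTF-8, so both get replaced during decoding.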

Martin Buchholz wrote:
> On Wed, Sep 10, 2008 at 17:50, Dan Stromberg <strombrg at gmail.com> wrote:
>>
>> Would you believe that I'm getting file not found errors even with
>> ISO-8859-1?
> 
> The software world is full of surprises.
> 
> Try
> export LANG=C LC_ALL=C LC_CTYPE=C
> java ... -Dfile.encoding=ISO-8859-1 ...
> 
> You could also be explicit about the
> encoding used when doing any kind of char<->byte
> conversion, e.g. reading from stdin or writing to stdout.
> 
> Oh, and this applies only to traditional Unix systems like
> Linux and Solaris.  Windows and MacOSX
> (at least should) act very differently in this area.
> 
> Martin
> 
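
Being explicit about the stdin encoding, as suggested above, could look roughly like the sketch below. The class and method names (ExplicitStdinSketch, readNames) are hypothetical, and StandardCharsets assumes Java 7 or later. Note the limits: this pins down only the stdin half of the conversion, the String-to-bytes step performed when a file is opened is still up to the runtime, and readLine() will split any filename that contains a newline or carriage return.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of equivs.jar.
class ExplicitStdinSketch {
    // Reads one name per line. With ISO-8859-1 each input byte becomes
    // the char with the same numeric value, so nothing is replaced.
    static List<String> readNames(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        List<String> names = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            names.add(line);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : readNames(System.in)) {
            System.out.println(name.length() + " chars read");
        }
    }
}
```
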
>> (Naoto: My program doesn't know what encoding to expect - I'm afraid I
>> probably have different applications writing filenames in different
>> encodings on my Ubuntu system.  I'd been thinking I wanted to treat
>> filenames as just a sequence of bytes, and let the terminal emulator
>> interpret the encoding (hopefully) correctly on output).
>>
>>
>>
>> This gives two file not found tracebacks:
>>
>> export LC_ALL='ISO-8859-1'
>> export LC_CTYPE="$LC_ALL"
>> export LANG="$LC_ALL"
>>
>> find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 \
>>    -jar equivs.jar equivs.main
>>
>> find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 \
>>    -jar equivs.jar equivs.main
>>
>>
>>
>> I'm reading the filenames like this:
>>
>> try
>>    {
>>    while ((line = stdin.readLine()) != null)
>>       {
>>       // System.out.println(line);
>>       // System.out.flush();
>>       lst.add(new Sortable_file(line));
>>       }
>>    }
>> catch (java.io.IOException e)
>>    {
>>    System.err.println("**** exception " + e);
>>    e.printStackTrace();
>>    }
>>
>>
>>
>> Where Sortable_file's constructor just looks like:
>>
>>   public Sortable_file(String filename)
>>      {
>>      this.filename = filename;
>>      /*
>>      Java doesn't have a stat function without doing some fancy stuff, so we
>>      skip this optimization.  It really only helps with hard links anyway.
>>      this.device = -1
>>      this.inode = -1
>>      */
>>      File file = new File(this.filename);
>>      this.size = file.length();
>>      // It bothers a little that we can't close this, but perhaps it's
>>      // unnecessary.  That'll be determined in large tests.
>>      // file.close();
>>      this.have_prefix = false;
>>      this.have_hash = false;
>>      }
>>
>>
>>
>> ..and the part that actually blows up looks like:
>>
>>   private void get_prefix()
>>      {
>>      byte[] buffer = new byte[128];
>>      try
>>         {
>>         // The next line is the one that gives file not found
>>         FileInputStream file = new FileInputStream(this.filename);
>>         file.read(buffer);
>>         // System.out.println("this.prefix.length " + this.prefix.length);
>>         file.close();
>>         }
>>      catch (IOException ioe)
>>         {
>>         // System.out.println( "IO error: " + ioe );
>>         ioe.printStackTrace();
>>         System.exit(1);
>>         }
>>      this.prefix = new String(buffer);
>>      this.have_prefix = true;
>>      }
>>
>>
>>
>> Interestingly, it's already tried to get the file's length without an error
>> when it goes to read data from the file and has trouble.
>>
>> I don't -think- I'm doing anything screwy in there - could it be that
>> ISO-8859-1 isn't giving good round-trip conversions in practice?  Would this
>> be an attribute of the java runtime in question, or could it be a matter of
>> the locale files on my Ubuntu system being a little off?  It would seem the
>> locale files would be a better explanation (or a bug in my program I'm not
>> seeing!), since I get the same errors with both OpenJDK and gcj.
>>
>> Martin Buchholz wrote:
>>> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
>>> so no data is lost and you avoid apparently impossible situations
>>> where the JDK gives you a list of files in a directory, but you get
>>> File not found when you try to open them.
>>>
>>> If you want to show the file names to users, you can always take
>>> your ISO-8859-1 decoded strings, turn them back into byte[],
>>> and decode using UTF-8 later, if you so desired.
>>> (The basic OS interfaces in the JDK are not so flexible.
>>> They are hard-coded to use the one charset specified by file.encoding)
>>>
>>> Martin
>>>
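
The display step Martin describes (ISO-8859-1 string back to byte[], then decode as UTF-8) might be sketched as follows. DisplayNameSketch and forDisplay are hypothetical names, StandardCharsets assumes Java 7 or later, and the sketch assumes the name on disk really is UTF-8; a name in some other encoding would come out with replacement characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of equivs.jar.
class DisplayNameSketch {
    // Recovers the original bytes from an ISO-8859-1 decoded name,
    // then reinterprets those bytes as UTF-8 for display only.
    static String forDisplay(String latin1Name) {
        byte[] raw = latin1Name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes C3 B3 (o with acute accent) as they appear
        // after being read as two ISO-8859-1 chars.
        String asRead = "Canci\u00C3\u00B3n";
        System.out.println(forDisplay(asRead));
    }
}
```

The ISO-8859-1 string would stay the one passed to File and FileInputStream; the UTF-8 form is only for showing to users.
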
>>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
>>>> Why ISO-8859-1?  CJK filenames are guaranteed to fail in that case.  I'd
>>>> rather choose UTF-8, as the default encoding on recent Unix/Linux are all
>>>> UTF-8 so the filenames are likely in UTF-8.
>>>>
>>>> Naoto
>>>>
>>>> Martin Buchholz wrote:
>>>>> Java made the decision to use String as an abstraction
>>>>> for many OS-specific objects, like filenames (or environment variables).
>>>>> Most of the time this works fine, but occasionally you can notice
>>>>> that the underlying OS (in the case of Unix) actually uses
>>>>> arbitrary byte arrays as filenames.
>>>>>
>>>>> It would have been much more confusing to provide an interface
>>>>> to filenames that is sometimes a sequence of char, sometimes a
>>>>> sequence of byte.
>>>>>
>>>>> So this is unlikely to change.
>>>>>
>>>>> But if all you want is reliable reversible conversion,
>>>>> using java -Dfile.encoding=ISO-8859-1
>>>>> should do the trick.
>>>>>
>>>>> Martin
>>>>>
>>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com>
>>>>> wrote:
>>>>>> Sorry if this is the wrong list for this question.  I tried asking it
>>>>>> on comp.lang.java, but didn't get very far there.
>>>>>>
>>>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>>>> programs and rewriting it into a number of other languages.  It
>>>>>> started life in python, and I've recoded it into perl
>>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>>>> Next on my list is java.  After that I'll probably do Haskell and
>>>>>> Eiffel/Sather.
>>>>>>
>>>>>> So the python and perl versions were pretty easy, but I'm finding that
>>>>>> the java version has a somewhat solution-resistant problem with
>>>>>> non-ASCII filenames.
>>>>>>
>>>>>> The program just reads filenames from stdin (usually generated with
>>>>>> the *ix find command), and then compares those files, dividing them up
>>>>>> into equal groups.
>>>>>>
>>>>>> The problem with the java version, which manifests both with OpenJDK
>>>>>> and gcj, is that the filenames being read from disk are 8 bit, and the
>>>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
>>>>>> but as far as the java language is concerned, those filenames are made
>>>>>> up of 16 bit characters.  That's fine, but going from 8 to 16 bit and
>>>>>> back to 8 bit seems to be non-information-preserving in this case,
>>>>>> which isn't so fine - I can clearly see the program, in an strace,
>>>>>> reading with one sequence of bytes, but then trying to open
>>>>>> another-though-related sequence of bytes.  To be perfectly clear: It's
>>>>>> getting file not found errors.
>>>>>>
>>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>>>>>> program to handle files with one encoding, but not another.  I've
>>>>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>>>>> POSIX, UTF-8, and so on.
>>>>>>
>>>>>> Is there such a thing as a filename encoding that will map 8 bit
>>>>>> filenames to 16 bit characters, but only using the low 8 bits of those
>>>>>> 16, and then map back to 8 bit filenames only using those low 8 bits
>>>>>> again?
>>>>>>
>>>>>> Is there some other way of making a Java program on Linux able to read
>>>>>> filenames from stdin and later open those filenames?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>> --
>>>> Naoto Sato
>>>>
>>