<i18n dev> Reading Linux filenames in a way that will map back the same on open?

Sat Sep 13 18:49:33 PDT 2008

It still errors with a file not found:

+ LC_ALL=en_US.ISO8859-1
+ export LC_ALL
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o (Argentina).flac (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at Sortable_file.get_prefix(Sortable_file.java:56)
        at Sortable_file.compareTo(Sortable_file.java:159)
        at Sortable_file.compareTo(Sortable_file.java:1)
        at java.util.Arrays.mergeSort(Arrays.java:1167)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1156)
        at java.util.Arrays.mergeSort(Arrays.java:1156)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.sort(Arrays.java:1079)
        at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1

...and the foo.java program gives:

$ LC_ALL=en_US.ISO8859-1; export LC_ALL; java foo
sun.jnu.encoding=ANSI_X3.4-1968
file.encoding=ANSI_X3.4-1968
default locale=en_US

Thanks folks.
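
One reading of the two outputs above, offered as a sketch rather than a diagnosis: ANSI_X3.4-1968 is the name glibc reports for plain ASCII, so the JVM appears to be converting filenames through ASCII rather than ISO-8859-1 (possibly because the en_US.ISO8859-1 locale is not generated on this system; `locale -a` would list the ones that are). The class below shows how an ASCII decode/encode cycle would turn the two UTF-8 bytes of an accented letter into the `??` seen in the exception message. The class name and the `roundTrip` helper are made up for illustration, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiRoundTrip {
    // Decode raw filename bytes to a String and encode them back with the
    // same charset, mimicking the read-name-then-open-file path.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // "Cancion" with an o-acute; in UTF-8 that letter is 0xC3 0xB3.
        byte[] onDisk = "Canci\u00f3n".getBytes(StandardCharsets.UTF_8);
        byte[] reopened = roundTrip(onDisk, StandardCharsets.US_ASCII);
        // Each non-ASCII byte decodes to U+FFFD, which encodes back as '?'.
        System.out.println(new String(reopened, StandardCharsets.US_ASCII));
    }
}
```

This should print `Canci??n`, which is a different byte sequence from the one on disk, hence the file not found.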

Xueming Shen wrote:
>
> Martin, don't trap people into using -Dfile.encoding, always treat it 
> as a read only property:-)
>
> I believe initializeEncoding(env) gets invoked before -Dxyz=abc
> overwrites the default one;
> besides, the "jnu encoding" was introduced in 6.0, so we no longer look
> at file.encoding since then, I believe
> you "ARE" the reviewer:-)
>
> Dan, I kind of feel, switch the locale to an ISO8859-1 locale in your
> wrapper, for example
>
> LC_ALL=en_US.ISO8859-1; export LC_ALL; java -Xmx512M -jar equivs.jar equivs.main
>
> should work, if it does not, can you try to run
>
> LC_ALL=en_US.ISO8859-1; export LC_ALL; java Foo
>
> with Foo.java
>
> System.out.println("sun.jnu.encoding=" + 
> System.getProperty("sun.jnu.encoding"));
> System.out.println("file.encoding=" + 
> System.getProperty("file.encoding"));
> System.out.println("default locale=" + java.util.Locale.getDefault());
>
> Let us know the result?
>
> sherman
>
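
For anyone wanting to repeat the check, the three statements above can be wrapped into a complete class along these lines. This is a minimal sketch: the `describe` helper is an addition for illustration, and `sun.jnu.encoding` is an implementation-specific property that may be absent (printed as `null`) on JVMs that do not define it.

```java
import java.util.Locale;

class Foo {
    // Collect the encoding-related settings the JVM picked up at startup.
    static String describe() {
        return "sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding") + "\n"
             + "file.encoding=" + System.getProperty("file.encoding") + "\n"
             + "default locale=" + Locale.getDefault() + "\n";
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it once per candidate LC_ALL value shows whether the JVM actually picked up the intended charset.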
>
> Martin Buchholz wrote:
>> On Wed, Sep 10, 2008 at 17:50, Dan Stromberg <strombrg at gmail.com> wrote:
>>  
>>> Would you believe that I'm getting file not found errors even with
>>> ISO-8859-1?
>>>     
>>
>> The software world is full of surprises.
>>
>> Try
>> export LANG=C LC_ALL=C LC_CTYPE=C
>> java ... -Dfile.encoding=ISO-8859-1 ...
>>
>> You could also be explicit about the
>> encoding used when doing any kind of char<->byte
>> conversion, e.g. reading from stdin or writing to stdout.
>>
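
Being explicit about the stdin encoding might look like the sketch below. The class and method names are hypothetical, and it only pins down the byte-to-char step when reading names; the char-to-byte step when the JVM later opens the file still follows the platform encoding, so this alone may not cure the file not found errors.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class NameReader {
    // Read one filename per line, decoding every byte as ISO-8859-1 no matter
    // what the locale or file.encoding says.
    static List<String> readNames(InputStream in) {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = readNames(System.in);
        System.err.println("read " + names.size() + " names");
    }
}
```

Note that readLine() also treats a carriage return as a line terminator, so a name containing byte 0x0D would be split.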
>> Oh, and this applies only to traditional Unix systems like
>> Linux and Solaris.  Windows and MacOSX
>> (at least should) act very differently in this area.
>>
>> Martin
>>
>>  
>>> (Naoto: My program doesn't know what encoding to expect - I'm afraid I
>>> probably have different applications writing filenames in different
>>> encodings on my Ubuntu system.  I'd been thinking I wanted to treat
>>> filenames as just a sequence of bytes, and let the terminal emulator
>>> interpret the encoding (hopefully) correctly on output).
>>>
>>>
>>>
>>> This gives two file not found tracebacks:
>>>
>>> export LC_ALL='ISO-8859-1'
>>> export LC_CTYPE="$LC_ALL"
>>> export LANG="$LC_ALL"
>>>
>>> find 'test-files' -type f -print | java -Xmx512M \
>>>     -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>>>
>>> find ~/Sound/Music -type f -print | java -Xmx512M \
>>>     -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>>>
>>>
>>>
>>> I'm reading the filenames like this:
>>>
>>> try
>>> {
>>>     while ((line = stdin.readLine()) != null)
>>>     {
>>>         // System.out.println(line);
>>>         // System.out.flush();
>>>         lst.add(new Sortable_file(line));
>>>     }
>>> }
>>> catch (java.io.IOException e)
>>> {
>>>     System.err.println("**** exception " + e);
>>>     e.printStackTrace();
>>> }
>>>
>>>
>>>
>>> Where Sortable_file's constructor just looks like:
>>>
>>>   public Sortable_file(String filename)
>>>      {
>>>      this.filename = filename;
>>>      /*
>>>      Java doesn't have a stat function without doing some fancy stuff,
>>>      so we skip this optimization.  It really only helps with hard
>>>      links anyway.
>>>      this.device = -1
>>>      this.inode = -1
>>>      */
>>>      File file = new File(this.filename);
>>>      this.size = file.length();
>>>      // It bothers a little that we can't close this, but perhaps it's
>>>      // unnecessary.  That'll be determined in large tests.
>>>      // file.close();
>>>      this.have_prefix = false;
>>>      this.have_hash = false;
>>>      }
>>>
>>>
>>>
>>> ..and the part that actually blows up looks like:
>>>
>>>   private void get_prefix()
>>>      {
>>>      byte[] buffer = new byte[128];
>>>      try
>>>         {
>>>         // The next line is the one that gives file not found
>>>         FileInputStream file = new FileInputStream(this.filename);
>>>         file.read(buffer);
>>>         // System.out.println("this.prefix.length " + this.prefix.length);
>>>         file.close();
>>>         }
>>>      catch (IOException ioe)
>>>         {
>>>         // System.out.println( "IO error: " + ioe );
>>>         ioe.printStackTrace();
>>>         System.exit(1);
>>>         }
>>>      this.prefix = new String(buffer);
>>>      this.have_prefix = true;
>>>      }
>>>
>>>
>>>
>>> Interestingly, it's already tried to get the file's length without an
>>> error when it goes to read data from the file and has trouble.
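
The missing error from the length call is expected: java.io.File.length() is documented to return 0L when the file does not exist, rather than throwing. So the constructor cannot reveal a bad name, and the first real failure comes from FileInputStream. A small sketch (the class name, the helper, and the path are invented for the demonstration, and the path is assumed not to exist):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class MissingFileDemo {
    // Report whether the file can actually be opened for reading.
    static boolean canOpen(File f) {
        try (FileInputStream in = new FileInputStream(f)) {
            return true;
        } catch (FileNotFoundException e) {
            return false;
        } catch (IOException e) {
            // Only close() can raise this here.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical path, assumed not to exist on the machine running this.
        File missing = new File("no-such-directory-for-demo", "no-such-file.flac");
        System.out.println(missing.length()); // 0, with no exception
        System.out.println(canOpen(missing)); // false
    }
}
```

Checking file.exists() in the constructor would surface a mangled name at the point where it is read.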
>>>
>>> I don't -think- I'm doing anything screwy in there - could it be that
>>> ISO-8859-1 isn't giving good round-trip conversions in practice?  Would
>>> this be an attribute of the java runtime in question, or could it be a
>>> matter of the locale files on my Ubuntu system being a little off?  It
>>> would seem the locale files would be a better explanation (or a bug in
>>> my program I'm not seeing!), since I get the same errors with both
>>> OpenJDK and gcj.
>>>
>>> Martin Buchholz wrote:
>>>    
>>>> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
>>>> guaranteeing no loss of data, and no apparently impossible situations
>>>> where the JDK gives you a list of files in a directory, but you get
>>>> File not found when you try to open them.
>>>>
>>>> If you want to show the file names to users, you can always take
>>>> your ISO-8859-1 decoded strings, turn them back into byte[],
>>>> and decode using UTF-8 later, if you so desired.
>>>> (The basic OS interfaces in the JDK are not so flexible.
>>>> They are hard-coded to use the one charset specified by file.encoding)
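
The display step described here could be sketched as follows, assuming the names on disk happen to be UTF-8 and were read in as ISO-8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class DisplayName {
    // Recover the original bytes from an ISO-8859-1 decoded name and
    // reinterpret them as UTF-8 for showing to a user.
    static String forDisplay(String latin1Name) {
        byte[] raw = latin1Name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Nino" with an n-tilde, as the UTF-8 bytes a filesystem might hold.
        byte[] onDisk = "Ni\u00f1o".getBytes(StandardCharsets.UTF_8);
        // What the program would hold internally: one char per byte.
        String internal = new String(onDisk, StandardCharsets.ISO_8859_1);
        System.out.println(internal.length());                     // 5
        System.out.println(forDisplay(internal).equals("Ni\u00f1o")); // true
    }
}
```

The internal string stays usable for opening the file; only the copy shown to the user is re-decoded. Bytes that are not valid UTF-8 would come out as U+FFFD in the display copy.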
>>>>
>>>> Martin
>>>>
>>>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
>>>>      
>>>>> Why ISO-8859-1?  CJK filenames are guaranteed to fail in that case.
>>>>> I'd rather choose UTF-8, as the default encodings on recent Unix/Linux
>>>>> systems are all UTF-8, so the filenames are likely in UTF-8.
>>>>>
>>>>> Naoto
>>>>>
>>>>> Martin Buchholz wrote:
>>>>>        
>>>>>> Java made the decision to use String as an abstraction
>>>>>> for many OS-specific objects, like filenames (or environment variables).
>>>>>> Most of the time this works fine, but occasionally you can notice
>>>>>> that the underlying OS (in the case of Unix) actually uses
>>>>>> arbitrary byte arrays as filenames.
>>>>>>
>>>>>> It would have been much more confusing to provide an interface
>>>>>> to filenames that is sometimes a sequence of char, sometimes a
>>>>>> sequence of byte.
>>>>>>
>>>>>> So this is unlikely to change.
>>>>>>
>>>>>> But if all you want is reliable reversible conversion,
>>>>>> using java -Dfile.encoding=ISO-8859-1
>>>>>> should do the trick.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg
>>>>>> <dstromberglists at gmail.com> wrote:
>>>>>>          
>>>>>>> Sorry if this is the wrong list for this question.  I tried asking
>>>>>>> it on comp.lang.java, but didn't get very far there.
>>>>>>>
>>>>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>>>>> programs and rewriting it into a number of other languages.  It
>>>>>>> started life in python, and I've recoded it into perl
>>>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>>>>> Next on my list is java.  After that I'll probably do Haskell and
>>>>>>> Eiffel/Sather.
>>>>>>>
>>>>>>> So the python and perl versions were pretty easy, but I'm finding
>>>>>>> that the java version has a somewhat solution-resistant problem
>>>>>>> with non-ASCII filenames.
>>>>>>>
>>>>>>> The program just reads filenames from stdin (usually generated with
>>>>>>> the *ix find command), and then compares those files, dividing them
>>>>>>> up into equal groups.
>>>>>>>
>>>>>>> The problem with the java version, which manifests both with OpenJDK
>>>>>>> and gcj, is that the filenames being read from disk are 8 bit, and
>>>>>>> the filenames opened by the OpenJDK JVM or gcj-compiled binary are
>>>>>>> 8 bit, but as far as the java language is concerned, those filenames
>>>>>>> are made up of 16 bit characters.  That's fine, but going from 8 to
>>>>>>> 16 bit and back to 8 bit seems to be non-information-preserving in
>>>>>>> this case, which isn't so fine - I can clearly see the program, in an
>>>>>>> strace, reading with one sequence of bytes, but then trying to open
>>>>>>> another-though-related sequence of bytes.  To be perfectly clear:
>>>>>>> It's getting file not found errors.
>>>>>>>
>>>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get
>>>>>>> the program to handle files with one encoding, but not another.  I've
>>>>>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>>>>>> POSIX, UTF-8, and so on.
>>>>>>>
>>>>>>> Is there such a thing as a filename encoding that will map 8 bit
>>>>>>> filenames to 16 bit characters, but only using the low 8 bits of
>>>>>>> those 16, and then map back to 8 bit filenames only using those low
>>>>>>> 8 bits again?
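
Java's ISO-8859-1 charset has exactly this property when it is applied explicitly: byte value n becomes the char with code n, and back again. The check below is a sketch with a made-up class name. It exercises only the charset itself, not whatever conversion the JVM applies when it hands a filename to the OS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1Identity {
    // True if every byte value 0..255 decodes to the char with the same
    // numeric value and encodes back to the original byte.
    static boolean isIdentity() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(isIdentity());
    }
}
```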
>>>>>>>
>>>>>>> Is there some other way of making a Java program on Linux able to
>>>>>>> read filenames from stdin and later open those filenames?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>             
>>>>> -- 
>>>>> Naoto Sato
>>>>>
>>>>>         
>>>     
>