<i18n dev> Reading Linux filenames in a way that will map back the same on open?
Xueming Shen
Xueming.Shen at Sun.COM
Sat Sep 13 20:39:31 PDT 2008
Obviously your locale setting is not being "exported"... what "shell" are
you using?
You can try setting your locale to en_US.ISO8859-1 explicitly at the command
line first, then type "locale" to confirm that your locale really is set to
en_US.ISO8859-1, and then run the "find + java" to see if that FNF error
disappears. If it does not, run the java Foo again and tell us the result :-)
One possibility is that you don't have an ISO8859-1 locale installed at all?
Sherman
Dan Stromberg wrote:
>
> It still fails with a file not found error:
>
> + LC_ALL=en_US.ISO8859-1
> + export LC_ALL
> + find /home/dstromberg/Sound/Music -type f -print
> + java -Xmx512M -jar equivs.jar equivs.main
> java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various
> Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o
> (Argentina).flac (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:106)
> at java.io.FileInputStream.<init>(FileInputStream.java:66)
> at Sortable_file.get_prefix(Sortable_file.java:56)
> at Sortable_file.compareTo(Sortable_file.java:159)
> at Sortable_file.compareTo(Sortable_file.java:1)
> at java.util.Arrays.mergeSort(Arrays.java:1167)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.mergeSort(Arrays.java:1156)
> at java.util.Arrays.mergeSort(Arrays.java:1156)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.mergeSort(Arrays.java:1155)
> at java.util.Arrays.sort(Arrays.java:1079)
> at equivs.main(equivs.java:40)
> make: *** [wrapped] Error 1
>
> ...and the foo.java program gives:
>
> $ LC_ALL=en_US.ISO8859-1; export LC_ALL; java foo
> sun.jnu.encoding=ANSI_X3.4-1968
> file.encoding=ANSI_X3.4-1968
> default locale=en_US
>
> Thanks folks.
>
> Xueming Shen wrote:
>>
>> Martin, don't trap people into using -Dfile.encoding; always treat it
>> as a read-only property :-)
>>
>> I believe initializeEncoding(env) gets invoked before -Dxyz=abc
>> overwrites the default one. Besides, the "jnu encoding" was introduced
>> in 6.0, so we no longer look at file.encoding since then; I believe
>> you "ARE" the reviewer :-)
>>
>> Dan, I have a feeling that switching the locale to an ISO8859-1 locale in
>> your wrapper, for example
>>
>> LC_ALL=en_US.ISO8859-1; export LC_ALL; java -Xmx512M -jar equivs.jar
>> equivs.main
>>
>> should work. If it does not, can you try running
>>
>> LC_ALL=en_US.ISO8859-1; export LC_ALL; java Foo
>>
>> with Foo.java
>>
>> System.out.println("sun.jnu.encoding=" +
>> System.getProperty("sun.jnu.encoding"));
>> System.out.println("file.encoding=" +
>> System.getProperty("file.encoding"));
>> System.out.println("default locale=" + java.util.Locale.getDefault());
>>
>> Let us know the result?
>>
>> sherman
>>
>>
>> Martin Buchholz wrote:
>>> On Wed, Sep 10, 2008 at 17:50, Dan Stromberg <strombrg at gmail.com>
>>> wrote:
>>>
>>>> Would you believe that I'm getting file not found errors even with
>>>> ISO-8859-1?
>>>>
>>>
>>> The software world is full of surprises.
>>>
>>> Try
>>> export LANG=C LC_ALL=C LC_CTYPE=C
>>> java ... -Dfile.encoding=ISO-8859-1 ...
>>>
>>> You could also be explicit about the
>>> encoding used when doing any kind of char<->byte
>>> conversion, e.g. reading from stdin or writing to stdout.
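>>>
>>> For example, something along these lines (an untested sketch; the class
>>> name is just for illustration) pins the stdin/stdout conversion to
>>> ISO-8859-1 no matter what the default charset is:
>>>
>>> import java.io.*;
>>>
>>> public class EchoLatin1 {
>>>     public static void main(String[] args) throws IOException {
>>>         // Decode incoming bytes as ISO-8859-1 explicitly, not via file.encoding
>>>         BufferedReader stdin = new BufferedReader(
>>>             new InputStreamReader(System.in, "ISO-8859-1"));
>>>         // Encode outgoing chars as ISO-8859-1 explicitly as well
>>>         PrintStream stdout = new PrintStream(System.out, true, "ISO-8859-1");
>>>         String line;
>>>         while ((line = stdin.readLine()) != null) {
>>>             stdout.println(line);
>>>         }
>>>     }
>>> }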
>>>
>>> Oh, and this applies only to traditional Unix systems like
>>> Linux and Solaris. Windows and MacOSX
>>> (at least, they should) act very differently in this area.
>>>
>>> Martin
>>>
>>>
>>>> (Naoto: My program doesn't know what encoding to expect - I'm afraid I
>>>> probably have different applications writing filenames in different
>>>> encodings on my Ubuntu system. I'd been thinking I wanted to treat
>>>> filenames as just a sequence of bytes, and let the terminal emulator
>>>> interpret the encoding (hopefully) correctly on output).
>>>>
>>>>
>>>>
>>>> This gives two file not found tracebacks:
>>>>
>>>> export LC_ALL='ISO-8859-1'
>>>> export LC_CTYPE="$LC_ALL"
>>>> export LANG="$LC_ALL"
>>>>
>>>> find 'test-files' -type f -print | java -Xmx512M
>>>> -Dfile.encoding=ISO-8859-1
>>>> -jar equivs.jar equivs.main
>>>>
>>>> find ~/Sound/Music -type f -print | java -Xmx512M
>>>> -Dfile.encoding=ISO-8859-1
>>>> -jar equivs.jar equivs.main
>>>>
>>>>
>>>>
>>>> I'm reading the filenames like this:
>>>>
>>>> try {
>>>>     while ((line = stdin.readLine()) != null) {
>>>>         // System.out.println(line);
>>>>         // System.out.flush();
>>>>         lst.add(new Sortable_file(line));
>>>>     }
>>>> }
>>>> catch (java.io.IOException e) {
>>>>     System.err.println("**** exception " + e);
>>>>     e.printStackTrace();
>>>> }
>>>>
>>>>
>>>>
>>>> Where Sortable_file's constructor just looks like:
>>>>
>>>> public Sortable_file(String filename)
>>>> {
>>>>     this.filename = filename;
>>>>     /*
>>>>      * Java doesn't have a stat function without doing some fancy stuff,
>>>>      * so we skip this optimization. It really only helps with hard links
>>>>      * anyway.
>>>>      * this.device = -1
>>>>      * this.inode = -1
>>>>      */
>>>>     File file = new File(this.filename);
>>>>     this.size = file.length();
>>>>     // It bothers me a little that we can't close this, but perhaps it's
>>>>     // unnecessary. That'll be determined in large tests.
>>>>     // file.close();
>>>>     this.have_prefix = false;
>>>>     this.have_hash = false;
>>>> }
>>>>
>>>>
>>>>
>>>> ...and the part that actually blows up looks like:
>>>>
>>>> private void get_prefix()
>>>> {
>>>>     byte[] buffer = new byte[128];
>>>>     try
>>>>     {
>>>>         // The next line is the one that gives file not found
>>>>         FileInputStream file = new FileInputStream(this.filename);
>>>>         file.read(buffer);
>>>>         // System.out.println("this.prefix.length " + this.prefix.length);
>>>>         file.close();
>>>>     }
>>>>     catch (IOException ioe)
>>>>     {
>>>>         // System.out.println("IO error: " + ioe);
>>>>         ioe.printStackTrace();
>>>>         System.exit(1);
>>>>     }
>>>>     this.prefix = new String(buffer);
>>>>     this.have_prefix = true;
>>>> }
>>>>
>>>>
>>>>
>>>> Interestingly, it has already gotten the file's length without an error
>>>> by the time it goes to read data from the file and runs into trouble.
>>>>
>>>> I don't -think- I'm doing anything screwy in there - could it be that
>>>> ISO-8859-1 isn't giving good round-trip conversions in practice?
>>>> Would this
>>>> be an attribute of the java runtime in question, or could it be a
>>>> matter of
>>>> the locale files on my Ubuntu system being a little off? It would
>>>> seem the
>>>> locale files would be a better explanation (or a bug in my program
>>>> I'm not
>>>> seeing!), since I get the same errors with both OpenJDK and gcj.
>>>>
>>>> Martin Buchholz wrote:
>>>>
>>>>> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
>>>>> guaranteeing no loss of data and no apparently impossible situations
>>>>> where the JDK gives you a list of files in a directory, but you get
>>>>> File not found when you try to open them.
>>>>>
>>>>> If you want to show the file names to users, you can always take
>>>>> your ISO-8859-1 decoded strings, turn them back into byte[],
>>>>> and decode using UTF-8 later, if you so desire.
>>>>> (The basic OS interfaces in the JDK are not so flexible.
>>>>> They are hard-coded to use the one charset specified by
>>>>> file.encoding)
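>>>>>
>>>>> Roughly like this (just a sketch; "name" here stands for a filename
>>>>> String that was decoded as ISO-8859-1, and the
>>>>> UnsupportedEncodingException handling is left out):
>>>>>
>>>>>     byte[] rawBytes = name.getBytes("ISO-8859-1");     // the original bytes, unchanged
>>>>>     String forDisplay = new String(rawBytes, "UTF-8"); // re-decode only for display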
>>>>>
>>>>> Martin
>>>>>
>>>>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
>>>>>
>>>>>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case.
>>>>>> I'd rather choose UTF-8, as the default encoding on recent Unix/Linux
>>>>>> systems is UTF-8, so the filenames are likely in UTF-8.
>>>>>>
>>>>>> Naoto
>>>>>>
>>>>>> Martin Buchholz wrote:
>>>>>>
>>>>>>> Java made the decision to use String as an abstraction
>>>>>>> for many OS-specific objects, like filenames (or environment
>>>>>>> variables).
>>>>>>> Most of the time this works fine, but occasionally you can notice
>>>>>>> that the underlying OS (in the case of Unix) actually uses
>>>>>>> arbitrary byte arrays as filenames.
>>>>>>>
>>>>>>> It would have been much more confusing to provide an interface
>>>>>>> to filenames that is sometimes a sequence of char, sometimes a
>>>>>>> sequence of byte.
>>>>>>>
>>>>>>> So this is unlikely to change.
>>>>>>>
>>>>>>> But if all you want is reliable reversible conversion,
>>>>>>> using java -Dfile.encoding=ISO-8859-1
>>>>>>> should do the trick.
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg
>>>>>>> <dstromberglists at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry if this is the wrong list for this question. I tried
>>>>>>>> asking it
>>>>>>>> on comp.lang.java, but didn't get very far there.
>>>>>>>>
>>>>>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>>>>>> programs and rewriting it into a number of other languages. It
>>>>>>>> started life in python, and I've recoded it into perl
>>>>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>>>>>>
>>>>>>>> Next on my list is java. After that I'll probably do Haskell and
>>>>>>>> Eiffel/Sather.
>>>>>>>>
>>>>>>>> So the python and perl versions were pretty easy, but I'm
>>>>>>>> finding that
>>>>>>>> the java version has a somewhat solution-resistant problem with
>>>>>>>> non-ASCII filenames.
>>>>>>>>
>>>>>>>> The program just reads filenames from stdin (usually generated
>>>>>>>> with
>>>>>>>> the *ix find command), and then compares those files, dividing
>>>>>>>> them up
>>>>>>>> into equal groups.
>>>>>>>>
>>>>>>>> The problem with the java version, which manifests both with
>>>>>>>> OpenJDK
>>>>>>>> and gcj, is that the filenames being read from disk are 8 bit,
>>>>>>>> and the
>>>>>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are
>>>>>>>> 8 bit,
>>>>>>>> but as far as the java language is concerned, those filenames
>>>>>>>> are made
>>>>>>>> up of 16 bit characters. That's fine, but going from 8 to 16
>>>>>>>> bit and
>>>>>>>> back to 8 bit seems to be non-information-preserving in this case,
>>>>>>>> which isn't so fine - I can clearly see the program, in an strace,
>>>>>>>> reading with one sequence of bytes, but then trying to open
>>>>>>>> another-though-related sequence of bytes. To be perfectly
>>>>>>>> clear: It's
>>>>>>>> getting file not found errors.
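>>>>>>>>
>>>>>>>> (Roughly what I mean, as a sketch; the filename and charset names here
>>>>>>>> are hypothetical and exception handling is omitted:
>>>>>>>>
>>>>>>>>     byte[] onDisk = "caf\u00e9.txt".getBytes("UTF-8");  // the bytes find prints
>>>>>>>>     String decoded = new String(onDisk, "US-ASCII");    // 8 -> 16 bit; non-ASCII bytes become U+FFFD
>>>>>>>>     byte[] reopened = decoded.getBytes("US-ASCII");     // 16 -> 8 bit; U+FFFD becomes '?'
>>>>>>>>
>>>>>>>> so the bytes handed back to open() are no longer the bytes that were on
>>>>>>>> disk.)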
>>>>>>>>
>>>>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can
>>>>>>>> get the
>>>>>>>> program to handle files with one encoding, but not another. I've
>>>>>>>> tried a bunch of values in these variables, including
>>>>>>>> ISO-8859-1, C,
>>>>>>>> POSIX, UTF-8, and so on.
>>>>>>>>
>>>>>>>> Is there such a thing as a filename encoding that will map 8 bit
>>>>>>>> filenames to 16 bit characters, but only using the low 8 bits
>>>>>>>> of those
>>>>>>>> 16, and then map back to 8 bit filenames only using those low 8
>>>>>>>> bits
>>>>>>>> again?
>>>>>>>>
>>>>>>>> Is there some other way of making a Java program on Linux able
>>>>>>>> to read
>>>>>>>> filenames from stdin and later open those filenames?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>> --
>>>>>> Naoto Sato
>>>>>>
>>>>>>
>>>>
>>
>