<i18n dev> Reading Linux filenames in a way that will map back the same on open?
Dan Stromberg
dstromberglists at gmail.com
Fri Sep 12 12:05:36 PDT 2008
Since the thread seems to be trailing off... Does anyone know of any
mailing lists that might be more appropriate for this question?
Also, is there another OS I should try (perhaps in a little QEMU VM) for
a point of comparison? Preferably something that also uses 8-bit
filenames but has very different localization data and code, apart
from the Java runtimes themselves. Does FreeBSD fit this
description?
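For such a comparison, a small probe along these lines might help (a sketch of mine, assuming only the standard library; the directory argument is just an example): it lists a directory the same way the JVM would and counts how many of the returned names fail to reopen.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Probe: list a directory and try to reopen every regular file the JVM
// hands back, counting round-trip failures ("file not found" despite the
// JVM itself having produced the name).
public class RoundTripProbe {
    public static int countFailures(String dir) {
        int failures = 0;
        File[] entries = new File(dir).listFiles();
        if (entries == null) {
            return 0; // not a directory, or not readable
        }
        for (File entry : entries) {
            if (!entry.isFile()) {
                continue;
            }
            try {
                // Opening succeeds only if the name survived the
                // byte -> String -> byte round trip.
                FileInputStream in = new FileInputStream(entry.getPath());
                in.close();
            } catch (IOException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : ".";
        System.out.println(countFailures(dir) + " round-trip failures under " + dir);
    }
}
```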
On Wed, Sep 10, 2008 at 5:52 PM, Dan Stromberg
<dstromberglists at gmail.com> wrote:
>
> Would you believe that I'm getting file not found errors even with
> ISO-8859-1?
>
> (Naoto: My program doesn't know what encoding to expect - I'm afraid I
> probably have different applications writing filenames in different
> encodings on my Ubuntu system. I'd been thinking I wanted to treat
> filenames as just a sequence of bytes, and let the terminal emulator
> interpret the encoding (hopefully) correctly on output).
>
>
>
> This gives two "file not found" tracebacks:
>
> export LC_ALL='ISO-8859-1'
> export LC_CTYPE="$LC_ALL"
> export LANG="$LC_ALL"
>
> find 'test-files' -type f -print | java -Xmx512M \
>     -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>
> find ~/Sound/Music -type f -print | java -Xmx512M \
>     -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>
>
>
> I'm reading the filenames like this:
>
> try {
>     while ((line = stdin.readLine()) != null) {
>         // System.out.println(line);
>         // System.out.flush();
>         lst.add(new Sortable_file(line));
>     }
> } catch (java.io.IOException e) {
>     System.err.println("**** exception " + e);
>     e.printStackTrace();
> }
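For what it's worth, a byte-safe variant of that read loop (just a sketch; the class and method names are mine, not from the program above) would decode stdin explicitly as ISO-8859-1, so that a later getBytes("ISO-8859-1") recovers exactly the bytes find printed, regardless of what file.encoding is set to:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Decode stdin explicitly as ISO-8859-1: every input byte maps to exactly
// one char, so String.getBytes("ISO-8859-1") reproduces the original bytes.
public class ByteSafeReader {
    public static List<String> readLines(InputStream in) throws IOException {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, "ISO-8859-1"));
        List<String> lines = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }
}
```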
>
>
>
> Where Sortable_file's constructor just looks like:
>
> public Sortable_file(String filename)
> {
>     this.filename = filename;
>     /*
>       Java doesn't have a stat function without doing some fancy stuff,
>       so we skip this optimization. It really only helps with hard links
>       anyway.
>       this.device = -1
>       this.inode = -1
>     */
>     File file = new File(this.filename);
>     this.size = file.length();
>     // It bothers me a little that we can't close this, but perhaps it's
>     // unnecessary. That'll be determined in large tests.
>     // file.close();
>     this.have_prefix = false;
>     this.have_hash = false;
> }
>
>
>
> ...and the part that actually blows up looks like:
>
> private void get_prefix()
> {
>     byte[] buffer = new byte[128];
>     try
>     {
>         // The next line is the one that gives file not found
>         FileInputStream file = new FileInputStream(this.filename);
>         file.read(buffer);
>         // System.out.println("this.prefix.length " + this.prefix.length);
>         file.close();
>     }
>     catch (IOException ioe)
>     {
>         // System.out.println("IO error: " + ioe);
>         ioe.printStackTrace();
>         System.exit(1);
>     }
>     this.prefix = new String(buffer);
>     this.have_prefix = true;
> }
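Separate from the file-not-found error, that snippet has a latent bug: new String(buffer) decodes all 128 bytes (including unread trailing zeros) with the platform default charset, and the count returned by read() is ignored. A sketch of a fixed version (the class and method names are mine, invented for illustration):

```java
import java.io.FileInputStream;
import java.io.IOException;

// Read up to 128 bytes of a file's prefix, using the actual byte count
// returned by read() and an explicit charset, so neither trailing zeros
// nor the platform default encoding leak into the prefix string.
public class PrefixReader {
    public static String readPrefix(String filename) throws IOException {
        byte[] buffer = new byte[128];
        FileInputStream in = new FileInputStream(filename);
        try {
            int n = in.read(buffer);   // may be < 128, or -1 on an empty file
            if (n < 0) {
                n = 0;
            }
            return new String(buffer, 0, n, "ISO-8859-1");
        } finally {
            in.close();
        }
    }
}
```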
>
>
>
> Interestingly, it has already gotten the file's length without an error by
> the time it tries to read data from the file and runs into trouble.
>
> I don't -think- I'm doing anything screwy in there - could it be that
> ISO-8859-1 isn't giving good round-trip conversions in practice? Would this
> be an attribute of the java runtime in question, or could it be a matter of
> the locale files on my Ubuntu system being a little off? It would seem the
> locale files would be a better explanation (or a bug in my program I'm not
> seeing!), since I get the same errors with both OpenJDK and gcj.
>
> Martin Buchholz wrote:
>>
>> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
>> guaranteeing no loss of data and no apparently impossible situations
>> where the JDK gives you a list of files in a directory, but you get
>> "file not found" when you try to open them.
>>
>> If you want to show the file names to users, you can always take
>> your ISO-8859-1-decoded strings, turn them back into byte[],
>> and decode using UTF-8 later, if you so desire.
>> (The basic OS interfaces in the JDK are not so flexible.
>> They are hard-coded to use the one charset specified by file.encoding)
>>
>> Martin
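Concretely, the round trip Martin describes can be sketched like this (my sketch; the helper names are invented for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// Any byte sequence survives bytes -> ISO-8859-1 String -> bytes unchanged,
// because ISO-8859-1 maps every byte value 0x00-0xFF to one char and back.
// The preserved bytes can later be re-decoded as UTF-8 purely for display.
public class RoundTrip {
    public static boolean survives(byte[] raw) throws UnsupportedEncodingException {
        String asLatin1 = new String(raw, "ISO-8859-1"); // 1 byte -> 1 char
        byte[] back = asLatin1.getBytes("ISO-8859-1");   // 1 char -> 1 byte
        return Arrays.equals(raw, back);
    }

    public static String forDisplay(String latin1Decoded) throws UnsupportedEncodingException {
        // Recover the original bytes, then interpret them as UTF-8 for users.
        return new String(latin1Decoded.getBytes("ISO-8859-1"), "UTF-8");
    }
}
```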
>>
>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
>>>
>>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd
>>> rather choose UTF-8, as the default encoding on recent Unix/Linux systems
>>> is UTF-8, so the filenames are likely in UTF-8.
>>>
>>> Naoto
>>>
>>> Martin Buchholz wrote:
>>>>
>>>> Java made the decision to use String as an abstraction
>>>> for many OS-specific objects, like filenames (or environment variables).
>>>> Most of the time this works fine, but occasionally you can notice
>>>> that the underlying OS (in the case of Unix) actually uses
>>>> arbitrary byte arrays as filenames.
>>>>
>>>> It would have been much more confusing to provide an interface
>>>> to filenames that is sometimes a sequence of char, sometimes a
>>>> sequence of byte.
>>>>
>>>> So this is unlikely to change.
>>>>
>>>> But if all you want is reliable reversible conversion,
>>>> using java -Dfile.encoding=ISO-8859-1
>>>> should do the trick.
>>>>
>>>> Martin
>>>>
>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com>
>>>> wrote:
>>>>>
>>>>> Sorry if this is the wrong list for this question. I tried asking it
>>>>> on comp.lang.java, but didn't get very far there.
>>>>>
>>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>>> programs and rewriting it into a number of other languages. It
>>>>> started life in Python, and I've recoded it into Perl
>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>>> Next on my list is Java. After that I'll probably do Haskell and
>>>>> Eiffel/Sather.
>>>>>
>>>>> So the Python and Perl versions were pretty easy, but I'm finding that
>>>>> the Java version has a somewhat solution-resistant problem with
>>>>> non-ASCII filenames.
>>>>>
>>>>> The program just reads filenames from stdin (usually generated with
>>>>> the *ix find command), and then compares those files, dividing them up
>>>>> into equal groups.
>>>>>
>>>>> The problem with the Java version, which manifests both with OpenJDK
>>>>> and gcj, is that the filenames being read from disk are 8-bit, and the
>>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8-bit,
>>>>> but as far as the Java language is concerned, those filenames are made
>>>>> up of 16-bit characters. That's fine, but going from 8 to 16 bits and
>>>>> back to 8 bits seems to be non-information-preserving in this case,
>>>>> which isn't so fine - I can clearly see the program, in an strace,
>>>>> reading one sequence of bytes but then trying to open
>>>>> another-though-related sequence of bytes. To be perfectly clear: it's
>>>>> getting file not found errors.
>>>>>
>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>>>>> program to handle files with one encoding, but not another. I've
>>>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>>>> POSIX, UTF-8, and so on.
>>>>>
>>>>> Is there such a thing as a filename encoding that will map 8-bit
>>>>> filenames to 16-bit characters using only the low 8 bits of those
>>>>> 16, and then map back to 8-bit filenames using only those low 8 bits
>>>>> again?
>>>>>
>>>>> Is there some other way of making a Java program on Linux able to read
>>>>> filenames from stdin and later open those filenames?
>>>>>
>>>>> Thanks!
>>>>>
>>>
>>> --
>>> Naoto Sato
>>>
>
>
>