<i18n dev> Reading Linux filenames in a way that will map back the same on open?
Dan Stromberg
strombrg at gmail.com
Wed Sep 10 17:50:49 PDT 2008
Would you believe that I'm getting file not found errors even with
ISO-8859-1?
(Naoto: My program doesn't know what encoding to expect - I'm afraid I
probably have different applications writing filenames in different
encodings on my Ubuntu system. I'd been thinking I wanted to treat
filenames as just a sequence of bytes, and let the terminal emulator
interpret the encoding (hopefully) correctly on output.)
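For what it's worth, the treat-it-as-bytes idea can be prototyped
without any Reader at all. Here's a rough sketch of mine - the class
name and details are made up, nothing from equivs - that splits stdin
on newline bytes and converts to String, using ISO-8859-1 purely as a
byte-for-byte carrier, only at the java.io.File boundary:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

public class ByteFilenames
{
    public static void main(String[] args) throws IOException
    {
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        int b;
        while ((b = System.in.read()) != -1)
        {
            if (b == '\n')
            {
                // ISO-8859-1 maps every byte 0x00-0xff to the char
                // with the same numeric value, so nothing is lost in
                // this decode.
                String filename =
                    new String(name.toByteArray(), "ISO-8859-1");
                System.out.println(new File(filename).length());
                name.reset();
            }
            else
            {
                name.write(b);
            }
        }
    }
}

Whether the JVM re-encodes that String back to the same bytes at
open() time is, of course, exactly the question.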
This gives two file not found tracebacks:
export LC_ALL='ISO-8859-1'
export LC_CTYPE="$LC_ALL"
export LANG="$LC_ALL"
find 'test-files' -type f -print | java -Xmx512M \
    -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
find ~/Sound/Music -type f -print | java -Xmx512M \
    -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
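One thing worth verifying before blaming the conversion itself:
whether the JVM actually resolved ISO-8859-1 as its default. Bare
charset names like ISO-8859-1 aren't necessarily valid locale names
(a full name like en_US.ISO-8859-1 usually is), so the exports above
may silently fall back to the C/POSIX locale. A quick diagnostic
(sun.jnu.encoding is a Sun/OpenJDK-specific property that, as far as
I can tell, governs filename conversion separately from
file.encoding; it may well be absent under gcj):

import java.nio.charset.Charset;

public class ShowEncodings
{
    public static void main(String[] args)
    {
        // The charset the runtime resolved for byte<->char work.
        System.out.println("defaultCharset:   "
            + Charset.defaultCharset());
        System.out.println("file.encoding:    "
            + System.getProperty("file.encoding"));
        // Sun/OpenJDK-specific, reportedly used for filenames;
        // prints null where the property doesn't exist.
        System.out.println("sun.jnu.encoding: "
            + System.getProperty("sun.jnu.encoding"));
    }
}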
I'm reading the filenames like this:
try
{
    while ((line = stdin.readLine()) != null)
    {
        // System.out.println(line);
        // System.out.flush();
        lst.add(new Sortable_file(line));
    }
}
catch (java.io.IOException e)
{
    System.err.println("**** exception " + e);
    e.printStackTrace();
}
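The stdin reader above is built with whatever charset the runtime
picked as its default. To rule that decoder out, the reader can be
constructed with an explicit charset, so that neither the locale nor
-Dfile.encoding matters on the input side. A minimal sketch, assuming
stdin is a BufferedReader (its setup isn't shown above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Decode stdin as ISO-8859-1 regardless of locale or -Dfile.encoding;
// each input byte becomes the char with the same numeric value.
BufferedReader stdin = new BufferedReader(
    new InputStreamReader(System.in, Charset.forName("ISO-8859-1")));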
Where Sortable_file's constructor just looks like:
public Sortable_file(String filename)
{
    this.filename = filename;
    /*
     * Java doesn't have a stat function without doing some fancy
     * stuff, so we skip this optimization. It really only helps
     * with hard links anyway.
     * this.device = -1;
     * this.inode = -1;
     */
    File file = new File(this.filename);
    this.size = file.length();
    // It bothers me a little that we can't close this, but perhaps
    // it's unnecessary. That'll be determined in large tests.
    // file.close();
    this.have_prefix = false;
    this.have_hash = false;
}
...and the part that actually blows up looks like:
private void get_prefix()
{
    byte[] buffer = new byte[128];
    int count = 0;
    try
    {
        // The next line is the one that gives file not found
        FileInputStream file = new FileInputStream(this.filename);
        count = file.read(buffer);
        // System.out.println("this.prefix.length " + this.prefix.length);
        file.close();
    }
    catch (IOException ioe)
    {
        // System.out.println("IO error: " + ioe);
        ioe.printStackTrace();
        System.exit(1);
    }
    // Use only the bytes actually read: read() may return fewer
    // than 128 bytes, or -1 on an empty file.
    this.prefix = new String(buffer, 0, Math.max(count, 0));
    this.have_prefix = true;
}
Interestingly, it has already fetched the file's length without any
error by the time it goes to read data from the file and runs into
trouble.
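Note, though, that java.io.File.length() is documented to return 0L
for a nonexistent file rather than throwing, so the length call
"succeeding" doesn't actually prove the JVM could reach the file. A
quick check:

import java.io.File;

public class LengthOfMissing
{
    public static void main(String[] args)
    {
        // length() returns 0L instead of throwing when the file is
        // missing, and exists() just returns false.
        System.out.println(new File("/no/such/file").length()); // 0
        System.out.println(new File("/no/such/file").exists()); // false
    }
}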
I don't -think- I'm doing anything screwy in there - could it be that
ISO-8859-1 isn't giving good round-trip conversions in practice? Would
this be an attribute of the java runtime in question, or could it be a
matter of the locale files on my Ubuntu system being a little off? It
would seem the locale files would be a better explanation (or a bug in
my program I'm not seeing!), since I get the same errors with both
OpenJDK and gcj.
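To isolate where the bytes actually change, I can at least test the
round-trip claim directly, outside the filename machinery. Here's a
small self-contained check (a throwaway harness of mine, nothing from
equivs) that a byte sequence survives a decode/encode cycle:

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class RoundTrip
{
    public static void main(String[] args)
        throws UnsupportedEncodingException
    {
        // A filename-ish byte sequence containing high-bit bytes.
        byte[] original = { 't', 'e', 's', 't', '-',
                            (byte) 0xe9, (byte) 0xfc };
        String decoded = new String(original, "ISO-8859-1");
        byte[] recoded = decoded.getBytes("ISO-8859-1");
        // Should always print true: ISO-8859-1 maps each byte to the
        // char with the same numeric value and back.
        System.out.println(Arrays.equals(original, recoded));
        // The same trip through the default charset; if this prints
        // false, the loss happens in decode/encode, not in the
        // filesystem.
        System.out.println(Arrays.equals(
            original, new String(original).getBytes()));
    }
}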
Martin Buchholz wrote:
> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
> guaranteeing no loss of data and none of those apparently impossible
> situations where the JDK gives you a list of files in a directory,
> but you get File not found when you try to open them.
>
> If you want to show the file names to users, you can always take
> your ISO-8859-1-decoded strings, turn them back into byte[],
> and decode them using UTF-8 later, if you so desire.
> (The basic OS interfaces in the JDK are not so flexible.
> They are hard-coded to use the one charset specified by file.encoding)
>
> Martin
>
> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <Naoto.Sato at sun.com> wrote:
>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd
>> rather choose UTF-8, as the default encoding on recent Unix/Linux
>> systems is UTF-8, so the filenames are likely in UTF-8.
>>
>> Naoto
>>
>> Martin Buchholz wrote:
>>> Java made the decision to use String as an abstraction
>>> for many OS-specific objects, like filenames (or environment variables).
>>> Most of the time this works fine, but occasionally you can notice
>>> that the underlying OS (in the case of Unix) actually uses
>>> arbitrary byte arrays as filenames.
>>>
>>> It would have been much more confusing to provide an interface
>>> to filenames that is sometimes a sequence of char, sometimes a
>>> sequence of byte.
>>>
>>> So this is unlikely to change.
>>>
>>> But if all you want is reliable reversible conversion,
>>> using java -Dfile.encoding=ISO-8859-1
>>> should do the trick.
>>>
>>> Martin
>>>
>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <dstromberglists at gmail.com>
>>> wrote:
>>>> Sorry if this is the wrong list for this question. I tried asking it
>>>> on comp.lang.java, but didn't get very far there.
>>>>
>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>> programs and rewriting it into a number of other languages. It
>>>> started life in python, and I've recoded it into perl
>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>> Next on my list is java. After that I'll probably do Haskell and
>>>> Eiffel/Sather.
>>>>
>>>> So the python and perl versions were pretty easy, but I'm finding that
>>>> the java version has a somewhat solution-resistant problem with
>>>> non-ASCII filenames.
>>>>
>>>> The program just reads filenames from stdin (usually generated with
>>>> the *ix find command), and then compares those files, dividing them
>>>> up into groups of equal files.
>>>>
>>>> The problem with the java version, which manifests both with OpenJDK
>>>> and gcj, is that the filenames being read from disk are 8 bit, and the
>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
>>>> but as far as the java language is concerned, those filenames are made
>>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and
>>>> back to 8 bit seems to be non-information-preserving in this case,
>>>> which isn't so fine - in an strace, I can clearly see the program
>>>> read one sequence of bytes, but then try to open a
>>>> different-though-related sequence of bytes. To be perfectly clear:
>>>> it's getting file not found errors.
>>>>
>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>>>> program to handle files with one encoding, but not another. I've
>>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>>> POSIX, UTF-8, and so on.
>>>>
>>>> Is there such a thing as a filename encoding that will map 8 bit
>>>> filenames to 16 bit characters, but only using the low 8 bits of those
>>>> 16, and then map back to 8 bit filenames only using those low 8 bits
>>>> again?
>>>>
>>>> Is there some other way of making a Java program on Linux able to read
>>>> filenames from stdin and later open those filenames?
>>>>
>>>> Thanks!
>>>>
>>
>> --
>> Naoto Sato
>>