From dstromberglists at gmail.com Tue Sep 9 17:39:14 2008 From: dstromberglists at gmail.com (Dan Stromberg) Date: Tue, 9 Sep 2008 17:39:14 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? Message-ID: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> Sorry if this is the wrong list for this question. I tried asking it on comp.lang.java, but didn't get very far there. I've been wanting to expand my horizons a bit by taking one of my programs and rewriting it into a number of other languages. It started life in python, and I've recoded it into perl (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). Next on my list is java. After that I'll probably do Haskell and Eiffel/Sather. So the python and perl versions were pretty easy, but I'm finding that the java version has a somewhat solution-resistant problem with non-ASCII filenames. The program just reads filenames from stdin (usually generated with the *ix find command), and then compares those files, dividing them up into equal groups. The problem with the java version, which manifests both with OpenJDK and gcj, is that the filenames being read from disk are 8 bit, and the filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, but as far as the java language is concerned, those filenames are made up of 16 bit characters. That's fine, but going from 8 to 16 bit and back to 8 bit seems to be non-information-preserving in this case, which isn't so fine - I can clearly see the program, in an strace, reading with one sequence of bytes, but then trying to open another-though-related sequence of bytes. To be perfectly clear: It's getting file not found errors. By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the program to handle files with one encoding, but not another. I've tried a bunch of values in these variables, including ISO-8859-1, C, POSIX, UTF-8, and so on. 
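The 8-bit → 16-bit → 8-bit information loss described above can be reproduced in miniature. In this sketch (the single filename byte is illustrative, not from the original thread), a Latin-1 byte is decoded as UTF-8; because it is malformed UTF-8, the decoder substitutes U+FFFD, so re-encoding yields different bytes and a subsequent open() would fail:

```java
public class LossyRoundTrip {
    public static void main(String[] args) throws Exception {
        // One on-disk filename byte: "e-acute" in Latin-1; not valid UTF-8 by itself.
        byte[] onDisk = {(byte) 0xE9};
        // Decoding as UTF-8 replaces the malformed byte with U+FFFD...
        String name = new String(onDisk, "UTF-8");
        // ...so encoding back yields EF BF BD: a different byte sequence,
        // hence "file not found" when used as a filename.
        byte[] reEncoded = name.getBytes("UTF-8");
        System.out.println(reEncoded.length); // prints "3", not 1
    }
}
```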
Is there such a thing as a filename encoding that will map 8 bit filenames to 16 bit characters, but only using the low 8 bits of those 16, and then map back to 8 bit filenames only using those low 8 bits again? Is there some other way of making a Java program on Linux able to read filenames from stdin and later open those filenames? Thanks! From martinrb at google.com Tue Sep 9 20:58:02 2008 From: martinrb at google.com (Martin Buchholz) Date: Tue, 9 Sep 2008 20:58:02 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? In-Reply-To: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> Message-ID: <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> Java made the decision to use String as an abstraction for many OS-specific objects, like filenames (or environment variables). Most of the time this works fine, but occasionally you can notice that the underlying OS (in the case of Unix) actually uses arbitrary byte arrays as filenames. It would have been much more confusing to provide an interface to filenames that is sometimes a sequence of char, sometimes a sequence of byte. So this is unlikely to change. But if all you want is reliable reversible conversion, using java -Dfile.encoding=ISO-8859-1 should do the trick. Martin On Tue, Sep 9, 2008 at 17:39, Dan Stromberg wrote: > Sorry if this is the wrong list for this question. I tried asking it > on comp.lang.java, but didn't get very far there. > > I've been wanting to expand my horizons a bit by taking one of my > programs and rewriting it into a number of other languages. It > started life in python, and I've recoded it into perl > (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). > Next on my list is java. After that I'll probably do Haskell and > Eiffel/Sather. 
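The reversibility Martin describes holds for every byte value, since ISO-8859-1 maps byte N to the char with code point N and back. A minimal sketch (class name is illustrative):

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] original = new byte[256];
        for (int i = 0; i < 256; i++) {
            original[i] = (byte) i;
        }
        // ISO-8859-1 maps byte N to char N, so decode + encode is lossless.
        String decoded = new String(original, "ISO-8859-1");
        byte[] restored = decoded.getBytes("ISO-8859-1");
        System.out.println(Arrays.equals(original, restored)); // prints "true"
    }
}
```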
> > So the python and perl versions were pretty easy, but I'm finding that > the java version has a somewhat solution-resistant problem with > non-ASCII filenames. > > The program just reads filenames from stdin (usually generated with > the *ix find command), and then compares those files, dividing them up > into equal groups. > > The problem with the java version, which manifests both with OpenJDK > and gcj, is that the filenames being read from disk are 8 bit, and the > filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, > but as far as the java language is concerned, those filenames are made > up of 16 bit characters. That's fine, but going from 8 to 16 bit and > back to 8 bit seems to be non-information-preserving in this case, > which isn't so fine - I can clearly see the program, in an strace, > reading with one sequence of bytes, but then trying to open > another-though-related sequence of bytes. To be perfectly clear: It's > getting file not found errors. > > By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the > program to handle files with one encoding, but not another. I've > tried a bunch of values in these variables, including ISO-8859-1, C, > POSIX, UTF-8, and so on. > > Is there such a thing as a filename encoding that will map 8 bit > filenames to 16 bit characters, but only using the low 8 bits of those > 16, and then map back to 8 bit filenames only using those low 8 bits > again? > > Is there some other way of making a Java program on Linux able to read > filenames from stdin and later open those filenames? > > Thanks! > From Naoto.Sato at Sun.COM Wed Sep 10 14:54:31 2008 From: Naoto.Sato at Sun.COM (Naoto Sato) Date: Wed, 10 Sep 2008 14:54:31 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> Message-ID: <48C84217.9000709@Sun.COM> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd rather choose UTF-8, as the default encoding on recent Unix/Linux systems is UTF-8, so the filenames are likely in UTF-8. Naoto Martin Buchholz wrote: > Java made the decision to use String as an abstraction > for many OS-specific objects, like filenames (or environment variables). > Most of the time this works fine, but occasionally you can notice > that the underlying OS (in the case of Unix) actually uses > arbitrary byte arrays as filenames. > > It would have been much more confusing to provide an interface > to filenames that is sometimes a sequence of char, sometimes a > sequence of byte. > > So this is unlikely to change. > > But if all you want is reliable reversible conversion, > using java -Dfile.encoding=ISO-8859-1 > should do the trick. > > Martin > > On Tue, Sep 9, 2008 at 17:39, Dan Stromberg wrote: >> Sorry if this is the wrong list for this question. I tried asking it >> on comp.lang.java, but didn't get very far there. >> >> I've been wanting to expand my horizons a bit by taking one of my >> programs and rewriting it into a number of other languages. It >> started life in python, and I've recoded it into perl >> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >> Next on my list is java. After that I'll probably do Haskell and >> Eiffel/Sather. >> >> So the python and perl versions were pretty easy, but I'm finding that >> the java version has a somewhat solution-resistant problem with >> non-ASCII filenames. >> >> The program just reads filenames from stdin (usually generated with >> the *ix find command), and then compares those files, dividing them up >> into equal groups. 
>> >> The problem with the java version, which manifests both with OpenJDK >> and gcj, is that the filenames being read from disk are 8 bit, and the >> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >> but as far as the java language is concerned, those filenames are made >> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >> back to 8 bit seems to be non-information-preserving in this case, >> which isn't so fine - I can clearly see the program, in an strace, >> reading with one sequence of bytes, but then trying to open >> another-though-related sequence of bytes. To be perfectly clear: It's >> getting file not found errors. >> >> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >> program to handle files with one encoding, but not another. I've >> tried a bunch of values in these variables, including ISO-8859-1, C, >> POSIX, UTF-8, and so on. >> >> Is there such a thing as a filename encoding that will map 8 bit >> filenames to 16 bit characters, but only using the low 8 bits of those >> 16, and then map back to 8 bit filenames only using those low 8 bits >> again? >> >> Is there some other way of making a Java program on Linux able to read >> filenames from stdin and later open those filenames? >> >> Thanks! >> -- Naoto Sato From martinrb at google.com Wed Sep 10 15:14:16 2008 From: martinrb at google.com (Martin Buchholz) Date: Wed, 10 Sep 2008 15:14:16 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <48C84217.9000709@Sun.COM> References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> Message-ID: <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> ISO-8859-1 guarantees round-trip conversion between bytes and chars, so no data is lost and you avoid apparently impossible situations where the JDK gives you a list of files in a directory, but you get File not found when you try to open them. If you want to show the file names to users, you can always take your ISO-8859-1 decoded strings, turn them back into byte[], and decode using UTF-8 later, if you so desire. (The basic OS interfaces in the JDK are not so flexible. They are hard-coded to use the one charset specified by file.encoding) Martin On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: > Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd > rather choose UTF-8, as the default encoding on recent Unix/Linux are all > UTF-8 so the filenames are likely in UTF-8. > > Naoto > > Martin Buchholz wrote: >> >> Java made the decision to use String as an abstraction >> for many OS-specific objects, like filenames (or environment variables). >> Most of the time this works fine, but occasionally you can notice >> that the underlying OS (in the case of Unix) actually uses >> arbitrary byte arrays as filenames. >> >> It would have been much more confusing to provide an interface >> to filenames that is sometimes a sequence of char, sometimes a >> sequence of byte. >> >> So this is unlikely to change. >> >> But if all you want is reliable reversible conversion, >> using java -Dfile.encoding=ISO-8859-1 >> should do the trick. >> >> Martin >> >> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >> wrote: >>> >>> Sorry if this is the wrong list for this question. I tried asking it >>> on comp.lang.java, but didn't get very far there. 
>>> >>> I've been wanting to expand my horizons a bit by taking one of my >>> programs and rewriting it into a number of other languages. It >>> started life in python, and I've recoded it into perl >>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>> Next on my list is java. After that I'll probably do Haskell and >>> Eiffel/Sather. >>> >>> So the python and perl versions were pretty easy, but I'm finding that >>> the java version has a somewhat solution-resistant problem with >>> non-ASCII filenames. >>> >>> The program just reads filenames from stdin (usually generated with >>> the *ix find command), and then compares those files, dividing them up >>> into equal groups. >>> >>> The problem with the java version, which manifests both with OpenJDK >>> and gcj, is that the filenames being read from disk are 8 bit, and the >>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>> but as far as the java language is concerned, those filenames are made >>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>> back to 8 bit seems to be non-information-preserving in this case, >>> which isn't so fine - I can clearly see the program, in an strace, >>> reading with one sequence of bytes, but then trying to open >>> another-though-related sequence of bytes. To be perfectly clear: It's >>> getting file not found errors. >>> >>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>> program to handle files with one encoding, but not another. I've >>> tried a bunch of values in these variables, including ISO-8859-1, C, >>> POSIX, UTF-8, and so on. >>> >>> Is there such a thing as a filename encoding that will map 8 bit >>> filenames to 16 bit characters, but only using the low 8 bits of those >>> 16, and then map back to 8 bit filenames only using those low 8 bits >>> again? 
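Martin's suggestion earlier in this message — decode with ISO-8859-1 internally and re-decode as UTF-8 only for display — can be sketched as follows (the filename bytes and class name are hypothetical):

```java
public class DisplayTrick {
    public static void main(String[] args) throws Exception {
        // Hypothetical on-disk filename bytes: "e-acute" plus ".txt", UTF-8 encoded.
        byte[] raw = {(byte) 0xC3, (byte) 0xA9, '.', 't', 'x', 't'};
        // Internally, decode with ISO-8859-1: lossless and reversible.
        String internal = new String(raw, "ISO-8859-1");
        // For display only: recover the original bytes, then re-decode as UTF-8.
        String display = new String(internal.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(display); // the accented name, on a UTF-8 terminal
    }
}
```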
>>> >>> Is there some other way of making a Java program on Linux able to read >>> filenames from stdin and later open those filenames? >>> >>> Thanks! >>> > > > -- > Naoto Sato > From strombrg at gmail.com Wed Sep 10 17:50:49 2008 From: strombrg at gmail.com (Dan Stromberg) Date: Wed, 10 Sep 2008 17:50:49 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? In-Reply-To: <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> Message-ID: <48C86B69.2020501@gmail.com> Would you believe that I'm getting file not found errors even with ISO-8859-1? (Naoto: My program doesn't know what encoding to expect - I'm afraid I probably have different applications writing filenames in different encodings on my Ubuntu system. I'd been thinking I wanted to treat filenames as just a sequence of bytes, and let the terminal emulator interpret the encoding (hopefully) correctly on output). This gives two file not found tracebacks: export LC_ALL='ISO-8859-1' export LC_CTYPE="$LC_ALL" export LANG="$LC_ALL" find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main I'm reading the filenames like (please forgive the weird indentation) : try { while((line = stdin.readLine()) != null) { // System.out.println(line); // System.out.flush(); lst.add(new Sortable_file(line)); } } catch(java.io.IOException e) { System.err.println("**** exception " + e); e.printStackTrace(); } Where Sortable_file's constructor just looks like: public Sortable_file(String filename) { this.filename = filename; /* Java doesn't have a stat function without doing some fancy stuff, so we skip this optimization. 
It really only helps with hard links anyway. this.device = -1 this.inode = -1 */ File file = new File(this.filename); this.size = file.length(); // It bothers a little that we can't close this, but perhaps it's unnecessary. That'll // be determined in large tests. // file.close(); this.have_prefix = false; this.have_hash = false; } ..and the part that actually blows up looks like: private void get_prefix() { byte[] buffer = new byte[128]; try { // The next line is the one that gives file not found FileInputStream file = new FileInputStream(this.filename); file.read(buffer); // System.out.println("this.prefix.length " + this.prefix.length); file.close(); } catch (IOException ioe) { // System.out.println( "IO error: " + ioe ); ioe.printStackTrace(); System.exit(1); } this.prefix = new String(buffer); this.have_prefix = true; } Interestingly, it's already tried to get the file's length without an error when it goes to read data from the file and has trouble. I don't -think- I'm doing anything screwy in there - could it be that ISO-8859-1 isn't giving good round-trip conversions in practice? Would this be an attribute of the java runtime in question, or could it be a matter of the locale files on my Ubuntu system being a little off? It would seem the locale files would be a better explanation (or a bug in my program I'm not seeing!), since I get the same errors with both OpenJDK and gcj. Martin Buchholz wrote: > ISO-8859-1 guarantees round-trip conversion between bytes and chars, > guarateeing no loss of data, or getting apparently impossible situations > where the JDK gives you a list of files in a directory, but you get > File not found when you try to open them. > > If you want to show the file names to users, you can always take > your ISO-8859-1 decoded strings, turn them back into byte[], > and decode using UTF-8 later, if you so desired. > (The basic OS interfaces in the JDK are not so flexible. 
> They are hard-coded to use the one charset specified by file.encoding) > > Martin > > On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: >> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd >> rather choose UTF-8, as the default encoding on recent Unix/Linux are all >> UTF-8 so the filenames are likely in UTF-8. >> >> Naoto >> >> Martin Buchholz wrote: >>> Java made the decision to use String as an abstraction >>> for many OS-specific objects, like filenames (or environment variables). >>> Most of the time this works fine, but occasionally you can notice >>> that the underlying OS (in the case of Unix) actually uses >>> arbitrary byte arrays as filenames. >>> >>> It would have been much more confusing to provide an interface >>> to filenames that is sometimes a sequence of char, sometimes a >>> sequence of byte. >>> >>> So this is unlikely to change. >>> >>> But if all you want is reliable reversible conversion, >>> using java -Dfile.encoding=ISO-8859-1 >>> should do the trick. >>> >>> Martin >>> >>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >>> wrote: >>>> Sorry if this is the wrong list for this question. I tried asking it >>>> on comp.lang.java, but didn't get very far there. >>>> >>>> I've been wanting to expand my horizons a bit by taking one of my >>>> programs and rewriting it into a number of other languages. It >>>> started life in python, and I've recoded it into perl >>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>>> Next on my list is java. After that I'll probably do Haskell and >>>> Eiffel/Sather. >>>> >>>> So the python and perl versions were pretty easy, but I'm finding that >>>> the java version has a somewhat solution-resistant problem with >>>> non-ASCII filenames. >>>> >>>> The program just reads filenames from stdin (usually generated with >>>> the *ix find command), and then compares those files, dividing them up >>>> into equal groups. 
>>>> >>>> The problem with the java version, which manifests both with OpenJDK >>>> and gcj, is that the filenames being read from disk are 8 bit, and the >>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>>> but as far as the java language is concerned, those filenames are made >>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>>> back to 8 bit seems to be non-information-preserving in this case, >>>> which isn't so fine - I can clearly see the program, in an strace, >>>> reading with one sequence of bytes, but then trying to open >>>> another-though-related sequence of bytes. To be perfectly clear: It's >>>> getting file not found errors. >>>> >>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>>> program to handle files with one encoding, but not another. I've >>>> tried a bunch of values in these variables, including ISO-8859-1, C, >>>> POSIX, UTF-8, and so on. >>>> >>>> Is there such a thing as a filename encoding that will map 8 bit >>>> filenames to 16 bit characters, but only using the low 8 bits of those >>>> 16, and then map back to 8 bit filenames only using those low 8 bits >>>> again? >>>> >>>> Is there some other way of making a Java program on Linux able to read >>>> filenames from stdin and later open those filenames? >>>> >>>> Thanks! >>>> >> >> -- >> Naoto Sato >> From strombrg at gmail.com Wed Sep 10 17:52:48 2008 From: strombrg at gmail.com (Dan Stromberg) Date: Wed, 10 Sep 2008 17:52:48 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> Message-ID: <48C86BE0.9070905@gmail.com> Would you believe that I'm getting file not found errors even with ISO-8859-1? (Naoto: My program doesn't know what encoding to expect - I'm afraid I probably have different applications writing filenames in different encodings on my Ubuntu system. I'd been thinking I wanted to treat filenames as just a sequence of bytes, and let the terminal emulator interpret the encoding (hopefully) correctly on output). This gives two file not found tracebacks: export LC_ALL='ISO-8859-1' export LC_CTYPE="$LC_ALL" export LANG="$LC_ALL" find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main I'm reading the filenames like (please forgive the weird indentation) : try { while((line = stdin.readLine()) != null) { // System.out.println(line); // System.out.flush(); lst.add(new Sortable_file(line)); } } catch(java.io.IOException e) { System.err.println("**** exception " + e); e.printStackTrace(); } Where Sortable_file's constructor just looks like: public Sortable_file(String filename) { this.filename = filename; /* Java doesn't have a stat function without doing some fancy stuff, so we skip this optimization. It really only helps with hard links anyway. this.device = -1 this.inode = -1 */ File file = new File(this.filename); this.size = file.length(); // It bothers a little that we can't close this, but perhaps it's unnecessary. That'll // be determined in large tests. 
// file.close(); this.have_prefix = false; this.have_hash = false; } ..and the part that actually blows up looks like: private void get_prefix() { byte[] buffer = new byte[128]; try { // The next line is the one that gives file not found FileInputStream file = new FileInputStream(this.filename); file.read(buffer); // System.out.println("this.prefix.length " + this.prefix.length); file.close(); } catch (IOException ioe) { // System.out.println( "IO error: " + ioe ); ioe.printStackTrace(); System.exit(1); } this.prefix = new String(buffer); this.have_prefix = true; } Interestingly, it's already tried to get the file's length without an error when it goes to read data from the file and has trouble. I don't -think- I'm doing anything screwy in there - could it be that ISO-8859-1 isn't giving good round-trip conversions in practice? Would this be an attribute of the java runtime in question, or could it be a matter of the locale files on my Ubuntu system being a little off? It would seem the locale files would be a better explanation (or a bug in my program I'm not seeing!), since I get the same errors with both OpenJDK and gcj. Martin Buchholz wrote: > ISO-8859-1 guarantees round-trip conversion between bytes and chars, > guarateeing no loss of data, or getting apparently impossible situations > where the JDK gives you a list of files in a directory, but you get > File not found when you try to open them. > > If you want to show the file names to users, you can always take > your ISO-8859-1 decoded strings, turn them back into byte[], > and decode using UTF-8 later, if you so desired. > (The basic OS interfaces in the JDK are not so flexible. > They are hard-coded to use the one charset specified by file.encoding) > > Martin > > On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: >> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. 
I'd >> rather choose UTF-8, as the default encoding on recent Unix/Linux are all >> UTF-8 so the filenames are likely in UTF-8. >> >> Naoto >> >> Martin Buchholz wrote: >>> Java made the decision to use String as an abstraction >>> for many OS-specific objects, like filenames (or environment variables). >>> Most of the time this works fine, but occasionally you can notice >>> that the underlying OS (in the case of Unix) actually uses >>> arbitrary byte arrays as filenames. >>> >>> It would have been much more confusing to provide an interface >>> to filenames that is sometimes a sequence of char, sometimes a >>> sequence of byte. >>> >>> So this is unlikely to change. >>> >>> But if all you want is reliable reversible conversion, >>> using java -Dfile.encoding=ISO-8859-1 >>> should do the trick. >>> >>> Martin >>> >>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >>> wrote: >>>> Sorry if this is the wrong list for this question. I tried asking it >>>> on comp.lang.java, but didn't get very far there. >>>> >>>> I've been wanting to expand my horizons a bit by taking one of my >>>> programs and rewriting it into a number of other languages. It >>>> started life in python, and I've recoded it into perl >>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>>> Next on my list is java. After that I'll probably do Haskell and >>>> Eiffel/Sather. >>>> >>>> So the python and perl versions were pretty easy, but I'm finding that >>>> the java version has a somewhat solution-resistant problem with >>>> non-ASCII filenames. >>>> >>>> The program just reads filenames from stdin (usually generated with >>>> the *ix find command), and then compares those files, dividing them up >>>> into equal groups. 
>>>> >>>> The problem with the java version, which manifests both with OpenJDK >>>> and gcj, is that the filenames being read from disk are 8 bit, and the >>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>>> but as far as the java language is concerned, those filenames are made >>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>>> back to 8 bit seems to be non-information-preserving in this case, >>>> which isn't so fine - I can clearly see the program, in an strace, >>>> reading with one sequence of bytes, but then trying to open >>>> another-though-related sequence of bytes. To be perfectly clear: It's >>>> getting file not found errors. >>>> >>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>>> program to handle files with one encoding, but not another. I've >>>> tried a bunch of values in these variables, including ISO-8859-1, C, >>>> POSIX, UTF-8, and so on. >>>> >>>> Is there such a thing as a filename encoding that will map 8 bit >>>> filenames to 16 bit characters, but only using the low 8 bits of those >>>> 16, and then map back to 8 bit filenames only using those low 8 bits >>>> again? >>>> >>>> Is there some other way of making a Java program on Linux able to read >>>> filenames from stdin and later open those filenames? >>>> >>>> Thanks! >>>> >> >> -- >> Naoto Sato >> From dstromberglists at gmail.com Fri Sep 12 12:05:36 2008 From: dstromberglists at gmail.com (Dan Stromberg) Date: Fri, 12 Sep 2008 12:05:36 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <48C86BE0.9070905@gmail.com> References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> <48C86BE0.9070905@gmail.com> Message-ID: <33c5a6c30809121205s3c18d335oada44d9ff69c4462@mail.gmail.com> Since the thread seems to be trailing off... Does anyone know of any mailing lists that might be more appropriate for this question? Also, is there another OS I should try (perhaps in a little QEMU) for a point of comparison? Preferably something that also uses 8 bit filenames, but would have very different localization data and code other than the java runtimes themselves? Does FreeBSD fit this description? On Wed, Sep 10, 2008 at 5:52 PM, Dan Stromberg wrote: > > Would you believe that I'm getting file not found errors even with > ISO-8859-1? > > (Naoto: My program doesn't know what encoding to expect - I'm afraid I > probably have different applications writing filenames in different > encodings on my Ubuntu system. I'd been thinking I wanted to treat > filenames as just a sequence of bytes, and let the terminal emulator > interpret the encoding (hopefully) correctly on output). 
> > > > This gives two file not found tracebacks: > > export LC_ALL='ISO-8859-1' > export LC_CTYPE="$LC_ALL" > export LANG="$LC_ALL" > > find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 > -jar equivs.jar equivs.main > > find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 > -jar equivs.jar equivs.main > > > > I'm reading the filenames like (please forgive the weird indentation) : > > try { > > while((line = stdin.readLine()) != null) { > // System.out.println(line); > // System.out.flush(); > lst.add(new Sortable_file(line)); > } > } > catch(java.io.IOException e) > { > System.err.println("**** exception " + e); > e.printStackTrace(); } > > > > Where Sortable_file's constructor just looks like: > > public Sortable_file(String filename) > { > this.filename = filename; > /* > Java doesn't have a stat function without doing some fancy stuff, so we > skip this > optimization. It really only helps with hard links anyway. > this.device = -1 > this.inode = -1 > */ > File file = new File(this.filename); > this.size = file.length(); > // It bothers a little that we can't close this, but perhaps it's > unnecessary. That'll > // be determined in large tests. > // file.close(); > this.have_prefix = false; > this.have_hash = false; > } > > > > ..and the part that actually blows up looks like: > > private void get_prefix() > { > byte[] buffer = new byte[128]; > try > { > // The next line is the one that gives file not found > FileInputStream file = new FileInputStream(this.filename); > file.read(buffer); > // System.out.println("this.prefix.length " + this.prefix.length); > file.close(); > } > catch (IOException ioe) > { > // System.out.println( "IO error: " + ioe ); > ioe.printStackTrace(); > System.exit(1); > } > this.prefix = new String(buffer); > this.have_prefix = true; > } > > > > Interestingly, it's already tried to get the file's length without an error > when it goes to read data from the file and has trouble. 
> > I don't -think- I'm doing anything screwy in there - could it be that > ISO-8859-1 isn't giving good round-trip conversions in practice? Would this > be an attribute of the java runtime in question, or could it be a matter of > the locale files on my Ubuntu system being a little off? It would seem the > locale files would be a better explanation (or a bug in my program I'm not > seeing!), since I get the same errors with both OpenJDK and gcj. > > Martin Buchholz wrote: >> >> ISO-8859-1 guarantees round-trip conversion between bytes and chars, >> guarateeing no loss of data, or getting apparently impossible situations >> where the JDK gives you a list of files in a directory, but you get >> File not found when you try to open them. >> >> If you want to show the file names to users, you can always take >> your ISO-8859-1 decoded strings, turn them back into byte[], >> and decode using UTF-8 later, if you so desired. >> (The basic OS interfaces in the JDK are not so flexible. >> They are hard-coded to use the one charset specified by file.encoding) >> >> Martin >> >> On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: >>> >>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd >>> rather choose UTF-8, as the default encoding on recent Unix/Linux are all >>> UTF-8 so the filenames are likely in UTF-8. >>> >>> Naoto >>> >>> Martin Buchholz wrote: >>>> >>>> Java made the decision to use String as an abstraction >>>> for many OS-specific objects, like filenames (or environment variables). >>>> Most of the time this works fine, but occasionally you can notice >>>> that the underlying OS (in the case of Unix) actually uses >>>> arbitrary byte arrays as filenames. >>>> >>>> It would have been much more confusing to provide an interface >>>> to filenames that is sometimes a sequence of char, sometimes a >>>> sequence of byte. >>>> >>>> So this is unlikely to change. 
>>>> >>>> But if all you want is reliable reversible conversion, >>>> using java -Dfile.encoding=ISO-8859-1 >>>> should do the trick. >>>> >>>> Martin >>>> >>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >>>> wrote: >>>>> >>>>> Sorry if this is the wrong list for this question. I tried asking it >>>>> on comp.lang.java, but didn't get very far there. >>>>> >>>>> I've been wanting to expand my horizons a bit by taking one of my >>>>> programs and rewriting it into a number of other languages. It >>>>> started life in python, and I've recoded it into perl >>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>>>> Next on my list is java. After that I'll probably do Haskell and >>>>> Eiffel/Sather. >>>>> >>>>> So the python and perl versions were pretty easy, but I'm finding that >>>>> the java version has a somewhat solution-resistant problem with >>>>> non-ASCII filenames. >>>>> >>>>> The program just reads filenames from stdin (usually generated with >>>>> the *ix find command), and then compares those files, dividing them up >>>>> into equal groups. >>>>> >>>>> The problem with the java version, which manifests both with OpenJDK >>>>> and gcj, is that the filenames being read from disk are 8 bit, and the >>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>>>> but as far as the java language is concerned, those filenames are made >>>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>>>> back to 8 bit seems to be non-information-preserving in this case, >>>>> which isn't so fine - I can clearly see the program, in an strace, >>>>> reading with one sequence of bytes, but then trying to open >>>>> another-though-related sequence of bytes. To be perfectly clear: It's >>>>> getting file not found errors. >>>>> >>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>>>> program to handle files with one encoding, but not another. 
I've
>>>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>>>> POSIX, UTF-8, and so on.
>>>>>
>>>>> Is there such a thing as a filename encoding that will map 8 bit
>>>>> filenames to 16 bit characters, but only using the low 8 bits of those
>>>>> 16, and then map back to 8 bit filenames only using those low 8 bits
>>>>> again?
>>>>>
>>>>> Is there some other way of making a Java program on Linux able to read
>>>>> filenames from stdin and later open those filenames?
>>>>>
>>>>> Thanks!
>>>>>
>>>
>>> --
>>> Naoto Sato
>>>
>
>
From martinrb at google.com Fri Sep 12 12:49:13 2008
From: martinrb at google.com (Martin Buchholz)
Date: Fri, 12 Sep 2008 12:49:13 -0700
Subject: Reading Linux filenames in a way that will map back the same on open?
In-Reply-To: <48C86B69.2020501@gmail.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> <48C86B69.2020501@gmail.com>
Message-ID: <1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com>

On Wed, Sep 10, 2008 at 17:50, Dan Stromberg wrote:
>
> Would you believe that I'm getting file not found errors even with
> ISO-8859-1?

The software world is full of surprises.

Try
export LANG=C LC_ALL=C LC_CTYPE=C
java ... -Dfile.encoding=ISO-8859-1 ...

You could also be explicit about the encoding used when doing any kind of
char<->byte conversion, e.g. reading from stdin or writing to stdout.

Oh, and this applies only to traditional Unix systems like Linux and Solaris.
Windows and MacOSX act (or at least should act) very differently in this area.

Martin

> (Naoto: My program doesn't know what encoding to expect - I'm afraid I
> probably have different applications writing filenames in different
> encodings on my Ubuntu system.
I'd been thinking I wanted to treat > filenames as just a sequence of bytes, and let the terminal emulator > interpret the encoding (hopefully) correctly on output). > > > > This gives two file not found tracebacks: > > export LC_ALL='ISO-8859-1' > export LC_CTYPE="$LC_ALL" > export LANG="$LC_ALL" > > find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 > -jar equivs.jar equivs.main > > find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 > -jar equivs.jar equivs.main > > > > I'm reading the filenames like (please forgive the weird indentation) : > > try { > > while((line = stdin.readLine()) != null) > { > // System.out.println(line); > // System.out.flush(); > lst.add(new Sortable_file(line)); > } > } > catch(java.io.IOException e) > { > System.err.println("**** exception " + e); > e.printStackTrace(); } > > > > Where Sortable_file's constructor just looks like: > > public Sortable_file(String filename) > { > this.filename = filename; > /* > Java doesn't have a stat function without doing some fancy stuff, so we > skip this > optimization. It really only helps with hard links anyway. > this.device = -1 > this.inode = -1 > */ > File file = new File(this.filename); > this.size = file.length(); > // It bothers a little that we can't close this, but perhaps it's > unnecessary. That'll > // be determined in large tests. 
> // file.close(); > this.have_prefix = false; > this.have_hash = false; > } > > > > ..and the part that actually blows up looks like: > > private void get_prefix() > { > byte[] buffer = new byte[128]; > try > { > // The next line is the one that gives file not found > FileInputStream file = new FileInputStream(this.filename); > file.read(buffer); > // System.out.println("this.prefix.length " + this.prefix.length); > file.close(); > } > catch (IOException ioe) > { > // System.out.println( "IO error: " + ioe ); > ioe.printStackTrace(); > System.exit(1); > } > this.prefix = new String(buffer); > this.have_prefix = true; > } > > > > Interestingly, it's already tried to get the file's length without an error > when it goes to read data from the file and has trouble. > > I don't -think- I'm doing anything screwy in there - could it be that > ISO-8859-1 isn't giving good round-trip conversions in practice? Would this > be an attribute of the java runtime in question, or could it be a matter of > the locale files on my Ubuntu system being a little off? It would seem the > locale files would be a better explanation (or a bug in my program I'm not > seeing!), since I get the same errors with both OpenJDK and gcj. > > Martin Buchholz wrote: >> >> ISO-8859-1 guarantees round-trip conversion between bytes and chars, >> guarateeing no loss of data, or getting apparently impossible situations >> where the JDK gives you a list of files in a directory, but you get >> File not found when you try to open them. >> >> If you want to show the file names to users, you can always take >> your ISO-8859-1 decoded strings, turn them back into byte[], >> and decode using UTF-8 later, if you so desired. >> (The basic OS interfaces in the JDK are not so flexible. >> They are hard-coded to use the one charset specified by file.encoding) >> >> Martin >> >> On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: >>> >>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. 
I'd >>> rather choose UTF-8, as the default encoding on recent Unix/Linux are all >>> UTF-8 so the filenames are likely in UTF-8. >>> >>> Naoto >>> >>> Martin Buchholz wrote: >>>> >>>> Java made the decision to use String as an abstraction >>>> for many OS-specific objects, like filenames (or environment variables). >>>> Most of the time this works fine, but occasionally you can notice >>>> that the underlying OS (in the case of Unix) actually uses >>>> arbitrary byte arrays as filenames. >>>> >>>> It would have been much more confusing to provide an interface >>>> to filenames that is sometimes a sequence of char, sometimes a >>>> sequence of byte. >>>> >>>> So this is unlikely to change. >>>> >>>> But if all you want is reliable reversible conversion, >>>> using java -Dfile.encoding=ISO-8859-1 >>>> should do the trick. >>>> >>>> Martin >>>> >>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >>>> wrote: >>>>> >>>>> Sorry if this is the wrong list for this question. I tried asking it >>>>> on comp.lang.java, but didn't get very far there. >>>>> >>>>> I've been wanting to expand my horizons a bit by taking one of my >>>>> programs and rewriting it into a number of other languages. It >>>>> started life in python, and I've recoded it into perl >>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>>>> Next on my list is java. After that I'll probably do Haskell and >>>>> Eiffel/Sather. >>>>> >>>>> So the python and perl versions were pretty easy, but I'm finding that >>>>> the java version has a somewhat solution-resistant problem with >>>>> non-ASCII filenames. >>>>> >>>>> The program just reads filenames from stdin (usually generated with >>>>> the *ix find command), and then compares those files, dividing them up >>>>> into equal groups. 
>>>>> >>>>> The problem with the java version, which manifests both with OpenJDK >>>>> and gcj, is that the filenames being read from disk are 8 bit, and the >>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>>>> but as far as the java language is concerned, those filenames are made >>>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>>>> back to 8 bit seems to be non-information-preserving in this case, >>>>> which isn't so fine - I can clearly see the program, in an strace, >>>>> reading with one sequence of bytes, but then trying to open >>>>> another-though-related sequence of bytes. To be perfectly clear: It's >>>>> getting file not found errors. >>>>> >>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>>>> program to handle files with one encoding, but not another. I've >>>>> tried a bunch of values in these variables, including ISO-8859-1, C, >>>>> POSIX, UTF-8, and so on. >>>>> >>>>> Is there such a thing as a filename encoding that will map 8 bit >>>>> filenames to 16 bit characters, but only using the low 8 bits of those >>>>> 16, and then map back to 8 bit filenames only using those low 8 bits >>>>> again? >>>>> >>>>> Is there some other way of making a Java program on Linux able to read >>>>> filenames from stdin and later open those filenames? >>>>> >>>>> Thanks! >>>>> >>> >>> -- >>> Naoto Sato >>> > > From strombrg at gmail.com Sat Sep 13 09:32:52 2008 From: strombrg at gmail.com (Dan Stromberg) Date: Sat, 13 Sep 2008 09:32:52 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> <48C86B69.2020501@gmail.com>
Message-ID: <48CBEB34.8030709@gmail.com>

Sadly, I'm still getting ghost files with C and ISO-8859-1:

./wrapper
+ case 3 in
+ export LC_ALL=C
+ LC_ALL=C
+ export LC_CTYPE=C
+ LC_CTYPE=C
+ export LANG=C
+ LANG=C
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o (Argentina).flac (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at Sortable_file.get_prefix(Sortable_file.java:56)
        at Sortable_file.compareTo(Sortable_file.java:159)
        at Sortable_file.compareTo(Sortable_file.java:1)
        at java.util.Arrays.mergeSort(Arrays.java:1167)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1156)
        at java.util.Arrays.mergeSort(Arrays.java:1156)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.sort(Arrays.java:1079)
        at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1

Martin Buchholz wrote:
> On Wed, Sep 10, 2008 at 17:50, Dan Stromberg wrote:
>>
>> Would you believe that I'm getting file not found errors even with
>> ISO-8859-1?
>
> The software world is full of suprises.
> > Try > export LANG=C LC_ALL=C LC_CTYPE=C > java ... -Dfile.encoding=ISO-8859-1 ... > > You could also be explicit about the > encoding used when doing any kind of char<->byte > conversion, e.g. reading from stdin or writing to stdout. > > Oh, and this is only traditional Unix systems like > Linux and Solaris. Windows and MacOSX > (at least should) act very differently in this area. > > Martin > >> (Naoto: My program doesn't know what encoding to expect - I'm afraid I >> probably have different applications writing filenames in different >> encodings on my Ubuntu system. I'd been thinking I wanted to treat >> filenames as just a sequence of bytes, and let the terminal emulator >> interpret the encoding (hopefully) correctly on output). >> >> >> >> This gives two file not found tracebacks: >> >> export LC_ALL='ISO-8859-1' >> export LC_CTYPE="$LC_ALL" >> export LANG="$LC_ALL" >> >> find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 >> -jar equivs.jar equivs.main >> >> find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 >> -jar equivs.jar equivs.main >> >> >> >> I'm reading the filenames like (please forgive the weird indentation) : >> >> try { >> >> while((line = stdin.readLine()) != null) >> { >> // System.out.println(line); >> // System.out.flush(); >> lst.add(new Sortable_file(line)); >> } >> } >> catch(java.io.IOException e) >> { >> System.err.println("**** exception " + e); >> e.printStackTrace(); } >> >> >> >> Where Sortable_file's constructor just looks like: >> >> public Sortable_file(String filename) >> { >> this.filename = filename; >> /* >> Java doesn't have a stat function without doing some fancy stuff, so we >> skip this >> optimization. It really only helps with hard links anyway. >> this.device = -1 >> this.inode = -1 >> */ >> File file = new File(this.filename); >> this.size = file.length(); >> // It bothers a little that we can't close this, but perhaps it's >> unnecessary. 
That'll >> // be determined in large tests. >> // file.close(); >> this.have_prefix = false; >> this.have_hash = false; >> } >> >> >> >> ..and the part that actually blows up looks like: >> >> private void get_prefix() >> { >> byte[] buffer = new byte[128]; >> try >> { >> // The next line is the one that gives file not found >> FileInputStream file = new FileInputStream(this.filename); >> file.read(buffer); >> // System.out.println("this.prefix.length " + this.prefix.length); >> file.close(); >> } >> catch (IOException ioe) >> { >> // System.out.println( "IO error: " + ioe ); >> ioe.printStackTrace(); >> System.exit(1); >> } >> this.prefix = new String(buffer); >> this.have_prefix = true; >> } >> >> >> >> Interestingly, it's already tried to get the file's length without an error >> when it goes to read data from the file and has trouble. >> >> I don't -think- I'm doing anything screwy in there - could it be that >> ISO-8859-1 isn't giving good round-trip conversions in practice? Would this >> be an attribute of the java runtime in question, or could it be a matter of >> the locale files on my Ubuntu system being a little off? It would seem the >> locale files would be a better explanation (or a bug in my program I'm not >> seeing!), since I get the same errors with both OpenJDK and gcj. >> >> Martin Buchholz wrote: >>> ISO-8859-1 guarantees round-trip conversion between bytes and chars, >>> guarateeing no loss of data, or getting apparently impossible situations >>> where the JDK gives you a list of files in a directory, but you get >>> File not found when you try to open them. >>> >>> If you want to show the file names to users, you can always take >>> your ISO-8859-1 decoded strings, turn them back into byte[], >>> and decode using UTF-8 later, if you so desired. >>> (The basic OS interfaces in the JDK are not so flexible. 
>>> They are hard-coded to use the one charset specified by file.encoding) >>> >>> Martin >>> >>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: >>>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd >>>> rather choose UTF-8, as the default encoding on recent Unix/Linux are all >>>> UTF-8 so the filenames are likely in UTF-8. >>>> >>>> Naoto >>>> >>>> Martin Buchholz wrote: >>>>> Java made the decision to use String as an abstraction >>>>> for many OS-specific objects, like filenames (or environment variables). >>>>> Most of the time this works fine, but occasionally you can notice >>>>> that the underlying OS (in the case of Unix) actually uses >>>>> arbitrary byte arrays as filenames. >>>>> >>>>> It would have been much more confusing to provide an interface >>>>> to filenames that is sometimes a sequence of char, sometimes a >>>>> sequence of byte. >>>>> >>>>> So this is unlikely to change. >>>>> >>>>> But if all you want is reliable reversible conversion, >>>>> using java -Dfile.encoding=ISO-8859-1 >>>>> should do the trick. >>>>> >>>>> Martin >>>>> >>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >>>>> wrote: >>>>>> Sorry if this is the wrong list for this question. I tried asking it >>>>>> on comp.lang.java, but didn't get very far there. >>>>>> >>>>>> I've been wanting to expand my horizons a bit by taking one of my >>>>>> programs and rewriting it into a number of other languages. It >>>>>> started life in python, and I've recoded it into perl >>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>>>>> Next on my list is java. After that I'll probably do Haskell and >>>>>> Eiffel/Sather. >>>>>> >>>>>> So the python and perl versions were pretty easy, but I'm finding that >>>>>> the java version has a somewhat solution-resistant problem with >>>>>> non-ASCII filenames. 
>>>>>> >>>>>> The program just reads filenames from stdin (usually generated with >>>>>> the *ix find command), and then compares those files, dividing them up >>>>>> into equal groups. >>>>>> >>>>>> The problem with the java version, which manifests both with OpenJDK >>>>>> and gcj, is that the filenames being read from disk are 8 bit, and the >>>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>>>>> but as far as the java language is concerned, those filenames are made >>>>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>>>>> back to 8 bit seems to be non-information-preserving in this case, >>>>>> which isn't so fine - I can clearly see the program, in an strace, >>>>>> reading with one sequence of bytes, but then trying to open >>>>>> another-though-related sequence of bytes. To be perfectly clear: It's >>>>>> getting file not found errors. >>>>>> >>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>>>>> program to handle files with one encoding, but not another. I've >>>>>> tried a bunch of values in these variables, including ISO-8859-1, C, >>>>>> POSIX, UTF-8, and so on. >>>>>> >>>>>> Is there such a thing as a filename encoding that will map 8 bit >>>>>> filenames to 16 bit characters, but only using the low 8 bits of those >>>>>> 16, and then map back to 8 bit filenames only using those low 8 bits >>>>>> again? >>>>>> >>>>>> Is there some other way of making a Java program on Linux able to read >>>>>> filenames from stdin and later open those filenames? >>>>>> >>>>>> Thanks! >>>>>> >>>> -- >>>> Naoto Sato >>>> >> From Xueming.Shen at Sun.COM Sat Sep 13 14:57:08 2008 From: Xueming.Shen at Sun.COM (Xueming Shen) Date: Sat, 13 Sep 2008 14:57:08 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> <48C86B69.2020501@gmail.com>
Message-ID: <48CC3734.9040400@sun.com>

Martin, don't trap people into using -Dfile.encoding; always treat it as a read-only property:-)

I believe initializeEncoding(env) gets invoked before -Dxyz=abc overwrites the default one. Besides, the "jnu encoding" was introduced in 6.0, so we no longer consult file.encoding since then - I believe you "ARE" the reviewer:-)

Dan, I have a feeling that switching to an ISO8859-1 locale in your wrapper, for example

LC_ALL=en_US.ISO8859-1; export LC_ALL; java -Xmx512M -jar equivs.jar equivs.main

should work. If it does not, can you try to run

LC_ALL=en_US.ISO8859-1; export LC_ALL; java Foo

with Foo.java containing:

System.out.println("sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"));
System.out.println("file.encoding=" + System.getProperty("file.encoding"));
System.out.println("default locale=" + java.util.Locale.getDefault());

Let us know the result?

sherman

Martin Buchholz wrote:
> On Wed, Sep 10, 2008 at 17:50, Dan Stromberg wrote:
>
>> Would you believe that I'm getting file not found errors even with
>> ISO-8859-1?
>>
>
> The software world is full of suprises.
>
> Try
> export LANG=C LC_ALL=C LC_CTYPE=C
> java ... -Dfile.encoding=ISO-8859-1 ...
>
> You could also be explicit about the
> encoding used when doing any kind of char<->byte
> conversion, e.g. reading from stdin or writing to stdout.
>
> Oh, and this is only traditional Unix systems like
> Linux and Solaris. Windows and MacOSX
> (at least should) act very differently in this area.
> > Martin > > >> (Naoto: My program doesn't know what encoding to expect - I'm afraid I >> probably have different applications writing filenames in different >> encodings on my Ubuntu system. I'd been thinking I wanted to treat >> filenames as just a sequence of bytes, and let the terminal emulator >> interpret the encoding (hopefully) correctly on output). >> >> >> >> This gives two file not found tracebacks: >> >> export LC_ALL='ISO-8859-1' >> export LC_CTYPE="$LC_ALL" >> export LANG="$LC_ALL" >> >> find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 >> -jar equivs.jar equivs.main >> >> find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 >> -jar equivs.jar equivs.main >> >> >> >> I'm reading the filenames like (please forgive the weird indentation) : >> >> try { >> >> while((line = stdin.readLine()) != null) >> { >> // System.out.println(line); >> // System.out.flush(); >> lst.add(new Sortable_file(line)); >> } >> } >> catch(java.io.IOException e) >> { >> System.err.println("**** exception " + e); >> e.printStackTrace(); } >> >> >> >> Where Sortable_file's constructor just looks like: >> >> public Sortable_file(String filename) >> { >> this.filename = filename; >> /* >> Java doesn't have a stat function without doing some fancy stuff, so we >> skip this >> optimization. It really only helps with hard links anyway. >> this.device = -1 >> this.inode = -1 >> */ >> File file = new File(this.filename); >> this.size = file.length(); >> // It bothers a little that we can't close this, but perhaps it's >> unnecessary. That'll >> // be determined in large tests. 
>> // file.close(); >> this.have_prefix = false; >> this.have_hash = false; >> } >> >> >> >> ..and the part that actually blows up looks like: >> >> private void get_prefix() >> { >> byte[] buffer = new byte[128]; >> try >> { >> // The next line is the one that gives file not found >> FileInputStream file = new FileInputStream(this.filename); >> file.read(buffer); >> // System.out.println("this.prefix.length " + this.prefix.length); >> file.close(); >> } >> catch (IOException ioe) >> { >> // System.out.println( "IO error: " + ioe ); >> ioe.printStackTrace(); >> System.exit(1); >> } >> this.prefix = new String(buffer); >> this.have_prefix = true; >> } >> >> >> >> Interestingly, it's already tried to get the file's length without an error >> when it goes to read data from the file and has trouble. >> >> I don't -think- I'm doing anything screwy in there - could it be that >> ISO-8859-1 isn't giving good round-trip conversions in practice? Would this >> be an attribute of the java runtime in question, or could it be a matter of >> the locale files on my Ubuntu system being a little off? It would seem the >> locale files would be a better explanation (or a bug in my program I'm not >> seeing!), since I get the same errors with both OpenJDK and gcj. >> >> Martin Buchholz wrote: >> >>> ISO-8859-1 guarantees round-trip conversion between bytes and chars, >>> guarateeing no loss of data, or getting apparently impossible situations >>> where the JDK gives you a list of files in a directory, but you get >>> File not found when you try to open them. >>> >>> If you want to show the file names to users, you can always take >>> your ISO-8859-1 decoded strings, turn them back into byte[], >>> and decode using UTF-8 later, if you so desired. >>> (The basic OS interfaces in the JDK are not so flexible. >>> They are hard-coded to use the one charset specified by file.encoding) >>> >>> Martin >>> >>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote: >>> >>>> Why ISO-8859-1? 
CJK filenames are guaranteed to fail in that case. I'd >>>> rather choose UTF-8, as the default encoding on recent Unix/Linux are all >>>> UTF-8 so the filenames are likely in UTF-8. >>>> >>>> Naoto >>>> >>>> Martin Buchholz wrote: >>>> >>>>> Java made the decision to use String as an abstraction >>>>> for many OS-specific objects, like filenames (or environment variables). >>>>> Most of the time this works fine, but occasionally you can notice >>>>> that the underlying OS (in the case of Unix) actually uses >>>>> arbitrary byte arrays as filenames. >>>>> >>>>> It would have been much more confusing to provide an interface >>>>> to filenames that is sometimes a sequence of char, sometimes a >>>>> sequence of byte. >>>>> >>>>> So this is unlikely to change. >>>>> >>>>> But if all you want is reliable reversible conversion, >>>>> using java -Dfile.encoding=ISO-8859-1 >>>>> should do the trick. >>>>> >>>>> Martin >>>>> >>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg >>>>> wrote: >>>>> >>>>>> Sorry if this is the wrong list for this question. I tried asking it >>>>>> on comp.lang.java, but didn't get very far there. >>>>>> >>>>>> I've been wanting to expand my horizons a bit by taking one of my >>>>>> programs and rewriting it into a number of other languages. It >>>>>> started life in python, and I've recoded it into perl >>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html). >>>>>> Next on my list is java. After that I'll probably do Haskell and >>>>>> Eiffel/Sather. >>>>>> >>>>>> So the python and perl versions were pretty easy, but I'm finding that >>>>>> the java version has a somewhat solution-resistant problem with >>>>>> non-ASCII filenames. >>>>>> >>>>>> The program just reads filenames from stdin (usually generated with >>>>>> the *ix find command), and then compares those files, dividing them up >>>>>> into equal groups. 
>>>>>> >>>>>> The problem with the java version, which manifests both with OpenJDK >>>>>> and gcj, is that the filenames being read from disk are 8 bit, and the >>>>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit, >>>>>> but as far as the java language is concerned, those filenames are made >>>>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and >>>>>> back to 8 bit seems to be non-information-preserving in this case, >>>>>> which isn't so fine - I can clearly see the program, in an strace, >>>>>> reading with one sequence of bytes, but then trying to open >>>>>> another-though-related sequence of bytes. To be perfectly clear: It's >>>>>> getting file not found errors. >>>>>> >>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the >>>>>> program to handle files with one encoding, but not another. I've >>>>>> tried a bunch of values in these variables, including ISO-8859-1, C, >>>>>> POSIX, UTF-8, and so on. >>>>>> >>>>>> Is there such a thing as a filename encoding that will map 8 bit >>>>>> filenames to 16 bit characters, but only using the low 8 bits of those >>>>>> 16, and then map back to 8 bit filenames only using those low 8 bits >>>>>> again? >>>>>> >>>>>> Is there some other way of making a Java program on Linux able to read >>>>>> filenames from stdin and later open those filenames? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> >>>> -- >>>> Naoto Sato >>>> >>>> >> From strombrg at gmail.com Sat Sep 13 18:49:33 2008 From: strombrg at gmail.com (Dan Stromberg) Date: Sat, 13 Sep 2008 18:49:33 -0700 Subject: Reading Linux filenames in a way that will map back the same on open? 
In-Reply-To: <48CC3734.9040400@sun.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com> <1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com> <48C84217.9000709@Sun.COM> <1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com> <48C86B69.2020501@gmail.com> <1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com> <48CC3734.9040400@sun.com>
Message-ID: <48CC6DAD.3030403@gmail.com>

It still errors with a file not found:

+ LC_ALL=en_US.ISO8859-1
+ export LC_ALL
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o (Argentina).flac (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at Sortable_file.get_prefix(Sortable_file.java:56)
        at Sortable_file.compareTo(Sortable_file.java:159)
        at Sortable_file.compareTo(Sortable_file.java:1)
        at java.util.Arrays.mergeSort(Arrays.java:1167)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1156)
        at java.util.Arrays.mergeSort(Arrays.java:1156)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.sort(Arrays.java:1079)
        at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1

...and the foo.java program gives:

$ LC_ALL=en_US.ISO8859-1; export LC_ALL; java foo
sun.jnu.encoding=ANSI_X3.4-1968
file.encoding=ANSI_X3.4-1968
default locale=en_US

Thanks folks.
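Dan's output is telling: sun.jnu.encoding resolved to ANSI_X3.4-1968 (plain ASCII) even though LC_ALL was set to en_US.ISO8859-1, which is what happens when setlocale() falls back to the C locale because the requested locale isn't installed. A shell sketch for checking this (exact locale names vary by distribution):

```shell
# List installed locales and look for an ISO-8859-1 variant.
locale -a | grep -i '8859' || echo "no ISO-8859 locales installed"

# See what charmap the requested locale actually resolves to;
# an uninstalled locale falls back to ANSI_X3.4-1968 (ASCII).
LC_ALL=en_US.ISO8859-1 locale charmap

# On Debian/Ubuntu, a missing locale can be generated, e.g.:
#   sudo locale-gen en_US.ISO-8859-1
```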
Xueming Shen wrote: > > Martin, don't trap people into using -Dfile.encoding, always treat it > as a read only property:-) > > I believe initializeEncoding(env) gets invoked before -Dxyz=abc > overwrites the default one, > beside the "jnu encoding" is introduced in 6.0, so we no longer look > file.encoding since, I believe > you "ARE" the reviewer:-) > > Dan, I kind of feel, switch the locale to a sio8859-1 locale in your > wrapper, for example > > LC_ALL=en_US.ISO8859-1; export LC_ALL; java -Xmx512M -jar equivs.jar > equivs.main > > should work, if it does not, can you try to run > > LC_ALL=en_US.ISO8859-1; export LC_ALL; java Foo > > with Foo.java > > System.out.println("sun.jnu.encoding=" + > System.getProperty("sun.jnu.encoding")); > System.out.println("file.encoding=" + > System.getProperty("file.encoding")); > System.out.println("default locale=" + java.util.Locale.getDefault()); > > Let us know the result? > > sherman > > > Martin Buchholz wrote: >> On Wed, Sep 10, 2008 at 17:50, Dan Stromberg wrote: >> >>> Would you believe that I'm getting file not found errors even with >>> ISO-8859-1? >>> >> >> The software world is full of suprises. >> >> Try >> export LANG=C LC_ALL=C LC_CTYPE=C >> java ... -Dfile.encoding=ISO-8859-1 ... >> >> You could also be explicit about the >> encoding used when doing any kind of char<->byte >> conversion, e.g. reading from stdin or writing to stdout. >> >> Oh, and this is only traditional Unix systems like >> Linux and Solaris. Windows and MacOSX >> (at least should) act very differently in this area. >> >> Martin >> >> >>> (Naoto: My program doesn't know what encoding to expect - I'm afraid I >>> probably have different applications writing filenames in different >>> encodings on my Ubuntu system. I'd been thinking I wanted to treat >>> filenames as just a sequence of bytes, and let the terminal emulator >>> interpret the encoding (hopefully) correctly on output). 
>>>
>>> This gives two file not found tracebacks:
>>>
>>> export LC_ALL='ISO-8859-1'
>>> export LC_CTYPE="$LC_ALL"
>>> export LANG="$LC_ALL"
>>>
>>> find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>>>
>>> find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>>>
>>> I'm reading the filenames like (please forgive the weird indentation):
>>>
>>> try {
>>>     while((line = stdin.readLine()) != null)
>>>     {
>>>         // System.out.println(line);
>>>         // System.out.flush();
>>>         lst.add(new Sortable_file(line));
>>>     }
>>> }
>>> catch(java.io.IOException e)
>>> {
>>>     System.err.println("**** exception " + e);
>>>     e.printStackTrace();
>>> }
>>>
>>> Where Sortable_file's constructor just looks like:
>>>
>>> public Sortable_file(String filename)
>>> {
>>>     this.filename = filename;
>>>     /*
>>>        Java doesn't have a stat function without doing some fancy stuff,
>>>        so we skip this optimization. It really only helps with hard
>>>        links anyway.
>>>        this.device = -1
>>>        this.inode = -1
>>>     */
>>>     File file = new File(this.filename);
>>>     this.size = file.length();
>>>     // It bothers a little that we can't close this, but perhaps it's
>>>     // unnecessary. That'll be determined in large tests.
>>>     // file.close();
>>>     this.have_prefix = false;
>>>     this.have_hash = false;
>>> }
>>>
>>> ...and the part that actually blows up looks like:
>>>
>>> private void get_prefix()
>>> {
>>>     byte[] buffer = new byte[128];
>>>     try
>>>     {
>>>         // The next line is the one that gives file not found
>>>         FileInputStream file = new FileInputStream(this.filename);
>>>         file.read(buffer);
>>>         // System.out.println("this.prefix.length " + this.prefix.length);
>>>         file.close();
>>>     }
>>>     catch (IOException ioe)
>>>     {
>>>         // System.out.println( "IO error: " + ioe );
>>>         ioe.printStackTrace();
>>>         System.exit(1);
>>>     }
>>>     this.prefix = new String(buffer);
>>>     this.have_prefix = true;
>>> }
>>>
>>> Interestingly, it has already gotten the file's length without an error
>>> by the time it goes to read data from the file and hits trouble.
>>>
>>> I don't -think- I'm doing anything screwy in there - could it be that
>>> ISO-8859-1 isn't giving good round-trip conversions in practice? Would
>>> this be an attribute of the java runtime in question, or could it be a
>>> matter of the locale files on my Ubuntu system being a little off? It
>>> would seem the locale files would be a better explanation (or a bug in
>>> my program I'm not seeing!), since I get the same errors with both
>>> OpenJDK and gcj.
>>>
>>> Martin Buchholz wrote:
>>>
>>>> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
>>>> guaranteeing no loss of data, and no apparently impossible situations
>>>> where the JDK gives you a list of files in a directory, but you get
>>>> File not found when you try to open them.
>>>>
>>>> If you want to show the file names to users, you can always take
>>>> your ISO-8859-1 decoded strings, turn them back into byte[],
>>>> and decode using UTF-8 later, if you so desired.
>>>> (The basic OS interfaces in the JDK are not so flexible.
>>>> They are hard-coded to use the one charset specified by file.encoding)
>>>>
>>>> Martin
>>>>
>>>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato wrote:
>>>>
>>>>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case.
>>>>> I'd rather choose UTF-8, as the default encoding on recent Unix/Linux
>>>>> is UTF-8, so the filenames are likely in UTF-8.
>>>>>
>>>>> Naoto
>>>>>
>>>>> Martin Buchholz wrote:
>>>>>
>>>>>> [snip -- Martin's first reply, and Dan's original message below it,
>>>>>> are quoted in full at the start of the thread]
>>>>>
>>>>> --
>>>>> Naoto Sato

From Xueming.Shen at Sun.COM Sat Sep 13 20:39:31 2008
From: Xueming.Shen at Sun.COM (Xueming Shen)
Date: Sat, 13 Sep 2008 20:39:31 -0700
Subject: Reading Linux filenames in a way that will map back the same on open?
In-Reply-To: <48CC6DAD.3030403@gmail.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com>
	<1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com>
	<48C84217.9000709@Sun.COM>
	<1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com>
	<48C86B69.2020501@gmail.com>
	<1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com>
	<48CC3734.9040400@sun.com>
	<48CC6DAD.3030403@gmail.com>
Message-ID: <48CC8773.2050704@sun.com>

Obviously your locale setting is not being "exported"... what "shell" are
you using?

You can try to set your locale to en_US.ISO8859-1 explicitly at the command
line first, type in "locale" to confirm that your locale is set correctly to
en_US.ISO8859-1, then run the "find + java" pipeline to see if that FNF error
disappears. If not, run the java Foo again and tell us the result:-)

One possibility is that you don't have an ISO8859-1 locale installed at all?

Sherman

Dan Stromberg wrote:
>
> It still errors with a file not found:
>
> + LC_ALL=en_US.ISO8859-1
> + export LC_ALL
> + find /home/dstromberg/Sound/Music -type f -print
> + java -Xmx512M -jar equivs.jar equivs.main
> java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various
> Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o
> (Argentina).flac (No such file or directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:106)
>         at java.io.FileInputStream.<init>(FileInputStream.java:66)
>         at Sortable_file.get_prefix(Sortable_file.java:56)
>         at Sortable_file.compareTo(Sortable_file.java:159)
>         at Sortable_file.compareTo(Sortable_file.java:1)
>         at java.util.Arrays.mergeSort(Arrays.java:1167)
>         at java.util.Arrays.mergeSort(Arrays.java:1155)
>         at java.util.Arrays.mergeSort(Arrays.java:1155)
>         at java.util.Arrays.mergeSort(Arrays.java:1156)
>         at java.util.Arrays.mergeSort(Arrays.java:1156)
>         at java.util.Arrays.mergeSort(Arrays.java:1155)
>         at java.util.Arrays.mergeSort(Arrays.java:1155)
>         at
java.util.Arrays.mergeSort(Arrays.java:1155)
>
> [snip -- the rest of the stack trace, and the earlier messages quoted
> below it, appear in full earlier in this thread]
>>>>>> --
>>>>>> Naoto Sato

From strombrg at gmail.com Sat Sep 13 22:41:29 2008
From: strombrg at gmail.com (Dan Stromberg)
Date: Sat, 13 Sep 2008 22:41:29 -0700
Subject: Reading Linux filenames in a way that will map back the same on open?
In-Reply-To: <48CC8773.2050704@sun.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com>
	<1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com>
	<48C84217.9000709@Sun.COM>
	<1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com>
	<48C86B69.2020501@gmail.com>
	<1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com>
	<48CC3734.9040400@sun.com>
	<48CC6DAD.3030403@gmail.com>
	<48CC8773.2050704@sun.com>
Message-ID: <48CCA409.1030903@gmail.com>

Xueming Shen wrote:
> Obviously your locale setting is not being "exported"...what "shell" are
> you using?

It's bash. I'm pretty sure it's exported, because env sees it, and env
isn't a shell builtin in bash (at least not yet :).

> You can try to set your locale to en_US.ISO8859-1 explicitly at command
> line first, type in "locale" to confirm that your locale is being set
> correctly to en_US.ISO8859-1,

Good clue:

$ export LC_ALL=en_US.ISO8859-1
dstromberg-desktop-dstromberg:~/src/equivs-j i486-pc-linux-gnu 11433 - above cmd done 2008 Sat Sep 13 10:13 PM
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LC_CTYPE="en_US.ISO8859-1"
LC_NUMERIC="en_US.ISO8859-1"
LC_TIME="en_US.ISO8859-1"
LC_COLLATE="en_US.ISO8859-1"
LC_MONETARY="en_US.ISO8859-1"
LC_MESSAGES="en_US.ISO8859-1"
LC_PAPER="en_US.ISO8859-1"
LC_NAME="en_US.ISO8859-1"
LC_ADDRESS="en_US.ISO8859-1"
LC_TELEPHONE="en_US.ISO8859-1"
LC_MEASUREMENT="en_US.ISO8859-1"
LC_IDENTIFICATION="en_US.ISO8859-1"
LC_ALL=en_US.ISO8859-1

It turned out I didn't have en_US.ISO-8859-1
configured on my system. So I used this URL to get it set up:

http://ubuntuforums.org/showthread.php?t=423039

I didn't make it my default locale; I just made it a supported locale.

And now my program appears to work great, even with non-English
filenames - thanks folks!

From martinrb at google.com Sun Sep 14 19:34:36 2008
From: martinrb at google.com (Martin Buchholz)
Date: Sun, 14 Sep 2008 19:34:36 -0700
Subject: Reading Linux filenames in a way that will map back the same on open?
In-Reply-To: <48CC3734.9040400@sun.com>
References: <33c5a6c30809091739j41fad37bybfc92708fa3d0cca@mail.gmail.com>
	<1ccfd1c10809092058l427e0cd3kb73611878026e6eb@mail.gmail.com>
	<48C84217.9000709@Sun.COM>
	<1ccfd1c10809101514g545d2896lb24ed634eca089b5@mail.gmail.com>
	<48C86B69.2020501@gmail.com>
	<1ccfd1c10809121249t3e8c4459g7e01ec2c2aa5b2a7@mail.gmail.com>
	<48CC3734.9040400@sun.com>
Message-ID: <1ccfd1c10809141934i3031ef80ld8bbc6b1cf7f35f6@mail.gmail.com>

On Sat, Sep 13, 2008 at 14:57, Xueming Shen wrote:
>
> Martin, don't trap people into using -Dfile.encoding, always treat it as a
> read only property:-)
>
> I believe initializeEncoding(env) gets invoked before -Dxyz=abc overwrites
> the default one; besides, the "jnu encoding" was introduced in 6.0, so we
> no longer look at file.encoding since then. I believe you "ARE" the
> reviewer:-)

Oh dear. Sorry for the misinformation.

Summary:
- Newer Linux systems ship without any ISO-8859-1 locale
  (you can add one using the command "sudo locale-gen en_US").
- Specifying "C" or "POSIX" for LANG or LC_ALL will cause Java to use
  "ASCII", not "ISO-8859-1", as the default charset, which is likely
  not what you want :-(
- The JDK uses two different system properties, sun.jnu.encoding and
  file.encoding, for default charset use. This is very confusing,
  undocumented, and non-standardized. Setting these as system properties
  on the command line appears to be unsupported. Is that right?

I'm pretty unhappy about the situation.

Martin
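[Martin's round-trip claim rests on a simple property: ISO-8859-1 maps each of the 256 byte values to the Unicode code point with the same numeric value, so byte -> String -> byte is lossless for arbitrary filename bytes. A quick check of that property over every possible byte value (a sketch; on Java 7+ one could use StandardCharsets.ISO_8859_1 instead of Charset.forName):]

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");

        // Every possible byte value, 0x00..0xFF.
        byte[] allBytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            allBytes[i] = (byte) i;
        }

        // Decode to chars and encode back: ISO-8859-1 maps byte value n
        // to the char U+00n and back, so no information can be lost.
        String asString = new String(allBytes, latin1);
        byte[] roundTripped = asString.getBytes(latin1);

        System.out.println(Arrays.equals(allBytes, roundTripped)); // true
    }
}
```

This is exactly why -Dfile.encoding=ISO-8859-1 (or an ISO8859-1 locale) "should do the trick" for reversibility, while the ASCII default cannot, and why Naoto's objection also holds: the round trip preserves bytes, but the chars a user sees are only correct if the filenames really were Latin-1.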