JDK 1.8.0_66, diacritics and file problems, follow up

Fabrizio Giudici Fabrizio.Giudici at tidalwave.it
Thu May 12 15:54:35 UTC 2016


On Thu, 12 May 2016 14:41:04 +0200, Alan Bateman <Alan.Bateman at oracle.com>  
wrote:


> I assume it is a decoding and encoding round trip issue, meaning bytes ->
> String -> bytes. When you use the new file system API you are using
> a Path, which uses the underlying representation to access the file.
> Once you call toString or switch to java.io.File the bytes are decoded,
> and then re-encoded when you access the file via File::exists. Yes, all
> very subtle.

It's what I presumed. In any case (unless there's something very wrong in
my environment, which I can't exclude yet), I'd call it a bug, since the
File API is not deprecated and shouldn't behave differently from NIO.
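
Just to make that round trip concrete, this is the kind of thing I mean
(the file-name bytes below are invented, simply a name containing a byte
that is not valid UTF-8):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        // Hypothetical file-name bytes: "Beyonce" ending with an
        // ISO-8859-1 e-acute (0xE9), i.e. not valid UTF-8.
        byte[] original = {'B', 'e', 'y', 'o', 'n', 'c', (byte) 0xE9};

        // bytes -> String: the invalid sequence is replaced with U+FFFD.
        String decoded = new String(original, StandardCharsets.UTF_8);

        // String -> bytes: U+FFFD is encoded as EF BF BD, so the round
        // trip does not give back the original bytes.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);

        System.out.println("original:   " + Arrays.toString(original));
        System.out.println("re-encoded: " + Arrays.toString(reencoded));
        System.out.println("preserved?  " + Arrays.equals(original, reencoded));
    }
}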

> So can you summarize the environment? You mention HFS+ and there is a
> mention of NFD normalization. There is also a mention of a
> Raspberry Pi. So are the files transferred between the systems, or is
> there SAMBA or other network access in the picture?

The environment is complex, because of a large number of variants. I'm
trying to create the simplest possible test case, but there are a few
things that I don't understand well yet.

In any case: the files are .mp3s imported from CDs by iTunes on Mac OS X El
Capitan (hence the numerous non-ASCII characters). Then the files are
rsynced to Linux. Here is the first complication: rsync (not the one
distributed with El Capitan, but a version 3.x installed separately) has
an --iconv option that, AFAIU, can be used to take "a certain" care of
the issue (I'm writing "a certain" because there are still things I don't
understand). For instance, it seems that with BTRFS it makes a
difference, eliminating the problem entirely. I have to double-check.
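
For the record, what I'm experimenting with on the Mac side is something
along these lines (host and destination path are just placeholders, and the
charset names are what I understand from the rsync man page, still to be
verified):

  rsync -av --iconv=utf-8-mac,utf-8 Music/ pi@raspberrypi:/data/music/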

Question: do you have a suggestion on how to programmatically re-create
the files with their problematic names without going through that
complex setup? I thought about tarring the files, but I'm not sure that
tar doesn't suffer from the same problem. At the moment I'm working like
this:

1. looking at the byte encoding of the names with something such as ls | od -c
2. creating names with that specific encoding by exec'ing something like
/bin/touch with a String built from those bytes, e.g. via
Runtime.getRuntime().exec() or a ProcessBuilder (see the sketch further
below).

Does it make sense? If it does, I could soon share a simple,
self-contained Java test case.
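
To give an idea, something like this (the name with the decomposed "é" and
the temp directory are just stand-ins for the real files coming from
iTunes; /bin/touch is exec'd as in step 2 above):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

public class ProblematicNames {
    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("diacritics-test");

        // "é" in decomposed (NFD) form: 'e' followed by COMBINING ACUTE
        // ACCENT, the way HFS+ stores it; the name itself is a stand-in.
        String nfdName = "Beyonce\u0301.mp3";
        Path viaNio = dir.resolve(nfdName);

        // Create the file by exec'ing /bin/touch, as in step 2 above. Note
        // that the argument String is re-encoded with the platform charset
        // when the process is launched, which is itself part of the round
        // trip being discussed.
        new ProcessBuilder("/bin/touch", viaNio.toString())
                .inheritIO().start().waitFor();

        // Compare what the two APIs report for the same name.
        System.out.println("Files.exists(path)        : " + Files.exists(viaNio));
        System.out.println("new File(...).exists()    : "
                + new File(viaNio.toString()).exists());

        // And for the precomposed (NFC) spelling of the visually same name.
        Path nfcVariant = dir.resolve("Beyonc\u00E9.mp3");
        System.out.println("NFC variant, Files.exists : " + Files.exists(nfcVariant));
    }
}

The point is only to create the name through the same kind of round trip
the real files go through and then compare what the two APIs say; if that
reproduces the mismatch, the whole rsync machinery shouldn't be needed in
the test case at all.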

-- 
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - fabrizio.giudici at tidalwave.it


