JDK 1.8.0_66, diacritics and file problems, follow up
Xueming Shen
xueming.shen at oracle.com
Thu May 12 16:52:37 UTC 2016
On 5/12/16 8:54 AM, Fabrizio Giudici wrote:
> On Thu, 12 May 2016 14:41:04 +0200, Alan Bateman
> <Alan.Bateman at oracle.com> wrote:
>
>
>> I assume this is a decoding and encoding round trip issue, meaning bytes
>> -> String -> bytes. When you use the new file system API then you are
>> using a Path which will use the underlying representation to access
>> the file. Once you call toString or switch to java.io.File the bytes
>> are decoded, and then re-encoded when you access the file via
>> File::exists. Yes, all very subtle.
>
> It's what I presumed. In any case, (unless there's something very
> wrong in my environment: and I can't exclude it yet) I'd call it a
> bug, since the File API is not deprecated and shouldn't behave in a
> different way than NIO.
>
>> So can you summarize the environment? You mention HFS+ and there is a
>> mention of NFD normalization. There is also a mention of a
>> Raspberry Pi. So are the files transferred between the systems, or is
>> there SAMBA or other network access in the picture?
>
> The environment is complex, because of a large number of variants. I'm
> trying to create the simplest possible test case, but there are a few
> things that I don't understand well.
>
> In any case: the files are .mp3 imported from CDs by iTunes on Mac OS
> X El Capitan (hence the numerous non ASCII characters). Then, the
> files are rsynced to Linux. Here the first complication. rsync (not
> the one distributed with El Capitan, but a version 3.x separately
> installed) has got an --iconv option that can be used, AFAIU, to take
> "a certain" care of the issue (I'm writing "a certain" because there
> are still things that I don't understand). For instance, it seems that
> with BTRFS it makes a difference, eliminating any problem. I have to
> double check.
>
It sounds more like an HFS+/NFD file path issue (file names on HFS+ are in
Unicode NFD form). Maybe the NFD-ed file name gets messed up during the
utf8-xxx-utf8 conversion. Does that --iconv option have a specific encoding
name specified?
"Everything" is supposed to be taken care of by #7130915 :-) but something
might have been missed.
-sherman
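
To make the NFD point concrete, here is a minimal sketch (not taken from the
JDK or from rsync) of how the same visible name has different bytes in NFC
and NFD, and how a conversion through an intermediate charset can damage only
the NFD form. The class name NfdRoundTrip, its helper methods, and the choice
of ISO-8859-1 as the "xxx" charset are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NfdRoundTrip {
    // Hex dump of the UTF-8 encoding of s, bytes separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Simulates a utf8 -> xxx -> utf8 conversion: encode to the intermediate
    // charset, then decode again. Unmappable characters become '?'.
    static String viaCharset(String s, Charset intermediate) {
        return new String(s.getBytes(intermediate), intermediate);
    }

    public static void main(String[] args) {
        String nfc = "\u00e9"; // precomposed e-acute
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD); // 'e' + U+0301

        System.out.println(utf8Hex(nfc)); // c3 a9
        System.out.println(utf8Hex(nfd)); // 65 cc 81

        // The precomposed form survives ISO-8859-1, the decomposed one does not.
        Charset xxx = StandardCharsets.ISO_8859_1; // assumed intermediate charset
        System.out.println(viaCharset(nfc, xxx).equals(nfc)); // true
        System.out.println(utf8Hex(viaCharset(nfd, xxx)));    // 65 3f
    }
}
```

Whether rsync's --iconv actually behaves like this depends on the encoding
names passed to it, which is why the question above matters.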
> Question: do you have a suggestion on how to programmatically
> re-create the files with their problematic names without passing
> through that complex setup? I thought about tarring the files, but I'm
> not sure that tar doesn't play in the same field. At the moment I'm
> working like that:
>
> 1. looking at the encoding of the names with something such as ls | od -c
> 2. creating names with that specific encoding by using something like
> System.exec("/bin/touch " + new String(array of bytes as per the
> result of the previous step)).
>
> Does it make sense? If it does, I could share soon a simple Java,
> self-contained test case.
>
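
One possible shape for such a self-contained test case, as a rough sketch
only: it builds the name from explicit UTF-8 bytes (here "é.mp3" in NFD form,
a made-up example rather than one of the real files), creates the file in a
temporary directory, and compares what java.nio and java.io report for each
directory entry. The class name NfdFileNameCheck and its methods are
hypothetical. The result depends on the platform, the file system and the
locale the JVM runs in, so no particular output is claimed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

public class NfdFileNameCheck {
    // Example bytes as they might be read off `ls | od -c`: 'e', U+0301, ".mp3".
    static final byte[] NAME_BYTES = {
        'e', (byte) 0xcc, (byte) 0x81, '.', 'm', 'p', '3'
    };

    // Decode with an explicit charset instead of the platform default.
    static String decodeName(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Returns true if Files.exists and File.exists agree for every entry in dir.
    static boolean apisAgree(Path dir) throws IOException {
        boolean agree = true;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                boolean nio = Files.exists(p);
                // Going through toString() forces the decode/encode round trip.
                boolean io = new File(p.toString()).exists();
                System.out.println(p.getFileName() + ": nio=" + nio + " io=" + io);
                agree &= (nio == io);
            }
        }
        return agree;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nfd-test");
        try {
            Files.createFile(dir.resolve(decodeName(NAME_BYTES)));
        } catch (InvalidPathException e) {
            // The JVM's file-name encoding cannot represent the name.
            System.out.println("Cannot create name: " + e.getMessage());
        }
        System.out.println("APIs agree: " + apisAgree(dir));
    }
}
```

Note that creating the file from Java sends the name through the same
String-to-bytes encoding that is under suspicion, and that HFS+ normalizes
names itself, so a file created this way is not guaranteed to carry exactly
the bytes of the rsynced originals.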