JDK 1.8.0 33/40, diacritics and file problems

Mon Apr 27 13:13:46 UTC 2015

Heh. Welcome to Unicode :(

Unfortunately, many file systems, especially in the UNIX world, do not
precisely specify how file names are to be encoded. Some of them treat file
names as opaque null-terminated byte arrays.

Thus this may not be a bug in Java so much as a design problem/oversight
with the operating systems themselves.

Note that the issue you're running into is *not* to do with encodings.
It's not a UTF-8 vs UTF-16 type issue. Rather, the issue is that Unicode
allows visually identical strings to be represented differently at the
logical layer, using different sequences of code points.
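
To make that concrete, here is a small sketch using java.text.Normalizer
from the standard library. The class and method names are made up for
illustration, and "é" is just an example character, not necessarily the one
in your file names. It shows the same visible letter stored as one code
point and as two, and that the two only compare equal after normalisation.

```java
import java.text.Normalizer;

// Hypothetical demo class; not taken from the original report.
final class NormalizationDemo {

    // Compares two strings after bringing both to the same form (NFC).
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // e-acute as a single code point
        String decomposed = "e\u0301";  // 'e' followed by combining acute accent

        System.out.println(composed.length());           // 1
        System.out.println(decomposed.length());         // 2
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(sameText(composed, decomposed)); // true
    }
}
```

Both strings render identically, but a plain equals() (or a byte-for-byte
file name lookup) treats them as different.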

You didn't say what app originally saved the files. However, the exact
sequence of code points you get on disk for a given piece of human-readable
text can depend on things as varied as which input method editor the user
typed the file name with, precisely which combination of keys they pressed
and when, which libraries the app used, and so on.

Yes, it's a mess.

If you encounter such situations frequently, then your best bet may be to
simply write a little wrapper that tries different Unicode normalisation
forms of the name until it finds one that works.
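
A rough sketch of such a wrapper is below. The class and method names are
hypothetical, and it makes a few assumptions: it only normalises the final
name component rather than every directory in the path, and it only tries
the name as given, then NFC, then NFD. Whether that is enough depends on
what actually wrote your files.

```java
import java.io.File;
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; a sketch, not a complete solution.
final class NormalizedFileLookup {

    // Returns the distinct spellings to try, in order: the name as given,
    // then its NFC form, then its NFD form. Duplicates are dropped.
    static Set<String> candidates(String name) {
        Set<String> out = new LinkedHashSet<>();
        out.add(name);
        out.add(Normalizer.normalize(name, Normalizer.Form.NFC));
        out.add(Normalizer.normalize(name, Normalizer.Form.NFD));
        return out;
    }

    // Returns the first candidate that exists in dir, or null if none does.
    static File resolve(File dir, String name) {
        for (String candidate : candidates(name)) {
            File f = new File(dir, candidate);
            if (f.exists()) {
                return f;
            }
        }
        return null;
    }
}
```

You would then call something like NormalizedFileLookup.resolve(dir, name)
wherever you currently do new File(dir, name). If none of the forms match,
listing the directory and comparing normalised names is another option.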