<i18n dev> Is(n't) this a Java Unicode compiler bug? [4=OSCON]

Tue Jul 19 10:31:23 PDT 2011

  Tom,

JLS 3.8 [1], "Identifiers", states:

"Two identifiers are the same only if they are identical, that is, have
the same Unicode character for each letter or digit.

Identifiers that have the same external appearance may yet be different.
For example, the identifiers consisting of the single letters LATIN
CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK
CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430)
and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all
different.

Unicode composite characters are different from the decomposed
characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1)
could be considered to be the same as a LATIN CAPITAL LETTER A
(A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301)
when sorting, but these are different in identifiers."

We happened to have a short discussion about this section a couple of
days ago (Alex is working on the latest JLS, and we were discussing the
possibility of rewording this section a little). So, at least for now,
those are NOT duplicate identifiers as far as the Java Language
Specification is concerned. It might be an implementation issue,
though, if the file system works in a different way, as you suggested.
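
The distinction drawn in the JLS passage quoted above can be checked
from Java itself. What follows is only a minimal sketch, not anything
taken from the JLS or from javac: the class name IdentifierForms and
both helper methods are made up for illustration, and the identifier
check deliberately ignores keywords and literals. It uses
java.text.Normalizer (available since Java 6) to show that the
precomposed and decomposed spellings of A-acute are different strings,
that each is acceptable as an identifier character by character, and
that they compare equal only after both are normalized to NFC.

```java
import java.text.Normalizer;

class IdentifierForms {
    // Hypothetical helper: true if every code point is legal at its
    // position in a Java identifier (keywords are not considered).
    static boolean isValidIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Hypothetical helper: canonical equivalence tested by normalizing
    // both strings to NFC before comparing them.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00c1";    // LATIN CAPITAL LETTER A ACUTE
        String decomposed = "A\u0301"; // A followed by the combining acute

        System.out.println("equal as written: " + composed.equals(decomposed));
        System.out.println("composed is an identifier: " + isValidIdentifier(composed));
        System.out.println("decomposed is an identifier: " + isValidIdentifier(decomposed));
        System.out.println("equal after NFC: " + sameAfterNfc(composed, decomposed));
    }
}
```

The first line printed reports false and the other three report true,
which matches the JLS wording: the two spellings are both legal, and
they are different identifiers unless something normalizes them first.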

-Sherman

[1] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8

On 7/19/2011 7:52 AM, Tom Christiansen wrote:
> In preparation for a short talk I'm to give in Portland next week about how
> different operating systems, filesystems, and languages (including but not
> limited to regexes) handle Unicode, I got to thinking about normalization
> issues.  And I think I've found a Java compiler bug, or at best, an
> infelicity in a grey area.
>
> It's no consolation, but Perl has exactly the same problem (well, pair of
> problems) as Java has here.  We do the same thing as Java, which I think is
> the Wrong Thing, and we are also at the mercy of our filesystem for mapping
> of classnames to filesystem objects, which is even worse.
>
> I would like someone to tell me why Java shouldn't be fixed to cope with
> these matters, both as internal identifiers and as those that exist outside
> Java proper, in the filesystem (classnames).
>
> After reading about the differences between how Apple and
> Sun did normalization in the filesystem:
>
>      http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf
>
> I wondered what impact this might/must have on Java.  After all,
> classnames must map to filesystem entries, and therefore if the system
> is doing any kind of normalization, you're going to have Issues.  Apple
> runs everything through NFD (well, nearly), whereas the Sun paper cited
> from about five years ago says that they plan to do something analogous
> to how "case-preserving but case-insensitive" filesystems behave: that is,
> they'll let you create anything you want, but won't let you create a new
> entry in the same directory if they are canonically equivalent.
>
> Before I went so far as to test this on Apple and Sun machines, let
> alone others, I thought I would just try my test on local variables
> instead.  I have now tested it on Sun, Apple, and Linux, including
> various versions of the compiler, and they all report the same thing.
> And the thing they report, I feel, is wrong, because I know that it
> will not work this way for class names the way it will for local variables.
>
> I will include the source code twice, once as plain text so you can
> read it, and once as an octet stream lest a "helpful" mailer decides
> it should be normalizing things that pass through it, an evil that the
> Apple mouse will do to you believe it or not.
>
> If you put this wicked file in a file called "nftest.java" and run
> this command:
>
>      $ javac -encoding UTF-8 nftest.java && java nftest
>
> Then you will get this output:
>
>      élève is 1.
>      élève is 2.
>      élève is 3.
>      élève is 4.
>
> Those probably look the same.  Running them through `uniquote -x` shows
> though that they are not:
>
>      \x{E9}l\x{E8}ve is 1.
>      e\x{301}le\x{300}ve is 2.
>      \x{E9}le\x{300}ve is 3.
>      e\x{301}l\x{E8}ve is 4.
>
> See the difference?  Those are variable names, and I do not think Java
> should permit duplicate variable names that differ only in normalization,
> since it obviously cannot be permitted to do so for classnames, and it
> feels hackish to have different identifier rules for classnames than for
> other identifiers.
>
> Is this a bug?  If so, are there plans to address it?
> And what about the filesystem?
>
> I am unaware of any document in The Unicode Standard that references
> either or both of these issues; if any such exist, kindly point me at
> them.  My hunch is that these two problems, even though they are
> completely consequential to Unicode, exist beyond the proper purview
> of The Unicode Standard itself.  But that doesn't absolve us from
> solving them.
>
> Has this been previously discussed, and if so, what if any decision
> was made regarding these two interrelated problems?
>
> Thank you very much.
>
> --tom
>
>      PS: The MIME contents of this message are as follows:
>
>   msg part  type/subtype              size description
>     1       multipart/mixed           8904
>       1     text/plain                4071 a letter from tchrist
>       2     application/octet-stream  1560 the nftest(-v1).java program as octets
>               name="nftest-v1.java"
>               filename="nftest-v1.java"
>       3     text/plain                1560 the nftest(-v2).java program as plain text
>
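
The attached nftest.java is not reproduced in this archive, so the
following is only a guess at its shape, based on the output shown in the
quoted message. The class name NfSketch and the method fourLocals are
invented for illustration. The one deliberate change is that the four
spellings are written with Unicode escapes instead of raw characters.
Java translates such escapes before it tokenizes the source, so they are
legal inside identifiers, and no mailer or editor can quietly normalize
them on the way through.

```java
class NfSketch {
    // Declares four local variables whose names render identically but
    // differ in how the two accented letters are encoded.
    static int[] fourLocals() {
        int \u00e9l\u00e8ve = 1;     // both accents precomposed
        int e\u0301le\u0300ve = 2;   // both accents as combining marks
        int \u00e9le\u0300ve = 3;    // first precomposed, second combining
        int e\u0301l\u00e8ve = 4;    // first combining, second precomposed

        return new int[] {
            \u00e9l\u00e8ve, e\u0301le\u0300ve, \u00e9le\u0300ve, e\u0301l\u00e8ve
        };
    }

    public static void main(String[] args) {
        // The label is always printed in precomposed form here; only the
        // value shows which of the four variables was read.
        for (int value : fourLocals()) {
            System.out.println("\u00e9l\u00e8ve is " + value + ".");
        }
    }
}
```

If javac treated canonically equivalent names as the same identifier,
the second declaration would be rejected as a duplicate. Because the
JLS compares identifiers character by character, all four declarations
are accepted and the values 1 through 4 come back in order.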
