<i18n dev> Is(n't) this a Java Unicode compiler bug? [4=OSCON]

Tue Jul 19 07:52:52 PDT 2011

In preparation for a short talk I'm to give in Portland next week about how
different operating systems, filesystems, and languages (including but not
limited to regexes) handle Unicode, I got to thinking about normalization
issues.  And I think I've found a Java compiler bug, or at best, an
infelicity in a grey area.  

It's no consolation, but Perl has exactly the same problem (well, pair of
problems) as Java has here.  We do the same thing as Java, which I think is
the Wrong Thing, and we are also at the mercy of our filesystem for mapping
of classnames to filesystem objects, which is even worse.

I would like someone to tell me why Java shouldn't be fixed to cope with
these matters, both as internal identifiers and as those that exist outside
Java proper, in the filesystem (classnames).

After reading about the differences between how Apple and
Sun did normalization in the filesystem:

    http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf

I wondered what impact this might/must have on Java.  After all,
classnames must map to filesystem entries, and therefore if the system
is doing any kind of normalization, you're going to have Issues.  Apple
runs everything through NFD (well, nearly), whereas the Sun paper cited
from about five years ago says that they plan to do something analogous
to how "case-preserving but case-insensitive" filesystems behave: that is,
they'll let you create anything you want, but won't let you create a new
entry in a directory that already holds a canonically equivalent one.
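
For concreteness, "canonically equivalent" can be checked from Java itself.
What follows is only a minimal sketch using java.text.Normalizer from the
standard library. The class name CanonicalNames and its helper method are
made up for this illustration, and nothing here claims to reproduce what
either vendor's filesystem does internally.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration; not part of the test program.
final class CanonicalNames {

    /** True when a and b are canonically equivalent, i.e. identical after NFD. */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9l\u00E8ve";     // NFC spelling, 5 chars
        String decomposed  = "e\u0301le\u0300ve";   // NFD spelling, 7 chars

        System.out.println(precomposed.equals(decomposed));                  // false
        System.out.println(canonicallyEquivalent(precomposed, decomposed));  // true
        System.out.println(precomposed.length() + " vs " + decomposed.length()); // 5 vs 7
    }
}
```

A filesystem of the second kind would, in effect, have to ask the second
question rather than the first before creating a new entry.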

Before I went so far as to test this on Apple and Sun machines, let
alone others, I thought I would just try my test on local variables
instead.  I have now tested it on Sun, Apple, and Linux, including 
various versions of the compiler, and they all report the same thing.
And the thing they report, I feel, is wrong, because I know that it
will not work this way for class names the way it will for local variables.
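
The reason class names differ from locals is that each one has to become a
file name. Here is a rough sketch of what that means for the four spellings
used in the test. It assumes the usual javac convention that a top-level
class Foo in the default package is written to Foo.class, and it assumes
names are stored as UTF-8. The class ClassFileNames and its methods are
hypothetical helpers, not anything from the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ClassFileNames {

    /** Assumed convention: top-level class Foo is compiled to Foo.class. */
    static String classFileName(String simpleName) {
        return simpleName + ".class";
    }

    /** Number of bytes the name occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] spellings = {
            "\u00E9l\u00E8ve",     // NFC
            "e\u0301le\u0300ve",   // NFD
            "\u00E9le\u0300ve",    // mixed
            "e\u0301l\u00E8ve",    // mixed
        };
        for (String s : spellings) {
            String file = classFileName(s);
            System.out.println(file.length() + " chars, " + utf8Length(file) + " bytes");
        }
    }
}
```

The four file names come out at 13, 15, 14, and 14 bytes respectively, so
a filesystem that compares names byte for byte would treat them as four
different files, while one that normalizes first would treat them as one.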

I will include the source code twice, once as plain text so you can 
read it, and once as an octet stream lest a "helpful" mailer decide
it should be normalizing things that pass through it, an evil that the
Apple mouse will do to you, believe it or not.

If you put this wicked file in a file called "nftest.java" and run 
this command:

    $ javac -encoding UTF-8 nftest.java && java nftest

Then you will get this output:

    élève is 1.
    élève is 2.
    élève is 3.
    élève is 4.

Those probably look the same.  Running them through `uniquote -x` shows
though that they are not:

    \x{E9}l\x{E8}ve is 1.
    e\x{301}le\x{300}ve is 2.
    \x{E9}le\x{300}ve is 3.
    e\x{301}l\x{E8}ve is 4.

See the difference?  Those are variable names, and I do not think Java
should permit duplicate variable names that differ only in normalization,
since it obviously cannot be permitted to do so for classnames, and it
feels hackish to have different identifier rules for classnames than for
other identifiers.
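
To show the kind of check I have in mind, here is a small sketch that takes
a list of identifier spellings, groups them by their NFC form, and reports
any group with more than one spelling. It is a lint-style illustration
only: the class IdentifierCollisions and both of its methods are names I
invented, the choice of NFC as the comparison form is an assumption, and
the hexQuote helper is merely a rough imitation of the \x{...} notation
shown above, not the uniquote tool.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical lint-style helper for illustration only.
final class IdentifierCollisions {

    /** Escapes non-ASCII code points as \x{HEX} so spellings can be told apart. */
    static String hexQuote(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("\\x{%X}", cp));
            }
        });
        return out.toString();
    }

    /** Maps each NFC form to its distinct spellings, keeping only real clashes. */
    static Map<String, List<String>> collisions(List<String> names) {
        Map<String, List<String>> byNfc = new LinkedHashMap<>();
        for (String name : names) {
            String key = Normalizer.normalize(name, Normalizer.Form.NFC);
            List<String> group = byNfc.computeIfAbsent(key, k -> new ArrayList<>());
            if (!group.contains(name)) {
                group.add(name);
            }
        }
        // Drop identifiers that have only one spelling.
        byNfc.values().removeIf(group -> group.size() < 2);
        return byNfc;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "\u00E9l\u00E8ve", "e\u0301le\u0300ve",
            "\u00E9le\u0300ve", "e\u0301l\u00E8ve", "stdout");
        collisions(names).forEach((nfc, spellings) -> {
            System.out.println(hexQuote(nfc) + " has " + spellings.size() + " spellings:");
            spellings.forEach(s -> System.out.println("    " + hexQuote(s)));
        });
    }
}
```

Fed the four variable names from the test plus one plain ASCII name, this
reports a single clash with four spellings and says nothing about the
ASCII name.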

Is this a bug?  If so, are there plans to address it?
And what about the filesystem?

I am unaware of any document in The Unicode Standard that references
either or both of these issues; if any such exist, kindly point me at
them.  My hunch is that these two problems, even though they are
completely consequential to Unicode, exist beyond the proper purview
of The Unicode Standard itself.  But that doesn't absolve us from
solving them.

Has this been previously discussed, and if so, what if any decision 
was made regarding these two interrelated problems?

Thank you very much.

--tom

    PS: The MIME contents of this message are as follows:

 msg part  type/subtype              size description                         
   1       multipart/mixed           8904
     1     text/plain                4071 a letter from tchrist               
     2     application/octet-stream  1560 the nftest(-v1).java program as octets
             name="nftest-v1.java"
             filename="nftest-v1.java"
     3     text/plain                1560 the nftest(-v2).java program as plain text

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 1560 bytes
Desc: the nftest(-v1).java program as octets
Url : http://mail.openjdk.java.net/pipermail/i18n-dev/attachments/20110719/33d27ac5/attachment.obj 
-------------- next part --------------
/* 
 *  nftest.java
 *  Tom Christiansen <tchrist at perl.com>
 *  Tue Jul 19 08:13:29 MDT 2011
 *
 *    This tests whether Java normalizes its variable names.
 *    We will use four different canonically equivalent strings,
 *    and see if we can get four different answers, or a compilation
 *    error.
 *
 *      N  String    As a literal        Graphemes   Chars    Norm?
 *     =============================================================
 *      1  élève    "\x{E9}l\x{E8}ve"        5         5      NFC
 *      2  élève    "e\x{301}le\x{300}ve"    5         7      NFD
 *      3  élève    "\x{E9}le\x{300}ve"      5         6      mixed
 *      4  élève    "e\x{301}l\x{E8}ve"      5         6      mixed
 */

import java.io.*;

public class nftest { 
    static PrintStream stdout;

    public static void main(String argv[]) 
        throws IOException
    { 
        int élève = 1;   // "\x{E9}l\x{E8}ve"       NFC
        int élève = 2;   // "e\x{301}le\x{300}ve"   NFD
        int élève = 3;   // "\x{E9}le\x{300}ve"     mixed
        int élève = 4;   // "e\x{301}l\x{E8}ve"     mixed

        stdout = new PrintStream(System.out, true, "UTF-8");

        stdout.printf("%s is %d.\n", "élève", élève); // "\x{E9}l\x{E8}ve"       NFC
        stdout.printf("%s is %d.\n", "élève", élève); // "e\x{301}le\x{300}ve"   NFD
        stdout.printf("%s is %d.\n", "élève", élève); // "\x{E9}le\x{300}ve"     mixed
        stdout.printf("%s is %d.\n", "élève", élève); // "e\x{301}l\x{E8}ve"     mixed

    }
}