<i18n dev> Is(n't) this a Java Unicode compiler bug? [4=OSCON]
Xueming Shen
xueming.shen at oracle.com
Tue Jul 19 10:31:23 PDT 2011
Tom,
JLS 3.8 [1] Identifiers states:
"Two identifiers are the same only if they are identical, that is, have
the same Unicode character for each letter or digit.
Identifiers that have the same external appearance may yet be different.
For example, the identifiers consisting of the single letters LATIN
CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK
CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430)
and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all different.
Unicode composite characters are different from the decomposed
characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1)
could be considered to be the same as a LATIN CAPITAL LETTER A
(A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301)
when sorting, but these are different in identifiers."
We happened to have a short discussion about this section a couple of
days ago (Alex is working on the latest JLS, and we were discussing the
possibility of rewording this section a little), so at least for now
those are NOT duplicate identifiers according to the Java Language
Specification. It might be an implementation issue, though, if the file
system works in a different way, as you suggested.
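To make the JLS rule concrete, here is a minimal sketch. It is not Tom's nftest.java; the class and method names are invented for illustration. It declares four local variables whose names differ only in normalization, written with Unicode escapes so the source stays pure ASCII, and then uses java.text.Normalizer to show that the spellings are nonetheless canonically equivalent.

```java
import java.text.Normalizer;

public class NfIdentifiers {
    // Four locals whose names all render as "eleve" with an acute and a
    // grave accent. Unicode escapes keep this source file pure ASCII.
    static int sumOfFourLocals() {
        int \u00e9l\u00e8ve = 1;    // precomposed e-acute, precomposed e-grave
        int e\u0301le\u0300ve = 2;  // e + combining acute, e + combining grave
        int \u00e9le\u0300ve = 3;   // precomposed first, decomposed second
        int e\u0301l\u00e8ve = 4;   // decomposed first, precomposed second
        // All four coexist in one scope: the compiler compares code points only.
        return \u00e9l\u00e8ve + e\u0301le\u0300ve + \u00e9le\u0300ve + e\u0301l\u00e8ve;
    }

    // True when two strings are canonically equivalent (equal after NFC).
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00e9l\u00e8ve";
        String decomposed = "e\u0301le\u0300ve";
        System.out.println(sumOfFourLocals());                           // 10
        System.out.println(composed.equals(decomposed));                 // false
        System.out.println(canonicallyEquivalent(composed, decomposed)); // true
    }
}
```

The compiler treats the names as distinct because they differ code point by code point, which is what JLS 3.8 prescribes. Only an explicit normalization step makes them compare equal.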
-Sherman
[1] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8
On 7/19/2011 7:52 AM, Tom Christiansen wrote:
> In preparation for a short talk I'm to give in Portland next week about how
> different operating systems, filesystems, and languages (including but not
> limited to regexes) handle Unicode, I got to thinking about normalization
> issues. And I think I've found a Java compiler bug, or at best, an
> infelicity in a grey area.
>
> It's no consolation, but Perl has exactly the same problem (well, pair of
> problems) as Java has here. We do the same thing as Java, which I think is
> the Wrong Thing, and we are also at the mercy of our filesystem for mapping
> of classnames to filesystem objects, which is even worse.
>
> I would like someone to tell me why Java shouldn't be fixed to cope with
> these matters, both as internal identifiers and as those that exist outside
> Java proper, in the filesystem (classnames).
>
> After reading about the differences between how Apple and
> Sun did normalization in the filesystem:
>
> http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf
>
> I wondered what impact this might/must have on Java. After all,
> classnames must map to filesystem entries, and therefore if the system
> is doing any kind of normalization, you're going to have Issues. Apple
> runs everything through NFD (well, nearly), whereas the Sun paper cited
> from about five years ago says that they plan to do something analogous
> to how "case-preserving but case-insensitive" filesystems behave: that is,
> they'll let you create anything you want, but won't let you create a new
> entry in the same directory if they are canonically equivalent.
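The "normalization-preserving but normalization-insensitive" behaviour described above can be modelled in a few lines. This is only a toy in-memory sketch of the idea, not how any real filesystem implements it; the class name, the method names, and the choice of NFD as the comparison key are all assumptions made for the example.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class NormalizingDirectory {
    // Key: NFD form, used only for comparison. Value: the name as created.
    private final Map<String, String> entries = new LinkedHashMap<>();

    private static String key(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFD);
    }

    // Returns true if the entry was created, false if a canonically
    // equivalent name already exists in this directory.
    boolean create(String name) {
        return entries.putIfAbsent(key(name), name) == null;
    }

    // Looks a name up under any equivalent spelling and returns the
    // spelling that was originally stored, or null if there is none.
    String storedName(String name) {
        return entries.get(key(name));
    }

    public static void main(String[] args) {
        NormalizingDirectory dir = new NormalizingDirectory();
        System.out.println(dir.create("\u00e9l\u00e8ve.class"));   // true
        System.out.println(dir.create("e\u0301le\u0300ve.class")); // false
    }
}
```

Under such a model, two class files whose names differ only in normalization cannot share a directory, even though the compiler accepts both names as distinct identifiers.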
>
> Before I went so far as to test this on Apple and Sun machines, let
> alone others, I thought I would just try my test on local variables
> instead. I have now tested it on Sun, Apple, and Linux, including
> various versions of the compiler, and they all report the same thing.
> And the thing they report, I feel, is wrong, because I know that it
> will not work this way for class names the way it will for local variables.
>
> I will include the source code twice, once as plain text so you can
> read it, and once as an octet stream lest a "helpful" mailer decides
> it should be normalizing things that pass through it, an evil that the
> Apple mouse will do to you believe it or not.
>
> If you put this wicked file in a file called "nftest.java" and run
> this command:
>
> $ javac -encoding UTF-8 nftest.java && java nftest
>
> Then you will get this output:
>
> élève is 1.
> élève is 2.
> élève is 3.
> élève is 4.
>
> Those probably look the same. Running them through `uniquote -x` shows
> though that they are not:
>
> \x{E9}l\x{E8}ve is 1.
> e\x{301}le\x{300}ve is 2.
> \x{E9}le\x{300}ve is 3.
> e\x{301}l\x{E8}ve is 4.
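For readers without the `uniquote` tool, a similar view can be approximated in Java. The sketch below is not a port of uniquote; `Uniquote.escape` is a hypothetical helper that only imitates the \x{...} notation shown above by escaping every code point outside ASCII.

```java
public class Uniquote {
    // Replaces each non-ASCII code point with \x{HEX}; ASCII passes through.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("\\x{%X}", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The four spellings from the quoted output, as string literals.
        System.out.println(escape("\u00e9l\u00e8ve is 1."));
        System.out.println(escape("e\u0301le\u0300ve is 2."));
        System.out.println(escape("\u00e9le\u0300ve is 3."));
        System.out.println(escape("e\u0301l\u00e8ve is 4."));
    }
}
```

Run on these four strings, the helper makes the precomposed and decomposed spellings visibly different even when a terminal renders them identically.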
>
> See the difference? Those are variable names, and I do not think Java
> should permit duplicate variable names that differ only in normalization,
> since it obviously cannot be permitted to do so for classnames, and it
> feels hackish to have different identifier rules for classnames than for
> other variables.
>
> Is this a bug? If so, are there plans to address it?
> And what about the filesystem?
>
> I am unaware of any document in The Unicode Standard that references
> either or both of these issues; if any such exist, kindly point me at
> them. My hunch is that these two problems, even though they are
> completely consequential to Unicode, exist beyond the proper purview
> of The Unicode Standard itself. But that doesn't absolve us from
> solving them.
>
> Has this been previously discussed, and if so, what if any decision
> was made regarding these two interrelated problems?
>
> Thank you very much.
>
> --tom
>
> PS: The MIME contents of this message are as follows:
>
>   msg  part  type/subtype              size  description
>     1        multipart/mixed           8904
>          1   text/plain                4071  a letter from tchrist
>          2   application/octet-stream  1560  the nftest(-v1).java program as octets
>                                              name="nftest-v1.java"
>                                              filename="nftest-v1.java"
>          3   text/plain                1560  the nftest(-v2).java program as plain text
>