funny characters in identifiers?

Per Bothner per at bothner.com
Fri Dec 31 13:25:38 PST 2010


On 12/28/2010 01:58 PM, Charles Oliver Nutter wrote:
> On Tue, Dec 28, 2010 at 12:21 PM, Per Bothner<per at bothner.com>  wrote:
>> Is there a plan/consensus for how to handle "illegal" characters
>> in identifiers?  I'm primarily interested in the bytecode level,
>> not the Java source level.  For example identifiers like '/'
>> used for division in Scheme.  It would be good to have a standard
>> way to deal with this.
>
> See John Rose's post on this here:
> http://blogs.sun.com/jrose/entry/symbolic_freedom_in_the_vm
>
> We have implemented it in JRuby, and it works well. The down side is
> that Java backtraces can be a little hard to read when there's lots of
> symbolic identifiers.

A problem with this mangling, which uses '\' as its escape character,
is that it isn't "safe" for class names, or at least not for class
files.  Using '\' in a filename is obviously problematic, especially
on Windows.  On Posix-based file systems the funny characters are in
principle allowed, but will of course be awkward to access from shells
and other tools.

Windows disallows the following in file names:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
http://msdn.microsoft.com/en-us/library/aa365247(v=vs.85).aspx
(And of course we have problems with case-insensitive file systems.)
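
As a rough illustration of the list above, here is a minimal Java sketch
that checks whether a simple (unqualified) class name contains any of the
nine characters Windows rejects.  The class and method names are
hypothetical, and the check deliberately ignores other Windows rules
(reserved device names, trailing dots, case folding).

```java
// Sketch only: WindowsNameCheck and isSafeSimpleName are made-up names.
final class WindowsNameCheck {
    // The nine characters from the list above: < > : " / \ | ? *
    private static final String RESERVED = "<>:\"/\\|?*";

    // Returns true if no character of simpleName is in the reserved set.
    static boolean isSafeSimpleName(String simpleName) {
        for (int i = 0; i < simpleName.length(); i++) {
            if (RESERVED.indexOf(simpleName.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

A Scheme-style name such as "/" or "list->vector" fails this check, so it
would need mangling before it could name a class file on Windows.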

Now of course we can use an annotation to specify the source class name
in case the source class name is invalid - but then we still need to
mangle the class name somehow.

I think a better prefix character would be '%'.  It's not reserved
for Posix or Windows or JVM, while not being a valid Java character.
Even better might be '~' or '!' since those are also unreserved for URIs.
I will assume '~' in the following.

If we want names that are "safe for filenames" or even "safe for URIs"
then the problem is that there are too many unsafe characters to
encode as '~' followed by a safe non-alphanumeric.  Which means that
we need to use '~' followed by a *letter*.

For example:
'/' -> '~s' (mnemonic: slash)
'.' -> '~d' (dot)
'<' -> '~l' (less)
etc etc
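
To make the idea concrete, here is a minimal Java sketch of such a
'~'-plus-letter mangler.  Only the three mappings listed above come from
this proposal; the class name, the method names, and the choice of '~t'
to escape a literal '~' are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: TildeMangler and its methods are hypothetical names.
final class TildeMangler {
    private static final char ESCAPE = '~';
    private static final Map<Character, Character> ENCODE = new HashMap<>();
    private static final Map<Character, Character> DECODE = new HashMap<>();

    private static void register(char unsafe, char letter) {
        ENCODE.put(unsafe, letter);
        DECODE.put(letter, unsafe);
    }

    static {
        register('/', 's');  // slash (from the proposal)
        register('.', 'd');  // dot   (from the proposal)
        register('<', 'l');  // less  (from the proposal)
        register('~', 't');  // assumption: the escape character must itself be escaped
    }

    // Replace each registered unsafe character with '~' followed by its letter.
    static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            Character letter = ENCODE.get(c);
            if (letter != null) {
                sb.append(ESCAPE).append(letter.charValue());
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverse of mangle; rejects a dangling '~' or an unknown escape letter.
    static String demangle(String mangled) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < mangled.length(); i++) {
            char c = mangled.charAt(i);
            if (c != ESCAPE) {
                sb.append(c);
                continue;
            }
            if (i + 1 >= mangled.length()) {
                throw new IllegalArgumentException("dangling escape in: " + mangled);
            }
            Character original = DECODE.get(mangled.charAt(++i));
            if (original == null) {
                throw new IllegalArgumentException("unknown escape in: " + mangled);
            }
            sb.append(original.charValue());
        }
        return sb.toString();
    }
}
```

With this table, TildeMangler.mangle("/") gives "~s", and demangle
reverses it.  Escaping '~' itself is what keeps the round trip
unambiguous for names that already contain a tilde.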

What about non-ASCII characters?  I don't know enough to know if
such characters might cause a problem, but I don't know of any reason
they would.  They might technically be disallowed by URIs, but my
impression is that %-mangling is handled somewhat universally and
semi-transparently.
-- 
	--Per Bothner
per at bothner.com   http://per.bothner.com/


More information about the mlvm-dev mailing list