funny characters in identifiers?

BGB cr88192 at gmail.com
Fri Dec 31 14:36:57 PST 2010


On 12/31/2010 2:25 PM, Per Bothner wrote:
> On 12/28/2010 01:58 PM, Charles Oliver Nutter wrote:
>> On Tue, Dec 28, 2010 at 12:21 PM, Per Bothner<per at bothner.com>   wrote:
>>> Is there a plan/consensus for how to handle "illegal" characters
>>> in identifiers?  I'm primarily interested in the bytecode level,
>>> not the Java source level.  For example identifiers like '/'
>>> used for division in Scheme.  It would be good to have a standard
>>> way to deal with this.
>> See John Rose's post on this here:
>> http://blogs.sun.com/jrose/entry/symbolic_freedom_in_the_vm
>>
>> We have implemented it in JRuby, and it works well. The down side is
>> that Java backtraces can be a little hard to read when there's lots of
>> symbolic identifiers.
> A problem with this mangling is that it isn't "safe" for class names,
> or at least not for class files.   Using '\' in a filename is obviously
> problematical, especially on Windows.  On Posix-based file systems the
> funny characters are in principle allowed, but will of course be awkward
> to access from shells and other tools.
>
> Windows disallows the following in file names:
> <  (less than)
> >  (greater than)
> : (colon)
> " (double quote)
> / (forward slash)
> \ (backslash)
> | (vertical bar or pipe)
> ? (question mark)
> * (asterisk)
> http://msdn.microsoft.com/en-us/library/aa365247(v=vs.85).aspx
> (And of course we have problems with case-insensitive file systems.)
>
> Now of course we can use an annotation to specify the source class name
> in case the source class name is invalid - but then we still need to
> mangle the class name somehow.
>
> I think a better prefix character would be '%'.  It's not reserved
> for Posix or Windows or JVM, while not being valid in a Java identifier.
> Even better might be '~' or '!' since those are also unreserved for URIs.
> I will assume '~' in the following.
>
> If we want names that are "safe for filenames" or even "safe for URIs"
> then the problem is that there are too many unsafe characters to
> encode as '~' followed by a safe non-alphanumeric.  Which means that
> we need to use '~' followed by a *letter*.
>
> For example:
> '/' ->  '~s' (mnemonic: slash)
> '.' ->  '~d' (dot)
> '<' ->  '~l' (less)
> etc etc
>
> What about non-ASCII characters?  I don't know enough to know if
> such characters might cause a problem, but I don't know of any reason
> they would.  They might technically be disallowed by URIs, but my
> impression is that %-mangling is handled somewhat universally and
> semi-transparently.
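
for reference, a rough sketch (in Java) of the '~' + letter scheme quoted above. only the three mappings given in the quote ('/', '.', '<') are taken from the proposal; the rest of the table is left out because the quote stops at "etc etc". the '~' -> '~t' entry and the names TildeMangler, mangle and demangle are assumptions of mine, added only so the prefix character itself survives a round trip.

```java
// Sketch only: TildeMangler and its method names are hypothetical.
final class TildeMangler {
    // Returns the escape letter for c, or 0 if c is passed through unchanged.
    static char escapeLetter(char c) {
        switch (c) {
            case '/': return 's'; // slash (from the quoted proposal)
            case '.': return 'd'; // dot   (from the quoted proposal)
            case '<': return 'l'; // less  (from the quoted proposal)
            case '~': return 't'; // ASSUMPTION: escape the prefix itself
            default:  return 0;
        }
    }

    // Inverse of escapeLetter; rejects letters that are not in the table.
    static char unescapeLetter(char letter) {
        switch (letter) {
            case 's': return '/';
            case 'd': return '.';
            case 'l': return '<';
            case 't': return '~';
            default:
                throw new IllegalArgumentException("unknown escape: ~" + letter);
        }
    }

    // Replaces each mapped character with '~' followed by its letter.
    static String mangle(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            char letter = escapeLetter(c);
            if (letter != 0) {
                out.append('~').append(letter);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses mangle(); a '~' must always be followed by a known letter.
    static String demangle(String mangled) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < mangled.length(); i++) {
            char c = mangled.charAt(i);
            if (c == '~') {
                if (i + 1 >= mangled.length()) {
                    throw new IllegalArgumentException("dangling '~'");
                }
                out.append(unescapeLetter(mangled.charAt(++i)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

with this table, the Scheme division identifier '/' would come out as '~s', and demangle() gives back the original name.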

just my quick comment...

in my VM, I ended up using a variation on JNI name-mangling for pretty 
much anything needing mangling (including filenames...).

however, I did add a few additional escapes (for a few other common 
characters), and ended up adding a _9xx escape in addition to the _0xxxx 
escape.

list of other escapes:
'_' with '_1';
';' with '_2';
'[' with '_3';
'(' with '_4';
')' with '_5';
'/' with '_6'.

as well, '__' was used as a string-break (mostly when encoding a list of 
strings as a single token).
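
a minimal sketch of what this looks like, written in Java for illustration (it is not the actual code from my VM). the _1 to _6 table and the '__' string-break are as listed above, and _0xxxx is the usual JNI four-hex-digit escape. the class and method names are made up, and treating _9xx as a two-hex-digit escape for any other character below 0x100 is an assumption made for the sketch.

```java
// Sketch only: JniStyleMangler, mangle and joinList are hypothetical names.
final class JniStyleMangler {
    // Mangles one string: ASCII letters and digits pass through unchanged,
    // everything else becomes an '_' escape.
    static String mangle(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean alnum = (c >= 'a' && c <= 'z')
                         || (c >= 'A' && c <= 'Z')
                         || (c >= '0' && c <= '9');
            if (alnum) {
                out.append(c);
                continue;
            }
            switch (c) {
                case '_': out.append("_1"); break;
                case ';': out.append("_2"); break;
                case '[': out.append("_3"); break;
                case '(': out.append("_4"); break;
                case ')': out.append("_5"); break;
                case '/': out.append("_6"); break;
                default:
                    if (c < 0x100) {
                        // ASSUMPTION: _9xx carries two hex digits
                        out.append(String.format("_9%02x", (int) c));
                    } else {
                        // JNI-style escape: four hex digits of the UTF-16 unit
                        out.append(String.format("_0%04x", (int) c));
                    }
            }
        }
        return out.toString();
    }

    // Encodes a list of strings as one token, using "__" as the string-break.
    // A mangled string never contains "__", since every '_' it emits is
    // followed by a digit.
    static String joinList(String... parts) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append("__");
            }
            out.append(mangle(parts[i]));
        }
        return out.toString();
    }
}
```

under these assumptions "foo/bar" becomes "foo_6bar", and the list ("a", "b_c") becomes the single token "a__b_1c".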

so, nothing says something similar couldn't be used in the class 
filenames if needed as well...


dunno if this helps for anything...



More information about the mlvm-dev mailing list