Does Java.g [version 1.0.6] handle unicode characters?

Yang Jiang yang.jiang.z at gmail.com
Sat Aug 28 05:46:48 PDT 2010


Are you referring to these versions as they appear in the comment?

  *  Version 1.0.5 -- Terence, June 21, 2007
  *  --a[i].foo didn't work. Fixed unaryExpression
  *
  *  Version 1.0.6 -- John Ridgway, March 17, 2008
  *      Made "assert" a switchable keyword like "enum".
  *      Fixed compilationUnit to disallow "annotation importDeclaration ...".


The Java.g at
http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
(let's call it the openjdk Java.g) is different from the Version 1.0.5
or 1.0.6 you are referring to, although it is derived from the same
source: the one written by Terence (let's call it the Terence Java.g).

If you read the comment in the openjdk Java.g carefully, you'll notice
the line "Below are comments found in the original version." Everything
below that line is the comment from the Terence Java.g. It is kept that
way to respect the copyright.

The openjdk Java.g was developed at Sun and is very well tested. It also
has some significant changes compared to the Terence Java.g. One of
those changes is what you have noticed: the grammar does not handle
Unicode escape sequences, because this rule was taken out of the
Terence Java.g:

UnicodeEscape
     :   '\\' 'u' HexDigit HexDigit HexDigit HexDigit
     ;

You have probably read about the other changes in the comment.

That said, although it comes from the same source as your version 1.0.5,
it should not be considered a successor to 1.0.5 or 1.0.6. You might
want to do more research before upgrading.
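
As the earlier reply quoted below explains, Unicode escapes have to be
turned into characters before the text reaches the Java.g lexer. Here is
a minimal sketch of such a pre-translation step. The class name
UnicodeEscapeTranslator and its translate method are made up for
illustration; they are not part of Java.g, ANTLR, or javac, and the
sketch does not cover every corner case that javac handles (for
example, it copies malformed escapes through instead of reporting an
error).

```java
// Hypothetical helper, not part of Java.g or ANTLR. It translates Unicode
// escapes (a backslash, one or more 'u', then four hex digits) into the
// characters they stand for, so a lexer never sees the escape itself.
final class UnicodeEscapeTranslator {

    static String translate(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int n = src.length();
        int i = 0;
        while (i < n) {
            char ch = src.charAt(i);
            if (ch != '\\') {
                out.append(ch);
                i++;
                continue;
            }
            // Skip the run of 'u' characters that may follow the backslash.
            int j = i + 1;
            while (j < n && src.charAt(j) == 'u') {
                j++;
            }
            boolean hasU = j > i + 1;
            int value = hasU ? hexValue(src, j) : -1;
            if (value >= 0) {
                // Well-formed escape: emit the single character it denotes.
                out.append((char) value);
                i = j + 4;
            } else if (hasU) {
                // Malformed escape: javac rejects this; the sketch copies it.
                out.append(ch);
                i++;
            } else {
                // Plain backslash. If it starts a pair of backslashes, copy
                // both so the second is not mistaken for the start of an escape.
                out.append(ch);
                i++;
                if (i < n && src.charAt(i) == '\\') {
                    out.append('\\');
                    i++;
                }
            }
        }
        return out.toString();
    }

    // Returns the value of four hex digits starting at 'start', or -1.
    private static int hexValue(String s, int start) {
        if (start + 4 > s.length()) {
            return -1;
        }
        int value = 0;
        for (int k = start; k < start + 4; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) {
                return -1;
            }
            value = value * 16 + d;
        }
        return value;
    }

    public static void main(String[] args) {
        String translated = translate("char c = '\\u0096';");
        // Index 10 is the character between the single quotes.
        System.out.println(Integer.toHexString(translated.charAt(10))); // prints 96
    }
}
```

The translated string is what you would then hand to the lexer in place
of the raw file contents. Note that this changes column positions on any
line containing an escape, so error locations will be off unless you
track the offsets yourself.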

yang



On 08/28/2010 08:12 PM, Roberto Mannai wrote:
> So are you saying that if I have the following source file:
> public class TestUnicode {
>          public static void test (String[] args){
>                  char c = '\u0096';
>          }
> }
> which of course compiles, Java.g is not supposed to parse it? Now it
> does not work.
>
> Please note that version 1.0.5 handled it correctly (I'm doing some
> regression tests to understand whether to migrate to v. 1.0.6 or
> not).
>
> Thanks for your suggestions,
> Roberto
>
> On Sat, Aug 28, 2010 at 1:11 PM, Yang Jiang<yang.jiang.z at gmail.com>  wrote:
>    
>> I looked at the grammar again and there is a misunderstanding here.
>>
>> If you read the grammar carefully, in particular the STRINGLITERAL part,
>> you'll notice that the grammar was never meant to handle input like '\unnnn'.
>>
>> What happens when you parse a Java program is that the input is first fed
>> into a tokenizer, and the tokenizer then emits tokens to the parser. Inputs
>> like '\unnnn' are transformed into the corresponding characters before they
>> reach the parser, either in the tokenizer or even before the tokenizer. This
>> is how the Sun javac parser does it, and Java.g is designed to act this way
>> too. If you are going to use Java.g and handle inputs like '\unnnn', you'll
>> have to implement this process yourself.
>>
>> I guess you are using antlrworks to test the grammar, but antlrworks won't
>> do the transformation for you, which is why you get the error.
>>
>> Then what does this comment mean, and why is it worded like this?
>>
>>   Won't pass input containing unicode sequence like this
>>       char c = '\uffff'
>>       String s = "\uffff";
>>
>>
>> '\uffff' is like any other char, say 'a', 'b' or '"', only it has no visual
>> representation. It can still be present in any Java file.
>> What this section really means is that Antlr cannot handle the character
>> represented by '\uffff'.
>>
>> Hope this solves your problem.
>>
>> y
>>
>>
>>
>>
>> On 08/28/2010 06:04 PM, Roberto Mannai wrote:
>>      
>>> Yes, it does not work with '\u0096'. So am I supposed to (re)open a bug?
>>> Where?
>>>
>>> On Sat, Aug 28, 2010 at 9:42 AM, Yang Jiang<yang.jiang.z at gmail.com>
>>>   wrote:
>>>
>>>        
>>>> You can change '\uffff' to some other valid char like '\u0096'.
>>>> If that works, then it looks like the problem has come back in Antlr 3.2.
>>>>
>>>>
>>>> yang
>>>>
>>>> On 08/28/2010 03:19 PM, Roberto Mannai wrote:
>>>>
>>>>          
>>>>> Hello
>>>>>
>>>>> [I sent the following message to the antlr-interest mailing list,
>>>>> sorry for the cross posting]
>>>>>
>>>>> I'm trying to understand whether the Java grammar from
>>>>> http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
>>>>> correctly processes Unicode chars or not.
>>>>>
>>>>> In the file's header I read:
>>>>> <<
>>>>>   *  Know problems:
>>>>>   *    Won't pass input containing unicode sequence like this
>>>>>   *      char c = '\uffff'
>>>>>   *      String s = "\uffff";
>>>>>   *    Because Antlr does not treat '\uffff' as an valid char. This
>>>>> will be fixed in the next Antlr
>>>>>   *    release. [Fixed in Antlr-3.1.1]
>>>>>
>>>>>
>>>>> >>
>>>>>
>>>>> So, it seems that antlr 3.2 should handle the Unicode charset. Anyway,
>>>>> when I try to parse the following class:
>>>>>
>>>>> public class TestUnicode {
>>>>>          public static void test (String[] args){
>>>>>                  char c = '\uffff';
>>>>>          }
>>>>> }
>>>>>
>>>>> I get the following error:
>>>>>       line 3:27 no viable alternative at character 'u'
>>>>>       line 3:34 mismatched character '\r' expecting '''
>>>>>       line 1:7 mismatched input 'class' expecting MONKEYS_AT
>>>>>       line 2:22 mismatched input 'void' expecting MONKEYS_AT
>>>>>       line 3:21 mismatched input 'c' expecting DOT
>>>>>       line 3:23 no viable alternative at input '='
>>>>>       line 4:8 no viable alternative at input '}'
>>>>>       line 4:8 no viable alternative at input '}'
>>>>>
>>>>> If I replace the unicode character it of course works. Am I missing
>>>>> anything? Please note that version 1.0.5 didn't have this problem.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> Roberto
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>
>>      
