Does Java.g [version 1.0.6] handle unicode characters?

Yang Jiang yang.jiang.z at gmail.com
Sat Aug 28 04:11:27 PDT 2010


I looked at the grammar again and there is a misunderstanding here.

If you read the grammar carefully, in particular the STRINGLITERAL rule,
you'll notice that it was never meant to handle input like '\unnnn'.

What happens when you parse a Java program is that the input is first 
fed into a tokenizer, and the tokenizer then emits tokens to the parser. 
Input like '\unnnn' is translated into the corresponding character before 
it reaches the parser, either in the tokenizer or even before the 
tokenizer. This is how the Sun javac parser does it, and Java.g is 
designed to work the same way. If you are going to use Java.g and need 
to handle input like '\unnnn', you'll have to implement this translation 
yourself.
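
As a rough illustration of that translation step, here is a minimal
sketch of a pass that could run over the source text before it reaches
the lexer. The class name UnicodeEscapes and the method translate are
made up for this example; they are not part of Java.g or Antlr. The
sketch assumes the usual rule that a backslash starts an escape only
when it is preceded by an even number of backslashes, and it simply
copies malformed escapes through unchanged, where javac would report
an error.

```java
/**
 * Sketch of a pre-lexing pass: turns Unicode escapes (a backslash, one or
 * more 'u', then four hex digits) into the characters they denote.
 * Class and method names are illustrative only.
 */
final class UnicodeEscapes {

    private UnicodeEscapes() {}

    /** Returns src with every well-formed, eligible Unicode escape decoded. */
    static String translate(String src) {
        int n = src.length();
        StringBuilder out = new StringBuilder(n);
        int backslashRun = 0; // raw backslashes copied immediately before position i
        int i = 0;
        while (i < n) {
            char ch = src.charAt(i);
            // A backslash is eligible only after an even number of backslashes.
            if (ch == '\\' && backslashRun % 2 == 0
                    && i + 1 < n && src.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < n && src.charAt(j) == 'u') {
                    j++; // any number of 'u' characters is allowed
                }
                if (j + 4 <= n && isFourHexDigits(src, j)) {
                    out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0; // a decoded char never starts another escape
                    continue;
                }
                // Malformed escape: fall through and copy it unchanged (assumption).
            }
            backslashRun = (ch == '\\') ? backslashRun + 1 : 0;
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    /** True if the four chars starting at index from are ASCII hex digits. */
    private static boolean isFourHexDigits(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }
}
```

You would run the file contents through UnicodeEscapes.translate(...)
and hand the result to the lexer, for example by wrapping it in an
ANTLRStringStream. Keep in mind that line and column positions reported
afterwards refer to the translated text, not to the original file.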

I guess you are using ANTLRWorks to test the grammar, but ANTLRWorks 
won't do the translation for you, which is why you get the error.

So what does this note in the grammar header mean, and why is it worded 
like this?

  Won't pass input containing unicode sequence like this
       char c = '\uffff'
       String s = "\uffff";


Here '\uffff' stands for the character it represents, which is a char 
like any other (say 'a', 'b' or '"'), only it has no visual 
representation. It can still be present in any Java file. What the note 
really means is that Antlr cannot handle the character represented by 
'\uffff'.

Hope this solves your problem.

y




On 08/28/2010 06:04 PM, Roberto Mannai wrote:
> Yes, it does not work with '\u0096'. So am I supposed to (re)open a bug? Where?
>
> On Sat, Aug 28, 2010 at 9:42 AM, Yang Jiang<yang.jiang.z at gmail.com>  wrote:
>    
>> You can change '\uffff' to some other valid char like '\u0096'.
>> If that works, then it looks like the problem is back in Antlr 3.2.
>>
>>
>> yang
>>
>> On 08/28/2010 03:19 PM, Roberto Mannai wrote:
>>      
>>> Hello
>>>
>>> [I sent the following message to the antlr-interest mailing list,
>>> sorry for the cross posting]
>>>
>>> I'm trying to understand whether the Java grammar from
>>> http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
>>> processes Unicode chars correctly or not.
>>>
>>> In the file's header I read:
>>> <<
>>>   *  Know problems:
>>>   *    Won't pass input containing unicode sequence like this
>>>   *      char c = '\uffff'
>>>   *      String s = "\uffff";
>>>   *    Because Antlr does not treat '\uffff' as an valid char. This
>>> will be fixed in the next Antlr
>>>   *    release. [Fixed in Antlr-3.1.1]
>>>
>>> >>
>>>
>>> So, it seems that antlr 3.2 should handle the Unicode charset. Anyway,
>>> when I try to parse the following class:
>>>
>>> public class TestUnicode {
>>>          public static void test (String[] args){
>>>                  char c = '\uffff';
>>>          }
>>> }
>>>
>>> I get the following error:
>>>       line 3:27 no viable alternative at character 'u'
>>>       line 3:34 mismatched character '\r' expecting '''
>>>       line 1:7 mismatched input 'class' expecting MONKEYS_AT
>>>       line 2:22 mismatched input 'void' expecting MONKEYS_AT
>>>       line 3:21 mismatched input 'c' expecting DOT
>>>       line 3:23 no viable alternative at input '='
>>>       line 4:8 no viable alternative at input '}'
>>>       line 4:8 no viable alternative at input '}'
>>>
>>> If I replace the unicode character it of course works. Am I missing
>>> anything? Please note that version 1.0.5 didn't have this problem.
>>>
>>> Thanks for your help.
>>>
>>> Roberto
>>>
>>>        
>>
>>      


