Does Java.g [version 1.0.6] handle unicode characters?

Sat Aug 28 06:19:10 PDT 2010

Hi Yang
Thanks, now I get it. Although I did read those comments, I could not
infer from them the "significant changes compared to the Terence
Java.g": skipping standard comments and spaces, and no more Unicode
escape handling. IMHO they would be worth an explicit note in the file's
changelog section; otherwise they only surface through a textual diff
against Terence's version.

Thanks again,
Roberto

On Sat, Aug 28, 2010 at 2:46 PM, Yang Jiang <yang.jiang.z at gmail.com> wrote:
> Are you referring to these versions as they appear in the comment?
>
>  *  Version 1.0.5 -- Terence, June 21, 2007
>  *  --a[i].foo didn't work. Fixed unaryExpression
>  *
>  *  Version 1.0.6 -- John Ridgway, March 17, 2008
>  *      Made "assert" a switchable keyword like "enum".
>  *      Fixed compilationUnit to disallow "annotation importDeclaration ...".
>
>
> The Java.g at
> http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g (let's
> call it the openjdk Java.g)
> is different from the versions 1.0.5 and 1.0.6 you are referring to,
> although it's derived from the same source: the one written by Terence (let's
> call it the Terence Java.g).
>
> If you read the comment in the openjdk Java.g carefully, you'll notice this
> line: "Below are comments found in the original version." Everything below
> that line is the comment from the Terence Java.g. It's kept that way to
> respect the copyright.
>
> The openjdk Java.g was developed at Sun and is VERY well tested. It also has
> some significant changes compared to the Terence Java.g. One of those
> changes is what you have noticed: the grammar does not handle the
> Unicode escape representation of characters. This section was taken out of
> the Terence Java.g:
>
> UnicodeEscape
>     :   '\\' 'u' HexDigit HexDigit HexDigit HexDigit
>     ;
>
> You have probably read about the other changes in the comment.
>
> That said, although it's from the same source as your version 1.0.5, it
> should not be considered a successor to 1.0.5 or 1.0.6.  You might want to
> do more research before upgrading.
>
> yang
>
>
>
> On 08/28/2010 08:12 PM, Roberto Mannai wrote:
>
> So are you saying that if I have the following source file:
> public class TestUnicode {
>         public static void test (String[] args){
>                 char c = '\u0096';
>         }
> }
> which of course compiles, Java.g is not supposed to parse it? Now it
> does not work.
>
> Please note that with version 1.0.5 it was handled correctly (I'm
> doing some regression tests to understand whether to migrate to v. 1.0.6
> or not).
>
> Thanks for your suggestions,
> Roberto
>
> On Sat, Aug 28, 2010 at 1:11 PM, Yang Jiang <yang.jiang.z at gmail.com> wrote:
>
>
> I looked at the grammar again and there is a misunderstanding here.
>
> If you read the STRINGLITERAL part of the grammar carefully, you'll notice
> that the grammar is never supposed to handle input like '\unnnn'.
>
> What happens when you parse a Java program is that the input is first fed
> into a tokenizer, and the tokenizer then emits tokens to the parser. Inputs like
> '\unnnn' are transformed into the corresponding characters before they reach the
> parser, either in the tokenizer or even before the tokenizer. This is how the Sun
> javac parser does it, and Java.g is designed to act this way too. If you are
> going to use Java.g and handle inputs like '\unnnn', you'll have to implement
> this process yourself.
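
A minimal sketch of such a pre-tokenizer pass follows. It is not code from Java.g or javac; the class name UnicodeEscapes and the method translate are hypothetical, and the logic is a simplified reading of the translation rule in the Java Language Specification (a backslash starts an escape only when preceded by an even number of backslashes, and may be followed by one or more 'u' characters and exactly four hex digits). Malformed escapes are simply left untouched here, whereas javac would reject them.

```java
// Hypothetical helper: translates Unicode escapes in Java source text
// before the text is handed to a lexer generated from Java.g.
final class UnicodeEscapes {

    // Returns src with every well-formed backslash-u escape replaced by
    // the UTF-16 code unit it denotes. Malformed escapes are copied as-is.
    static String translate(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int n = src.length();
        int backslashRun = 0; // contiguous backslashes seen just before index i
        int i = 0;
        while (i < n) {
            char ch = src.charAt(i);
            boolean eligible = ch == '\\' && backslashRun % 2 == 0
                    && i + 1 < n && src.charAt(i + 1) == 'u';
            if (eligible) {
                int j = i + 1;
                while (j < n && src.charAt(j) == 'u') {
                    j++; // the escape may contain several 'u' characters
                }
                if (j + 4 <= n && isHex4(src, j)) {
                    out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0; // simplification: the result never counts as a backslash
                    continue;
                }
            }
            backslashRun = (ch == '\\') ? backslashRun + 1 : 0;
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    // True if the four characters starting at 'start' are ASCII hex digits.
    private static boolean isHex4(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps javac from translating the escape itself.
        System.out.println(translate("char c = '\\u0041';"));
    }
}
```

The translated string could then be wrapped in a character stream (for example ANTLR 3's ANTLRStringStream) and passed to the generated lexer. Note that line and column positions reported by the parser would then refer to the translated text, not the original file.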
>
> I guess you are using ANTLRWorks to test the grammar, but ANTLRWorks won't
> do the transformation for you; that's why you get the error.
>
> Then what does this mean, and why is it put like this?
>
>  Won't pass input containing unicode sequence like this
>      char c = '\uffff'
>      String s = "\uffff";
>
>
> '\uffff' is like any other char, say 'a', 'b' or '"', only it has no visual
> representation. But it can still be present in any Java file.
> What this section really means is that Antlr cannot handle the character
> represented by '\uffff'.
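
As a small illustration of that point (a sketch only; the class name MaxCharDemo is made up for this example), the following shows that '\uffff' is a legal Java char value, the largest one, even though Unicode assigns no character to it:

```java
// Hypothetical demo class: inspects a char's numeric value and whether
// Unicode assigns a character to that code point.
final class MaxCharDemo {

    static String describe(char c) {
        return String.format("U+%04X defined=%b", (int) c, Character.isDefined(c));
    }

    public static void main(String[] args) {
        System.out.println(describe('a'));
        System.out.println(describe('\uffff'));
        // The largest char value is exactly this code unit.
        System.out.println('\uffff' == Character.MAX_VALUE);
    }
}
```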
>
> Hope this solves your problem.
>
> y
>
>
>
>
> On 08/28/2010 06:04 PM, Roberto Mannai wrote:
>
>
> Yes, it does not work with '\u0096'. So am I supposed to (re)open a bug?
> Where?
>
> On Sat, Aug 28, 2010 at 9:42 AM, Yang Jiang<yang.jiang.z at gmail.com>
>  wrote:
>
>
>
> You can change '\uffff' to some other valid char, such as '\u0096'.
> If that works, then it looks like the problem is back in Antlr 3.2.
>
>
> yang
>
> On 08/28/2010 03:19 PM, Roberto Mannai wrote:
>
>
>
> Hello
>
> [I sent the following message to the antlr-interest mailing list,
> sorry for the cross posting]
>
> I'm trying to understand whether the Java grammar from
> http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
> correctly processes Unicode chars or not.
>
> In the file's header I read:
> <<
>  *  Know problems:
>  *    Won't pass input containing unicode sequence like this
>  *      char c = '\uffff'
>  *      String s = "\uffff";
>  *    Because Antlr does not treat '\uffff' as an valid char. This will be
>  *    fixed in the next Antlr release. [Fixed in Antlr-3.1.1]
>
>
>
>
>
>
> So, it seems that antlr 3.2 should handle the Unicode charset. Anyway,
> when I try to parse the following class:
>
> public class TestUnicode {
>         public static void test (String[] args){
>                 char c = '\uffff';
>         }
> }
>
> I get the following error:
>      line 3:27 no viable alternative at character 'u'
>      line 3:34 mismatched character '\r' expecting '''
>      line 1:7 mismatched input 'class' expecting MONKEYS_AT
>      line 2:22 mismatched input 'void' expecting MONKEYS_AT
>      line 3:21 mismatched input 'c' expecting DOT
>      line 3:23 no viable alternative at input '='
>      line 4:8 no viable alternative at input '}'
>      line 4:8 no viable alternative at input '}'
>
> If I replace the unicode character it of course works. Am I missing
> anything? Please note that version 1.0.5 didn't have this problem.
>
> Thanks for your help.
>
> Roberto