PRE-PROPOSAL: Source and Encoding keyword

Joseph D. Darcy Joe.Darcy at Sun.COM
Sat Mar 7 17:38:34 PST 2009


Reinier Zwitserloot wrote:
> We have written up a proposal for adding a 'source' and an 'encoding'
> keyword (alternatives to the -source and -encoding options on the
> command line; they work pretty much as you would expect). The keywords
> are context-sensitive and must both appear before anything other than
> comments in order to be parsed. In case the benefit isn't obvious: it
> is a great help when you are trying to port a big project to a new
> source language level. Leaving half your source base on v1.6 and the
> other half on v1.7 is pretty much impossible today; it's all-or-
> nothing. It should also be a much nicer solution to the 'assert in
> v1.4' dilemma, which I guess is going to happen with v1.7 as well,
> given that 'module' is most likely going to become a keyword.

Your assumption about 1.7 is incorrect.  As explained here

    http://openjdk.java.net/projects/jigsaw/doc/language.html

"module" will be a restricted keyword, that is, it is only a keyword in 
a module-info file and can otherwise still appear as the name of a 
method or field, etc.
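
For illustration (a made-up class, not an example from the Jigsaw
document), code like the following stays legal outside a module-info
file under the restricted-keyword approach:

    // "module" remains usable as an ordinary identifier here.
    class Deployment {
        private String module;        // a field named "module"

        String module() {             // a method named "module"
            return module;
        }
    }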

We do not plan to add any more keywords that would invalidate existing code.

> Finally, it
> makes Java files a lot more portable; you no longer run into your
> strings looking weird when you move your Windows-1252-encoded Java
> source to a Mac, for example.
>   

For good build hygiene, one should always specify -source and -encoding.
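
For example (the paths and file names are only illustrative):

    javac -source 1.6 -target 1.6 -encoding UTF-8 -d build src/com/example/Foo.java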

> Before we finish it, though, here are some open questions we'd like
> feedback on:
>
> A) Technically, starting a file with "source 1.4" is obviously silly;
> javac v1.4 doesn't know about the source keyword and would thus fail
> immediately. However, practically, it's still useful. Example: if
> you've mostly converted a GWT project to GWT 1.5 (which uses Java 1.5
> syntax), but have a few files remaining on GWT v1.4 (which uses Java
> 1.4 syntax), then tossing a "source 1.4;" into those older files
> eliminates all the generics warnings and serves as a reminder that you
> should still convert those at some point. However, it isn't -actually-
> compatible with a real javac 1.4. We're leaning toward making "source
> 1.6;" (and below) legal even when using a javac v1.7 or above, but
> perhaps that's a bridge too far? We could go with magic comments, but
> that seems like a very bad solution.
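
Under the proposed syntax, such a file might start like this (the syntax
is the pre-proposal's, not current Java; the class is made up):

    source 1.4;

    // Compiled with 1.4 language rules under the proposal, so the raw
    // collection types below produce no generics warnings for this file.
    public class LegacyWidget {
        private final java.util.List items = new java.util.ArrayList();
    }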
>
> also:
>
> Encoding is rather a hairy issue; javac will need to read the file to
> find the encoding, but to read a file, it needs to know the
> encoding! Fortunately, *every single* popular encoding on Wikipedia's
> list of popular character encodings at:
>
> http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings
>
> will encode "encoding own-name-in-that-encoding;" the same way ASCII
> would, except for KOI-7 and UTF-7 (both 7-bit encodings that I doubt
> anyone ever uses to write Java).
>   

Yes, not all encodings have the ASCII subset in a one-byte format.
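
A quick way to see the difference (an illustrative snippet, not part of
the proposal; the charset choices are arbitrary):

    import java.nio.charset.Charset;
    import java.util.Arrays;

    public class EncodingBytesCheck {
        public static void main(String[] args) {
            String decl = "encoding UTF-8;";
            byte[] ascii = decl.getBytes(Charset.forName("US-ASCII"));
            for (String name : new String[] {"UTF-8", "ISO-8859-1", "windows-1252", "UTF-16BE"}) {
                byte[] bytes = decl.getBytes(Charset.forName(name));
                System.out.println(name + " matches ASCII bytes: "
                        + Arrays.equals(ascii, bytes));
            }
        }
    }

The ASCII-compatible charsets print true; UTF-16BE prints false, which is
why the bootstrap strategy below has to try multi-byte decodings
separately.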

Both of these pre-proposals are far too complicated and offer too little 
value for inclusion in Project Coin.

-Joe


> Therefore, the proposal includes the following strategy to find the
> encoding statement in a Java source file without knowing the encoding
> beforehand:
>
> An entirely separate parser (the encoding parser) is run repeatedly
> until the right encoding is found. First it'll decode the input with
> ISO-8859-1. If that doesn't work, UTF-16 (assume BE if no BOM, as per
> the Java standard), then UTF-32 (BE if no BOM), then the current
> behaviour (the -encoding parameter's value if any, otherwise the
> platform default encoding). This separate parser works as follows:
>
> 1. Ignore any comments and whitespace.
> 2. Skip over the pattern (regexp-like syntax): source\s+[^\s]+\s*; - if
> that pattern matches partially but is not correctly completed, that
> parser run exits without finding an encoding, immediately.
> 3. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern matches
> partially but is not correctly completed, that parser run exits
> without finding an encoding, immediately. If it does complete, the
> parser also exits immediately and returns the captured value.
> 4. If it finds anything else, stop immediately, returning no encoding
> found.
>
> Once it's found something, the 'real' Java parser will run using the
> found encoding (this overrides any -encoding on the command line).
> Note that the encoding parser stops quickly; for example, if it finds
> a stray \0 or the letter 'i' (perhaps the first letter of an import
> statement), it stops immediately.
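
A rough, non-normative sketch of that candidate-decoding loop (class,
method, and pattern details are invented for illustration and gloss over
the partial-match rules described above):

    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class EncodingSniffer {
        // Comments/whitespace, an optional "source X;", more
        // comments/whitespace, then "encoding X;" with the name captured.
        private static final Pattern HEADER = Pattern.compile(
            "(?:\\s|//[^\\n]*\\n|/\\*.*?\\*/)*"
            + "(?:source\\s+\\S+\\s*;)?"
            + "(?:\\s|//[^\\n]*\\n|/\\*.*?\\*/)*"
            + "encoding\\s+(\\S+)\\s*;",
            Pattern.DOTALL);

        /** Returns the declared encoding name, or null if none is found. */
        static String sniff(byte[] fileBytes, Charset fallback) {
            Charset[] candidates = {
                Charset.forName("ISO-8859-1"), // pass 1
                Charset.forName("UTF-16"),     // pass 2 (BE assumed without a BOM)
                Charset.forName("UTF-32"),     // pass 3
                fallback                       // pass 4: -encoding or platform default
            };
            for (Charset cs : candidates) {
                Matcher m = HEADER.matcher(new String(fileBytes, cs));
                if (m.lookingAt()) {
                    return m.group(1);
                }
                // Anything else (a stray \0, an 'i' from "import", ...):
                // give up on this pass and try the next candidate.
            }
            return null;
        }
    }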
>
> If an encoding statement is found that the standard decoding strategy
> (ISO-8859-1, UTF-16, UTF-32) could not read, and that was readable only
> due to a platform default or command-line encoding parameter (e.g. a
> platform that defaults to UTF-16LE without a byte order mark), a
> warning is generated explaining that the encoding statement isn't doing
> anything. Of course, if the declared encoding doesn't match the file's
> actual encoding, you get an error (putting "encoding UTF-16;" into a
> UTF-8 encoded file, for example). If there is no encoding statement,
> the 'real' Java parser does what it does now: use the -encoding
> parameter of javac and, if that wasn't present, the platform default.
>
> However, there is one major and one minor problem with this approach:
>
> B) This means javac will need to read every source file many times to  
> compile it.
>
> Worst case (no encoding keyword): 5 times.
> Standard case with an encoding keyword: 2 times (3 times if UTF-16).
>
> Fortunately all runs should stop quickly, due to the encoding parser's
> penchant for quitting very early. Existing javac implementations either
> load the entire source file into memory or, failing that, the disk
> cache should take care of it, but we can't prove beyond a doubt that
> this repeated parsing will have no significant impact on compile time.
> Is this a showstopper? Is the need to include a new (but small) parser
> in javac a showstopper?
>
> C) Certain character sets, such as ISO-2022, can make the encoding
> statement unreadable with the standard strategy if a comment containing
> non-ASCII characters precedes the encoding statement. These situations
> are very rare (in fact, I haven't managed to find an example), so is
> it okay to just ignore this issue? If you add the encoding statement
> after a bunch of comments that make it invisible, and then compile
> with the right -encoding parameter, you WILL get a warning that the
> encoding statement isn't going to help a javac on another platform, or
> one without that encoding parameter, figure it out; so you just get the
> current status quo: your source file won't compile without an explicit
> -encoding parameter (or unless that happens to be the platform
> default). Should this be mentioned in the proposal? Should the compiler
> (and the proposal) put effort into generating a useful warning message,
> such as figuring out whether it WOULD parse correctly if the encoding
> statement were at the very top of the source file, vs. suggesting to
> recode in UTF-8?
>
> and a final dilemma:
>
> D) Should we separate the proposals for source and encoding keywords?  
> The source keyword is more useful and a lot simpler overall than the  
> encoding keyword, but they do sort of go together.
>
> --Reinier Zwitserloot and Roel Spilker
>
>   



