PRE-PROPOSAL: Source and Encoding keyword

Reinier Zwitserloot reinier at zwitserloot.com
Fri Mar 6 22:03:25 PST 2009


We have written up a proposal for adding a 'source' and 'encoding'  
keyword (in-file alternatives to the -source and -encoding options on  
the javac command line; they work pretty much as you would expect).  
The keywords are context-sensitive and must both appear before  
anything else other than comments to be parsed. In case the benefit  
isn't obvious: it is a great help when you are trying to port a big  
project to a new source language level. Leaving half your source base  
on v1.6 and the other half on v1.7 is pretty much impossible today;  
it's all-or-nothing. It should also be a much nicer solution to the  
'assert in v1.4' dilemma, which I guess is going to repeat itself in  
v1.7, given that 'module' is most likely going to become a keyword.  
Finally, it makes Java files a lot more portable; you no longer run  
into your strings looking weird when you move your Windows-1252  
encoded Java source to a Mac, for example.
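
For illustration, a file using both statements might start like this  
(a sketch of the proposed syntax only; the exact grammar is one of  
the open points below):

    // Comments may precede the statements.
    source 1.7;
    encoding UTF-8;

    package com.example;

    public class Example {
    }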

Before we finish it though, some open questions we'd like some  
feedback on:

A) Technically, starting a file with "source 1.4" is obviously silly;  
javac v1.4 doesn't know about the source keyword and would thus fail  
immediately. Practically, though, it's still useful. Example: if  
you've mostly converted a GWT project to GWT 1.5 (which uses Java 1.5  
syntax), but have a few files remaining on GWT 1.4 (which uses Java  
1.4 syntax), then tossing a "source 1.4;" into those older files  
eliminates all the generics warnings and serves as a reminder that  
you should still convert them at some point, even though it isn't  
-actually- compatible with a real javac 1.4. We're leaning toward  
making "source 1.6;" (and below) legal even when using a javac v1.7  
or above, but perhaps that's a bridge too far? We could go with magic  
comments instead, but that seems like a very bad solution.
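
Concretely, such a not-yet-converted file might look like this  
(hypothetical example; the class is ours, not from any real project):

    source 1.4;

    import java.util.Vector;

    public class LegacyWidget {
        private final Vector items = new Vector();

        public void add(Object o) {
            // Unchecked warning under source 1.5+, clean under 1.4.
            items.add(o);
        }
    }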

also:

Encoding is rather a hairy issue: javac will need to read the file to  
find the encoding, but to read the file, it needs to know the  
encoding! Fortunately, every single encoding on Wikipedia's list of  
popular character encodings at:

http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings

will encode "encoding own-name-in-that-encoding;" to exactly the same  
bytes as ASCII would, except for KOI-7 and UTF-7 (both 7-bit  
encodings that I doubt anyone ever uses to program Java).
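
A quick way to convince yourself of this ASCII transparency on the  
charsets a given JDK ships (a throwaway check, not part of the  
proposal; the charset list here is ours):

    import java.nio.charset.Charset;
    import java.util.Arrays;

    public class AsciiTransparency {
        public static void main(String[] args) {
            for (String name : new String[] {"US-ASCII", "ISO-8859-1",
                    "UTF-8", "windows-1252", "Shift_JIS", "EUC-JP",
                    "GB2312", "Big5", "KOI8-R"}) {
                if (!Charset.isSupported(name)) continue;
                // Encode the statement in its own charset and compare
                // the bytes to a plain ASCII encoding of it.
                String stmt = "encoding " + name + ";";
                byte[] ascii = stmt.getBytes(Charset.forName("US-ASCII"));
                byte[] self = stmt.getBytes(Charset.forName(name));
                System.out.println(name + ": " + (Arrays.equals(ascii, self)
                        ? "same bytes as ASCII" : "DIFFERS"));
            }
        }
    }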

Therefore, the proposal includes the following strategy to find the  
encoding statement in a java source file without knowing the encoding  
beforehand:

An entirely separate parser (the encoding parser) is run repeatedly  
until the right encoding is found. First it decodes the input as  
ISO-8859-1. If that run finds nothing, it tries UTF-16 (assuming BE  
if there is no BOM, as per the Java standard), then UTF-32 (again BE  
if no BOM), and finally the current behaviour (the -encoding  
parameter's value if any, otherwise the platform default encoding).  
Each run of this separate parser works as follows (a sketch in code  
follows the list):

1. Ignore any comments and whitespace.
2. Ignore the pattern (regexp-like syntax): source\s+[^\s]+\s*; - if  
that pattern matches partially but is not correctly completed, that  
parser run exits immediately without finding an encoding.
3. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern matches  
partially but is not correctly completed, that parser run exits  
immediately without finding an encoding. If it does complete, the  
parser also exits immediately and returns the captured value.
4. If it finds anything else, stop immediately, returning no encoding  
found.
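
A minimal sketch of such an encoding parser, under our assumptions  
(class and method names are ours, and a real javac integration would  
presumably be hand-rolled rather than regex-based):

    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class EncodingSniffer {
        // Whitespace/comments, an optional source statement, more
        // whitespace/comments, then the encoding statement. A partial
        // match simply fails to match, mirroring the rule that a run
        // exits immediately without finding an encoding.
        private static final Pattern PREFIX = Pattern.compile(
              "(?:\\s+|//[^\\n]*|/\\*.*?\\*/)*"
            + "(?:source\\s+\\S+\\s*;)?"
            + "(?:\\s+|//[^\\n]*|/\\*.*?\\*/)*"
            + "encoding\\s+(\\S+)\\s*;", Pattern.DOTALL);

        // One run: returns the declared name, or null if none found.
        static String sniff(byte[] bytes, Charset cs) {
            Matcher m = PREFIX.matcher(new String(bytes, cs));
            return m.lookingAt() ? m.group(1) : null;
        }

        // The repeated runs: ISO-8859-1, then UTF-16 (BE if no BOM),
        // then UTF-32 (BE if no BOM), then the -encoding/platform
        // default charset.
        static String sniffAll(byte[] bytes, Charset fallback) {
            for (String name : new String[] {"ISO-8859-1", "UTF-16", "UTF-32"}) {
                if (!Charset.isSupported(name)) continue; // UTF-32 needs Java 6+
                String found = sniff(bytes, Charset.forName(name));
                if (found != null) return found;
            }
            return sniff(bytes, fallback);
        }
    }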

Once it has found something, the 'real' Java parser runs using the  
found encoding (this overrides any -encoding on the command line).  
Note that the encoding parser stops quickly: for example, if it finds  
a stray \0, or e.g. the letter 'i' (perhaps the first letter of an  
import statement), it stops immediately.

If an encoding statement is found, but not by the standard decoding  
strategy (ISO-8859-1, UTF-16, UTF-32) - that is, it was only readable  
thanks to a platform default or command-line encoding parameter (e.g.  
a platform that defaults to UTF-16LE without a byte order mark) - a  
warning is generated explaining that the encoding statement isn't  
actually doing anything. Of course, if the file isn't really encoded  
in the charset the statement names, you get an error (putting  
"encoding UTF-16;" into a UTF-8 encoded file, for example). If there  
is no encoding statement, the 'real' Java parser does what it does  
now: use the -encoding parameter of javac, and failing that, the  
platform default.
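
A minimal sketch of those warning/error rules (names are ours;  
'foundInStandardPass' means the statement was located by the  
ISO-8859-1/UTF-16/UTF-32 passes rather than the fallback pass):

    import java.nio.charset.Charset;
    import java.util.regex.Pattern;

    public class EncodingValidator {
        private static final Pattern ENC =
                Pattern.compile("encoding\\s+(\\S+)\\s*;");

        static Charset resolve(byte[] src, String declared,
                boolean foundInStandardPass) {
            Charset cs = Charset.forName(declared);
            // Error: the statement must survive being decoded with the
            // charset it names; "encoding UTF-16;" stored as UTF-8
            // bytes, for example, will not.
            if (!ENC.matcher(new String(src, cs)).find())
                throw new IllegalStateException(
                        "file is not actually encoded as " + declared);
            if (!foundInStandardPass)
                System.err.println("warning: encoding statement is only"
                        + " readable via -encoding/platform default and"
                        + " won't help another javac find the encoding");
            return cs; // overrides any -encoding command-line parameter
        }
    }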

However, there is one major and one minor problem with this approach:

B) This means javac will need to read every source file several times  
to compile it.

Worst case (no encoding keyword): 5 times (all four decoding passes  
of the encoding parser, plus the real parse).
Standard case with an encoding keyword: 2 times (3 times if UTF-16).

Fortunately all runs should stop quickly, due to the encoding  
parser's penchant for quitting very early. Any given javac will  
either hold the entire source file in memory anyway, or, if not, the  
disk cache should take care of it; but we can't prove beyond a doubt  
that this repeated parsing will have no significant impact on compile  
time. Is this a showstopper? Is the need to include a new (but small)  
parser in javac a showstopper?

C) Certain character sets, such as ISO-2022, can make the encoding  
statement unreadable by the standard strategy if a comment containing  
non-ASCII characters precedes it. These situations are very rare (in  
fact, I haven't managed to construct an example), so is it okay to  
just ignore this issue? If you add the encoding statement after a  
bunch of comments that make it invisible, and then compile with the  
right -encoding parameter, you WILL get a warning that the encoding  
statement isn't going to help a javac on another platform (or one  
without that -encoding parameter) figure out the encoding; in other  
words, you just get the current status quo: your source file won't  
compile without an explicit -encoding parameter (or unless that  
encoding happens to be the platform default). Should this be  
mentioned in the proposal? Should the compiler (and the proposal) put  
effort into generating a useful warning message, such as figuring out  
whether the statement WOULD parse correctly if it were at the very  
top of the source file, versus suggesting a recode to UTF-8?

and a final dilemma:

D) Should we separate the proposals for source and encoding keywords?  
The source keyword is more useful and a lot simpler overall than the  
encoding keyword, but they do sort of go together.

--Reinier Zwitserloot and Roel Spilker

