PRE-PROPOSAL: Source and Encoding keyword

Sat Mar 7 03:22:27 PST 2009

Instead of magic comments, how about defining these as javadoc tags? That
would provide a way to specify the info for whole packages, as well as
individual classes; it wouldn't require any syntax changes either, so would
be backwards compatible. The downside is that it would require adding
support for javadoc comments at the file level.

Vil.

2009/3/7 Stefan Schulz <schulz at e-spirit.de>

> My first reaction to both issues was to put this information in an
> external file. But then you'd either have to ship additional files
> or lose the information.
> I know magic comments are usually frowned upon, but in this case I
> see an advantage in them (backward compatibility), especially since
> IMO this is no change to the Java language but rather enables inline,
> file-scoped compiler options. For example:
>
> //%encoding ISO 8859-1%//
> //%source 1.4%//
>
> Requiring such statements to be placed before ANY other statement in the
> file would also speed up their detection.
>
> Stefan
>
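For what it's worth, Stefan's magic comments would be cheap to pick up. Below is a minimal sketch, assuming the `//%name value%//` form from his example and assuming the directives must come before everything except blank lines. The class `MagicComments` and its `parse` method are hypothetical names made up for illustration, not anything javac provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MagicComments {
    // Assumed directive syntax, taken from the example in the mail: //%name value%//
    private static final Pattern DIRECTIVE = Pattern.compile("//%(\\w+)\\s+(.+?)%//");

    /** Collects leading directives; stops at the first line that is neither blank nor a directive. */
    static Map<String, String> parse(String source) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String line : source.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            Matcher m = DIRECTIVE.matcher(trimmed);
            if (!m.matches()) break; // anything else ends the directive block
            options.put(m.group(1), m.group(2).trim());
        }
        return options;
    }

    public static void main(String[] args) {
        String src = "//%encoding ISO 8859-1%//\n//%source 1.4%//\npackage demo;\nclass A {}\n";
        System.out.println(parse(src));
    }
}
```

Stopping at the first non-directive line is what makes the "before ANY other statement" rule cheap: the scan never looks past the file header.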
> Reinier Zwitserloot schrieb:
> > We have written up a proposal for adding a 'source' and 'encoding'
> > keyword (alternatives to the -source and -encoding options on the
> > command line; they work pretty much just as you expect). The keywords
> > are context sensitive and must both appear before anything else other
> > than comments to be parsed. In case the benefit isn't obvious: it is a
> > great help when you are trying to port a big project to a new source
> > compatibility level. Leaving half your source base in v1.6 and the
> > other half in v1.7 is pretty much impossible today; it's
> > all-or-nothing. It should also be a much nicer solution to the 'assert
> > in v1.4' dilemma, which I guess is going to happen to v1.7 as well,
> > given that 'module' is most likely going to become a keyword. Finally,
> > it makes java files a lot more portable; you no longer run into your
> > strings looking weird when you move a Windows-1252 encoded java source
> > file to a mac, for example.
> >
> > Before we finish it though, some open questions we'd like some
> > feedback on:
> >
> > A) Technically, starting a file with "source 1.4" is obviously silly;
> > javac v1.4 doesn't know about the source keyword and would thus fail
> > immediately. However, practically, it's still useful. Example: if
> > you've mostly converted a GWT project to GWT 1.5 (which uses java 1.5
> > syntax), but have a few files remaining on GWT v1.4 (which uses java
> > 1.4 syntax), then tossing a "source 1.4;" in those older files
> > eliminates all the generics warnings and serves as a reminder that you
> > should still convert those at some point. However, it isn't -actually-
> > compatible with a real javac 1.4. We're leaning towards making "source
> > 1.6;" (and below) legal even when using a javac v1.7 or above, but
> > perhaps that's a bridge too far? We could go with magic comments but
> > that seems like a very bad solution.
> >
> > also:
> >
> > Encoding is rather a hairy issue; javac will need to read the file to
> > find the encoding, but to read a file, it needs to know about
> > encoding! Fortunately, *every single* encoding on Wikipedia's list of
> > popular encodings at:
> >
> >
> > http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings
> >
> > will encode "encoding own-name-in-that-encoding;" the same as ASCII
> > would, except for KOI-7 and UTF-7 (both 7-bit encodings that I doubt
> > anyone ever uses to program java).
> >
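The ASCII-compatibility claim is easy to spot-check. This is a rough sketch, not part of the proposal: `AsciiCompatCheck` is a made-up name, and the charset list is just a sample that skips anything the running JVM doesn't support. UTF-16BE is included on purpose as a counter-case, since the UTF-16 and UTF-32 families are not byte-compatible with ASCII, which is presumably why the strategy further down tries them separately.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiCompatCheck {
    /** True if the charset encodes "encoding <its own name>;" to the same bytes as US-ASCII. */
    static boolean encodesOwnStatementAsAscii(Charset cs) {
        String statement = "encoding " + cs.name() + ";";
        byte[] ascii = statement.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(ascii, statement.getBytes(cs));
    }

    public static void main(String[] args) {
        // Sample list only; charsets beyond the standard ones vary by runtime.
        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8", "windows-1252", "Shift_JIS", "UTF-16BE"};
        for (String name : names) {
            if (Charset.isSupported(name)) {
                Charset cs = Charset.forName(name);
                System.out.println(cs.name() + " -> " + encodesOwnStatementAsAscii(cs));
            }
        }
    }
}
```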
> > Therefore, the proposal includes the following strategy to find the
> > encoding statement in a java source file without knowing the encoding
> > beforehand:
> >
> > An entirely separate parser (the encoding parser) is run repeatedly
> > until the right encoding is found. First it'll decode the input as
> > ISO-8859-1. If that doesn't work, it tries UTF-16 (assume BE if no BOM,
> > as per the java standard), then UTF-32 (BE if no BOM), then the
> > current behaviour (-encoding parameter's value if any, otherwise
> > platform default encoding). This separate parser works as follows:
> >
> > 1. Ignore any comments and whitespace.
> > 2. Ignore the pattern (regexp-like syntax): source\s+[^\s]+\s*; - if
> > that pattern matches partially but is not correctly completed, that
> > parser run exits immediately without finding an encoding.
> > 3. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern matches
> > partially but is not correctly completed, that parser run exits
> > immediately without finding an encoding. If it does complete, the
> > parser also exits immediately and returns the captured value.
> > 4. If it finds anything else, stop immediately, returning no encoding
> > found.
> >
> > Once it's found something, the 'real' java parser will run using the
> > found encoding (this overrides any -encoding on the command line).
> > Note that the encoding parser stops quickly; for example, if it finds
> > a stray \0 or e.g. the letter 'i' (perhaps the first letter of an
> > import statement), it'll stop immediately.
> >
> > If the encoding statement was not found by the standard decoding
> > strategy (ISO-8859-1, UTF-16, UTF-32) but only thanks to the platform
> > default or the command line -encoding parameter (e.g. on a platform
> > that defaults to UTF-16LE without a byte order mark), a warning is
> > generated explaining that the encoding statement isn't doing anything. Of
> > course, if the encoding doesn't match itself, you get an error
> > (putting "encoding UTF-16;" into a UTF-8 encoded file for example). If
> > there is no encoding statement, the 'real' java parser does what it
> > does now: Use the -encoding parameter of javac, and if that wasn't
> > present, the platform default.
> >
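The detection strategy described above can be sketched roughly as follows. This is only an illustration under a few assumptions, not the proposal's reference implementation: `EncodingSniffer` and its methods are hypothetical names, the whole file is decoded on each run for brevity, the captured name uses `[^\s;]+` rather than `[^\s]+` so the semicolon never lands in the capture, and the warning for the fallback case is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSniffer {
    // Whitespace and comments are skipped (possessive, so no backtracking into them).
    private static final String SKIP = "(?:\\s++|//[^\\n]*+|/\\*.*?\\*/)*+";
    // An optional source statement, then the encoding statement with its captured name.
    private static final Pattern HEADER = Pattern.compile(
            SKIP + "(?:source\\s+[^\\s;]+\\s*;" + SKIP + ")?"
                 + "encoding\\s+([^\\s;]+)\\s*;",
            Pattern.DOTALL);

    /** One encoding-parser run: decode with a candidate charset and inspect only the file header. */
    static Optional<String> findDeclared(byte[] bytes, Charset candidate) {
        Matcher m = HEADER.matcher(new String(bytes, candidate));
        // lookingAt() anchors at the start, so anything else coming first means "not found".
        return m.lookingAt() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Tries ISO-8859-1, UTF-16, UTF-32, then the fallback (-encoding value or platform default). */
    static Optional<String> sniff(byte[] bytes, Charset fallback) {
        for (String name : new String[] {"ISO-8859-1", "UTF-16", "UTF-32"}) {
            if (!Charset.isSupported(name)) continue; // UTF-32 is not guaranteed on every JVM
            Optional<String> found = findDeclared(bytes, Charset.forName(name));
            if (found.isPresent()) return found;
        }
        return findDeclared(bytes, fallback);
    }

    /** True if decoding with the declared charset finds the very same statement again. */
    static boolean matchesItself(byte[] bytes, String declared) {
        try {
            return Charset.isSupported(declared)
                    && findDeclared(bytes, Charset.forName(declared))
                           .filter(declared::equals).isPresent();
        } catch (IllegalArgumentException badName) { // illegal charset name syntax
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ok = "// header\nsource 1.6;\nencoding UTF-8;\nclass A {}\n"
                .getBytes(StandardCharsets.UTF_8);
        byte[] wrong = "encoding UTF-16;\nclass A {}\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(ok, Charset.defaultCharset()));
        System.out.println(matchesItself(ok, "UTF-8"));
        System.out.println(matchesItself(wrong, "UTF-16"));
    }
}
```

The `matchesItself` check corresponds to the "encoding UTF-16;" in a UTF-8 file case: the statement is found via ISO-8859-1, but decoding the same bytes as UTF-16 no longer yields it, so a compiler following this scheme would report an error.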
> > However, there is one major and one minor problem with this approach:
> >
> > B) This means javac will need to read every source file many times to
> > compile it.
> >
> > Worst case (no encoding keyword): 5 times.
> > Standard case with an encoding keyword: 2 times (3 times if UTF-16).
> >
> > Fortunately all runs should stop quickly, due to the encoding parser's
> > penchant for quitting very early. Javacs out there will either stuff the
> > entire source file into memory, or if not, disk cache should take care
> > of it, but we can't prove beyond a doubt that this repeated parsing
> > will have no significant impact on compile time. Is this a
> > showstopper? Is the need to include a new (but small) parser into
> > javac a showstopper?
> >
> > C) Certain character sets, such as ISO-2022, can make the encoding
> > statement unreadable with the standard strategy if a comment including
> > non-ASCII characters precedes the encoding statement. These situations
> > are very rare (in fact, I haven't managed to find an example), so is
> > it okay to just ignore this issue? If you add the encoding statement
> > after a bunch of comments that make it invisible, and then compile it
> > with the right -encoding parameter, you WILL get a warning that the
> > encoding statement isn't going to help a javac on another platform /
> > without that encoding parameter to figure it out, so you just get the
> > current status quo: your source file won't compile without an explicit
> > -encoding parameter (or if that happens to be the platform default).
> > Should this be mentioned in the proposal? Should the compiler (and the
> > proposal) put effort into generating a useful warning message, such as
> > figuring out if it WOULD parse correctly if the encoding statement is
> > at the very top of the source file, vs. suggesting to recode in UTF-8?
> >
> > and a final dilemma:
> >
> > D) Should we separate the proposals for source and encoding keywords?
> > The source keyword is more useful and a lot simpler overall than the
> > encoding keyword, but they do sort of go together.
> >
> > --Reinier Zwitserloot and Roel Spilker
> >
> >
>
>