PRE-PROPOSAL: Source and Encoding keyword

Fri Mar 6 23:39:57 PST 2009

Igor,

how could the command line options be expanded? Allow -encoding to  
specify a separate encoding for each file? I don't see how that can  
work. There's no way I or anyone else is going to edit a build script  
(be it just javac, a home-rolled thing, ant, rake, make, maven, ivy,  
etcetera) to carefully enumerate every file's source compatibility  
level. Changing the command line options also incurs the necessary
wrath of all those build tool developers as they'd have to update  
their software to handle the new option (adding an option is a change  
too!)

Could you also elaborate on why you don't like it? For example, how  
can the benefits of having (more) portable source files, easier  
migration, and a much cleaner solution to e.g. the assert-in-javac1.4  
be achieved with e.g. command line options, or do you not consider any  
of those worthwhile?

As an aside, how do people approach Project Coin submissions? I tend
to look at a proposal's value, which is its benefit divided by the  
disadvantages (end-programmer complexity to learn, amount of changes  
needed to javac and/or JVM, and restrictions on potential future  
expansions). One of the reasons I'm writing this up with Roel is
because the disadvantages seemed to be almost nonexistent at the
outset (the encoding stuff made it more complicated, but at least the
complication is entirely hidden from java developers' eyes, so its
value proposition is still aces in my book). If there's a goal to keep
the total language changes, no matter how simple they are, down to a  
small set, then benefit regardless of disadvantages is the better  
yardstick.

  --Reinier Zwitserloot

On Mar 7, 2009, at 08:15, Igor Karp wrote:

> On Fri, Mar 6, 2009 at 10:03 PM, Reinier Zwitserloot
> <reinier at zwitserloot.com> wrote:
>> We have written up a proposal for adding a 'source' and 'encoding'
>> keyword (alternatives to the -source and -encoding options on the
>> command line; they work pretty much just as you expect). The keywords
>> are context sensitive and must both appear before anything else other
>> than comments to be parsed. In case the benefit isn't obvious: It is a
>> great help when you are trying to port a big project to a new source
>> language compatibility level. Leaving half your sourcebase in v1.6 and
>> the other half in v1.7 is pretty much impossible today; it's all-or-
>> nothing. It should also be a much nicer solution to the 'assert in
>> v1.4' dilemma, which I guess is going to happen to v1.7 as well, given
>> that 'module' is most likely going to become a keyword. Finally, it
>> makes java files a lot more portable; you no longer run into your
>> strings looking weird when you move your Windows-1252 encoded java
>> source file to a mac, for example.
>>
>> Before we finish it though, some open questions we'd like some
>> feedback on:
>>
>> A) Technically, starting a file with "source 1.4" is obviously silly;
>> javac v1.4 doesn't know about the source keyword and would thus fail
>> immediately. However, practically, it's still useful. Example: if
>> you've mostly converted a GWT project to GWT 1.5 (which uses java 1.5
>> syntax), but have a few files remaining on GWT v1.4 (which uses java
>> 1.4 syntax), then tossing a "source 1.4;" in those older files
>> eliminates all the generics warnings and serves as a reminder that you
>> should still convert those at some point. However, it isn't -actually-
>> compatible with a real javac 1.4. We're leaning toward making "source
>> 1.6;" (and below) legal even when using a javac v1.7 or above, but
>> perhaps that's a bridge too far? We could go with magic comments but
>> that seems like a very bad solution.
>>
>> also:
>>
>> Encoding is rather a hairy issue; javac will need to read the file to
>> find the encoding, but to read a file, it needs to know about
>> encoding! Fortunately, *every single* popular encoding on wikipedia's
>> popular encoding list at:
>>
>> http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings
>>
>> will encode "encoding own-name-in-that-encoding;" the same as ASCII
>> would, except for KOI-7 and UTF-7, (both 7 bit encodings that I doubt
>> anyone ever uses to program java).
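
The claim above can be spot-checked mechanically. What follows is only a rough sketch, not part of the proposal: the class and method names (AsciiCompatCheck, statementLooksLikeAscii) are made up for illustration, and it only tries charsets the running JVM happens to support. It encodes "encoding <name>;" in the named charset and compares the bytes with the US-ASCII encoding of the same string. UTF-16BE is included as a contrast case, since the proposal handles UTF-16 and UTF-32 with separate decoding attempts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiCompatCheck {
    /** True if "encoding <name>;" has the same bytes in the named charset as in US-ASCII. */
    static boolean statementLooksLikeAscii(String charsetName) {
        String statement = "encoding " + charsetName + ";";
        byte[] ascii = statement.getBytes(StandardCharsets.US_ASCII);
        byte[] actual = statement.getBytes(Charset.forName(charsetName));
        return Arrays.equals(ascii, actual);
    }

    public static void main(String[] args) {
        // The list of names is an arbitrary sample, not the Wikipedia list.
        String[] names = {"ISO-8859-1", "UTF-8", "windows-1252", "UTF-16BE"};
        for (String name : names) {
            // Skip charsets this JVM does not ship, rather than fail.
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + statementLooksLikeAscii(name));
            }
        }
    }
}
```
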
>>
>> Therefore, the proposal includes the following strategy to find the
>> encoding statement in a java source file without knowing the encoding
>> beforehand:
>>
>> An entirely separate parser (the encoding parser) is run repeatedly
>> until the right encoding is found. First it'll decode the input with
>> ISO-8859-1. If that doesn't work, UTF-16 (assume BE if no BOM, as per
>> the java standard), then as UTF-32 (BE if no BOM), then the current
>> behaviour (-encoding parameter's value if any, otherwise platform
>> default encoding). This separate parser works as follows:
>>
>> 1. Ignore any comments and whitespace.
>> 3. Ignore the pattern (regexp-like syntax): source\s+[^\s]+\s*; - if
>> that pattern matches partially but is not correctly completed, that
>> parser run exits without finding an encoding, immediately.
>> 4. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern matches
>> partially but is not correctly completed, that parser run exits
>> without finding an encoding, immediately. If it does complete, the
>> parser also exits immediately and returns the captured value.
>> 5. If it finds anything else, stop immediately, returning no encoding
>> found.
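
The steps above can be sketched in a few lines of Java. This is a minimal illustration under assumptions, not the proposal's implementation: the class name EncodingStatementScanner and its methods are hypothetical, the input is assumed to be already decoded into a String by one of the trial encodings, and only // and /* */ comments are recognised. The two regular expressions are the ones quoted in the steps.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingStatementScanner {
    // The two patterns quoted in the steps above.
    private static final Pattern SOURCE = Pattern.compile("source\\s+[^\\s]+\\s*;");
    private static final Pattern ENCODING = Pattern.compile("encoding\\s+([^\\s]+)\\s*;");

    /** Returns the declared encoding name, or null if no encoding statement is found. */
    static String find(String text) {
        int pos = skipCommentsAndWhitespace(text, 0);
        if (pos < 0) return null;
        // An optional, complete source statement is skipped.
        Matcher source = SOURCE.matcher(text).region(pos, text.length());
        if (source.lookingAt()) {
            pos = skipCommentsAndWhitespace(text, source.end());
            if (pos < 0) return null;
        }
        // Anything other than a complete encoding statement (including a
        // partially matching source or encoding statement) yields null.
        Matcher encoding = ENCODING.matcher(text).region(pos, text.length());
        return encoding.lookingAt() ? encoding.group(1) : null;
    }

    /** Returns the index of the first significant character, or -1 on an unterminated comment. */
    static int skipCommentsAndWhitespace(String text, int pos) {
        int n = text.length();
        while (pos < n) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
            } else if (text.startsWith("//", pos)) {
                int eol = text.indexOf('\n', pos);
                pos = (eol < 0) ? n : eol + 1;
            } else if (text.startsWith("/*", pos)) {
                int close = text.indexOf("*/", pos + 2);
                if (close < 0) return -1;
                pos = close + 2;
            } else {
                break;
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(find("/* header */\nsource 1.7;\nencoding UTF-8;\nimport java.util.List;"));
        System.out.println(find("import java.util.List;\nencoding UTF-8;"));
    }
}
```

One property of the quoted patterns shows up in this sketch: [^\s]+ is greedy, so the captured name is only clean when the statement's semicolon is followed by whitespace or the end of the input.
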
>>
>> Once it's found something, the 'real' java parser will run using the
>> found encoding (this overrides any -encoding on the command line).
>> Note that the encoding parser stops quickly; for example, if it finds
>> a stray \0 or e.g. the letter 'i' (perhaps the first letter of an
>> import statement), it'll stop immediately.
>>
>> If an encoding is encountered that was not found during the standard
>> decoding strategy (ISO-8859-1, UTF-16, UTF-32), but worked only due to
>> a platform default/command line encoding param (e.g. a platform that
>> defaults to UTF-16LE without a byte order mark), a warning explaining
>> that the encoding statement isn't doing anything is generated. Of
>> course, if the encoding doesn't match itself, you get an error
>> (putting "encoding UTF-16;" into a UTF-8 encoded file for example). If
>> there is no encoding statement, the 'real' java parser does what it
>> does now: Use the -encoding parameter of javac, and if that wasn't
>> present, the platform default.
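
To make the trial order and the two diagnostics concrete, here is a hedged Java sketch. Everything named in it (EncodingDetector, Result, detect, SIMPLE_FINDER) is hypothetical. The encoding parser is passed in as a function, and SIMPLE_FINDER is only a stand-in that does not skip comments or a source statement. The self-consistency check is one possible reading of "the encoding doesn't match itself": the file is decoded again with the declared charset and must yield the same statement.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingDetector {
    /** Outcome of detection; charset is null when no encoding statement was found. */
    static final class Result {
        final Charset charset;
        final boolean foundOnlyViaFallback; // true: the statement deserves a warning
        Result(Charset charset, boolean foundOnlyViaFallback) {
            this.charset = charset;
            this.foundOnlyViaFallback = foundOnlyViaFallback;
        }
    }

    /** Simplified stand-in for the encoding parser: leading whitespace only. */
    static final Function<String, String> SIMPLE_FINDER = text -> {
        Matcher m = Pattern.compile("\\s*encoding\\s+([^\\s]+)\\s*;").matcher(text);
        return m.lookingAt() ? m.group(1) : null;
    };

    static Result detect(byte[] bytes, Charset fallback, Function<String, String> finder) {
        List<Charset> trials = new ArrayList<>();
        trials.add(StandardCharsets.ISO_8859_1);
        trials.add(StandardCharsets.UTF_16); // big-endian unless a BOM says otherwise
        if (Charset.isSupported("UTF-32")) trials.add(Charset.forName("UTF-32"));
        int standardTrials = trials.size();
        trials.add(fallback); // -encoding value or platform default

        for (int i = 0; i < trials.size(); i++) {
            String name = finder.apply(new String(bytes, trials.get(i)));
            if (name == null) continue;
            Charset declared;
            try {
                declared = Charset.forName(name);
            } catch (IllegalArgumentException e) { // illegal or unsupported name
                throw new IllegalStateException("unknown encoding: " + name, e);
            }
            // Self-consistency: the file must still declare the same
            // encoding when decoded with the encoding it declares.
            String again = finder.apply(new String(bytes, declared));
            if (again == null || !again.equalsIgnoreCase(name)) {
                throw new IllegalStateException("file is not encoded as " + name);
            }
            return new Result(declared, i >= standardTrials);
        }
        return new Result(null, false); // caller uses -encoding or the platform default
    }
}
```

A caller would emit the "encoding statement isn't doing anything" warning when foundOnlyViaFallback is true, and fall back to the current javac behaviour when charset is null.
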
>>
>> However, there is 1 major and 1 minor problem with this approach:
>>
>> B) This means javac will need to read every source file many times to
>> compile it.
>>
>> Worst case (no encoding keyword): 5 times.
>> Standard case if an encoding keyword: 2 times (3 times if UTF-16).
>>
>> Fortunately all runs should stop quickly, due to the encoding parser's
>> penchant to quit very early. Javacs out there will either stuff the
>> entire source file into memory, or if not, disk cache should take care
>> of it, but we can't prove beyond a doubt that this repeated parsing
>> will have no significant impact on compile time. Is this a
>> showstopper? Is the need to include a new (but small) parser into
>> javac a showstopper?
>>
>> C) Certain character sets, such as ISO-2022, can make the encoding
>> statement unreadable with the standard strategy if a comment including
>> non-ASCII characters precedes the encoding statement. These situations
>> are very rare (in fact, I haven't managed to find an example), so is
>> it okay to just ignore this issue? If you add the encoding statement
>> after a bunch of comments that make it invisible, and then compile it
>> with the right -encoding parameter, you WILL get a warning that the
>> encoding statement isn't going to help a javac on another platform /
>> without that encoding parameter to figure it out, so you just get the
>> current status quo: your source file won't compile without an explicit
>> -encoding parameter (unless that happens to be the platform default).
>> Should this be mentioned in the proposal? Should the compiler (and the
>> proposal) put effort into generating a useful warning message, such as
>> figuring out if it WOULD parse correctly if the encoding statement is
>> at the very top of the source file, vs. suggesting to recode in UTF-8?
>>
>> and a final dilemma:
>>
>> D) Should we separate the proposals for source and encoding keywords?
>> The source keyword is more useful and a lot simpler overall than the
>> encoding keyword, but they do sort of go together.
> Separate. Another reason is: the argument of applying different
> settings to different parts of the project is much less valid with
> encoding than with source.
>
>>
>> --Reinier Zwitserloot and Roel Spilker
>>
>>
> Overall: I would prefer command line options enhanced to handle the
> situation rather than language change.
>
> Igor Karp