PRE-PROPOSAL: Source and Encoding keyword

Sat Mar 7 12:13:34 PST 2009

Replies to Neal Gafter, Jeremy Manson, Igor Karp, Vilya Harvey, and  
Stefan Schultz.

Jeremy Manson:

We don't discuss API compatibility because that's not what source
does. That's what jigsaw might help with at some point. I don't really
understand your complaint - it applies today as well - the "-source"
parameter isn't a new invention, it's been part of javac for years.
-source won't even set the class file format properly, that's what
-target does. The "-source" parameter just doesn't guarantee producing
a class file that works on old systems, and it also doesn't guarantee
that the source file will compile on old systems. It's never done
that. This proposal does not aim to solve that problem. After all,
this is project coin. Not project Franklin, as Neal is so apt to
remind us.

Igor, and to a lesser extent Stefan:

I get the feeling we're not understanding each other. I have already
dismantled just about every argument you replied with.

An 'argfile' is obviously a 'build script'. See my previous post about  
why moving file-level meta-data into a build script is utterly  
unacceptable and doesn't have nearly the same advantages.

A major point you seem to ignore completely:

Meta-data (at least the source and encoding keyword) is part of  
*EVERY* java file already. If you give me a UTF-16 encoded file, that  
fact is part of that file, period. I can't compile it without knowing.  
Same with source compatibility. Making it an explicit part of the file  
(with our proposal) is therefore only good news.
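
To make that concrete, here is a minimal sketch (the class and method names are made up for illustration, they are not part of the proposal) showing that the same bytes on disk turn into different program text depending on which charset the reader assumes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingAmbiguity {
    // Decode raw file bytes with a caller-supplied charset,
    // which is what any compiler has to do before it can lex anything.
    static String decode(byte[] fileBytes, Charset charset) {
        return new String(fileBytes, charset);
    }

    public static void main(String[] args) {
        // The accented letter is written as a Unicode escape so that
        // this example file itself stays pure ASCII.
        String source = "String s = \"caf\u00e9\";";
        byte[] onDisk = source.getBytes(StandardCharsets.UTF_8);

        // Right guess: the text round-trips exactly.
        System.out.println(decode(onDisk, StandardCharsets.UTF_8).equals(source));

        // Wrong guess: the two UTF-8 bytes of the accented letter are read
        // as two separate ISO-8859-1 characters, so the literal changes.
        System.out.println(decode(onDisk, StandardCharsets.ISO_8859_1).equals(source));
    }
}
```

Nothing in the bytes themselves says which of the two readings is the intended one; that information has to come from somewhere.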

To illustrate the problem I have with your argument: I could argue
with equal fervour that import statements ought to be part of a  
compiler switch or a separate meta-data file using your arguments.  
Which is clearly a bad idea.

Stefan:

Magic comments are a very bad idea: first of all they are obviously
a kludge, but they also set a bad precedent: All java parsers out there
now need to parse comments instead of skipping them, which causes very  
big differences in tokenizer architecture. The tokenizer is also not  
allowed to quit with an error when it can't parse a magic comment,  
because there's no telling if it might be some future magic comment  
syntax that was also introduced to avoid backward incompatibilities.  
It's technically not backwards compatible, and finally: it's ugly.  
Note that the proposal has already stated that encoding and source  
keywords must precede everything else except comments. You can't  
really demand that these precede all comments, because many tools add  
a comment at the top of every file (e.g. a licence).

Vilya:

javac doesn't parse javadoc tags. It used to go in just to find any  
@deprecated, but java is moving away from that via the @Deprecated  
annotation. If anything, these could be annotations on class/interface  
declarations, but not everything in a java file IS a member. By that  
time we've passed the package statement, import statements,
module-info and package-info directives, and other annotations on the
first member (if any) in the source file. It's not flexible enough.

We've also just gone, on this mailing list, through a big argument  
where many contributors jumped on the notion of an annotation changing  
the way the compiler works (the ARM discussion). Clearly the tone has  
been set: Don't use annotations (and I think javadoc tags fall under  
the same distinction) to fundamentally change how the compiler works.

Neal:

I don't think I understand. All javacs have shipped with the -source  
parameter, and undoubtedly javac7 will be no exception. Adding the  
'source' keyword doesn't mean the JLS all of a sudden needs to embody  
every version of java that's ever existed. The existence of the
-source parameter doesn't create that requirement either, and there's
no X in front of source, so the -source parameter is part of the java  
specification. There is going to be no rule in the JLS that states  
that a compatible compiler -must- support an enumerated list of  
versions. It just says: This is an alternative to the -source  
parameter. That's it. That's all. That's a tiny change. I just don't  
understand why you think this proposal requires a massive change to  
the JLS.

Even if a single javac invocation hits multiple files that have  
different source keywords in them, I also see no issue. Javac  
separately stubs out and compiles each file in the end; the only  
change between compiling every file one-by-one and compiling many at  
once is for co-dependencies.

The groovy/javac mixed compiler, which creates java stubs for groovy  
files first so a mixed groovy/java source base can be compiled even  
with many co-dependencies, proves this is an easy problem to solve.

Even if that is a bridge too far for project coin, the compiler always  
has an alternative available: Just Quit. If the compiler can't make a  
mixed source tree work in one go, just quit. With the appropriate  
error. This is still an improvement over a vomitous stream of warnings  
and errors. Tools will be built that use the stubbing approach to get  
the job done. Javac itself can integrate one of them in a minor point  
release.

  --Reinier Zwitserloot

On Mar 7, 2009, at 18:27, Igor Karp wrote:

> Reiner,
>
> please see the comments inline.
>
> On Fri, Mar 6, 2009 at 11:39 PM, Reinier Zwitserloot
> <reinier at zwitserloot.com> wrote:
>> Igor,
>>
>> how could the command line options be expanded? Allow -encoding to  
>> specify a
>> separate encoding for each file? I don't see how that can work.
> For example: allow multiple -encoding options and add optional path to
> encoding -encoding <encoding>[,<path>]
> Where path can be either a package (settings applied to the package
> and every package under it) or a single file for maximum precision.
> So one can have:
> -encoding X -encoding Y,a.b -encoding Z,a.b.c -encoding
> X,a.b.c.d.IAmSpecial
> IAmSpecial.java will get encoding X,
> everything else under a.b.c will get encoding Z,
> everything else under a.b will get encoding Y
> and the rest will get encoding X.
> Same approach can be applied to -source.
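
For reference, the resolution rule described here amounts to a longest-prefix match on the dotted path. The sketch below only illustrates that rule. javac has no `-encoding <encoding>,<path>` syntax today, and the class name, the method name, and the use of an empty string as the key for the unscoped default are all assumptions made for the example:

```java
import java.util.HashMap;
import java.util.Map;

class ScopedOption {
    // Returns the value registered for the longest path that is either equal
    // to qualifiedName or a dotted prefix of it. The empty path "" stands for
    // the option given without a path (hypothetical convention).
    static String resolve(Map<String, String> byPath, String qualifiedName) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : byPath.entrySet()) {
            String path = e.getKey();
            boolean matches = path.isEmpty()
                    || qualifiedName.equals(path)
                    || qualifiedName.startsWith(path + ".");
            if (matches && path.length() > bestLen) {
                best = e.getValue();
                bestLen = path.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Mirrors: -encoding X -encoding Y,a.b -encoding Z,a.b.c
        //          -encoding X,a.b.c.d.IAmSpecial
        Map<String, String> encodings = new HashMap<>();
        encodings.put("", "X");
        encodings.put("a.b", "Y");
        encodings.put("a.b.c", "Z");
        encodings.put("a.b.c.d.IAmSpecial", "X");

        System.out.println(resolve(encodings, "a.b.c.d.IAmSpecial")); // X
        System.out.println(resolve(encodings, "a.b.c.Other"));        // Z
        System.out.println(resolve(encodings, "a.b.Other"));          // Y
        System.out.println(resolve(encodings, "z.Main"));             // X
    }
}
```

Note that the mapping lives entirely outside the source files, which is exactly the property the two sides of this thread disagree about.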
>
>> There's no
>> way I or anyone else is going to edit a build script (be it just  
>> javac, a
>> home-rolled thing, ant, rake, make, maven, ivy, etcetera) to  
>> carefully
>> enumerate every file's source compatibility level.
> Sure, that's what argfiles are for: store the settings in a file and
> use javac @argfile.
>
> And doing it as proposed above on a package level would make it more  
> manageable.
> Remember in your proposal the only option is to specify it on a file
> level (this is fixable i guess).
>
>> Changing the command line
>> options also incurs the necessary wrath of all those build tool
>> developers
>> as they'd have to update their software to handle the new option  
>> (adding an
>> option is a change too!)
> Not more than changing the language itself.
>
>>
>> Could you also elaborate on why you don't like it? For example, how  
>> can the
>> benefits of having (more) portable source files, easier migration,  
>> and a
>> much cleaner solution to e.g. the assert-in-javac1.4 be achieved  
>> with e.g.
>> command line options, or do you not consider any of those worthwhile?
> I fully support the goal. I even see it as a bit too narrow (see
> below). But I do not see a need to change the language to achieve that
> goal.
>
> On a conceptual level I see these options as a metadata of the source
> files and I don't like the idea of coupling it with the file.
> One can avoid all this complexity of extra parsing by specifying the
> encoding in an external file. This external file does not have
> itself to be in that encoding. In fact it can be restricted to be
> always in ASCII.
>
> I think the addition of an optional path and allowing multiple use of
> the same option approach is much more scalable: it could be extended
> to the other existing options (like -deprecation, -Xlint, etc.) and to
> the options that might appear in the future.
>
> I wish I could concentrate on deprecations in a certain package and
> ignore them everywhere else for now:
> javac -deprecation,really.rusty.one ...
> Finished with (or gave up on ;) that one and want to switch to the  
> next one:
> javac -deprecation,another.old.one
>
> Igor Karp
>
>>
>> As an aside, how do people approach project coin submissions? I  
>> tend to look
>> at a proposal's value, which is its benefit divided by the  
>> disadvantages
>> (end-programmer complexity to learn, amount of changes needed to  
>> javac
>> and/or JVM, and restrictions on potential future expansions). One  
>> of the
>> reasons I'm writing this up with Roel is because the disadvantages  
>> seemed to
>> be almost nonexistent on the outset (the encoding stuff made it more
>> complicated, but at least the complication is entirely hidden from  
>> java
>> developer's eyes, so its value proposal is still aces in my book).
>> If there's
>> a goal to keep the total language changes, no matter how simple  
>> they are,
>> down to a small set, then benefit regardless of disadvantages is  
>> the better
>> yardstick.
>>
>>  --Reinier Zwitserloot
>>
>>
>>
>> On Mar 7, 2009, at 08:15, Igor Karp wrote:
>>
>>> On Fri, Mar 6, 2009 at 10:03 PM, Reinier Zwitserloot
>>> <reinier at zwitserloot.com> wrote:
>>>>
>>>> We have written up a proposal for adding a 'source' and 'encoding'
>>>> keyword (alternatives to the -source and -encoding options on the
>>>> command line; they work pretty much just as you expect). The  
>>>> keywords
>>>> are context sensitive and must both appear before anything else  
>>>> other
>>>> than comments to be parsed. In case the benefit isn't obvious: It  
>>>> is a
>>>> great help when you are trying to port a big project to a new  
>>>> source
>>>> language compatibility. Leaving half your sourcebase in v1.6 and  
>>>> the
>>>> other half in v1.7 is pretty much impossible today, it's all-or-
>>>> nothing. It should also be a much nicer solution to the 'assert in
>>>> v1.4' dilemma, which I guess is going to happen to v1.7 as well,  
>>>> given
>>>> that 'module' is most likely going to become a keyword. Finally, it
>>>> makes java files a lot more portable; you no longer run into your
>>>> strings looking weird when you move your Windows-1252 codefile java
>>>> source to a mac, for example.
>>>>
>>>> Before we finish it though, some open questions we'd like some
>>>> feedback on:
>>>>
>>>> A) Technically, starting a file with "source 1.4" is obviously  
>>>> silly;
>>>> javac v1.4 doesn't know about the source keyword and would thus  
>>>> fail
>>>> immediately. However, practically, it's still useful. Example: if
>>>> you've mostly converted a GWT project to GWT 1.5 (which uses java  
>>>> 1.5
>>>> syntax), but have a few files remaining on GWT v1.4 (which uses  
>>>> java
>>>> 1.4 syntax), then tossing a "source 1.4;" in those older files
>>>> eliminates all the generics warnings and serves as a reminder  
>>>> that you
>>>> should still convert those at some point. However, it isn't
>>>> -actually- compatible with a real javac 1.4. We're leaning to
>>>> making "source
>>>> 1.6;"  (and below) legal even when using a javac v1.7 or above, but
>>>> perhaps that's a bridge too far? We could go with magic comments  
>>>> but
>>>> that seems like a very bad solution.
>>>>
>>>> also:
>>>>
>>>> Encoding is rather a hairy issue; javac will need to read the  
>>>> file to
>>>> find the encoding, but to read a file, it needs to know about
>>>> encoding! Fortunately, *every single* popular encoding on  
>>>> wikipedia's
>>>> popular encoding list at:
>>>>
>>>>
>>>> http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings
>>>>
>>>> will encode "encoding own-name-in-that-encoding;" the same as ASCII
>>>> would, except for KOI-7 and UTF-7, (both 7 bit encodings that I  
>>>> doubt
>>>> anyone ever uses to program java).
>>>>
>>>> Therefore, the proposal includes the following strategy to find the
>>>> encoding statement in a java source file without knowing the  
>>>> encoding
>>>> beforehand:
>>>>
>>>> An entirely separate parser (the encoding parser) is run repeatedly
>>>> until the right encoding is found. First it'll decode the input  
>>>> with
>>>> ISO-8859-1. If that doesn't work, UTF-16 (assume BE if no BOM, as  
>>>> per
>>>> the java standard), then as UTF-32 (BE if no BOM), then the current
>>>> behaviour (-encoding parameter's value if any, otherwise platform
>>>> default encoding). This separate parser works as follows:
>>>>
>>>> 1. Ignore any comments and whitespace.
>>>> 3. Ignore the pattern (regexp-like syntax): source\s+[^\s]+\s*;
>>>> - if that pattern matches partially but is not correctly completed,
>>>> that parser run exits without finding an encoding, immediately.
>>>> 4. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern
>>>> matches partially but is not correctly completed, that parser run
>>>> exits without finding an encoding, immediately. If it does
>>>> complete, the parser also exits immediately and returns the
>>>> captured value.
>>>> 5. If it finds anything else, stop immediately, returning no  
>>>> encoding
>>>> found.
>>>>
>>>> Once it's found something, the 'real' java parser will run using  
>>>> the
>>>> found encoding (this overrides any -encoding on the command line).
>>>> Note that the encoding parser stops quickly; For example, if it  
>>>> finds
>>>> a stray \0 or e.g. the letter 'i' (perhaps the first letter of an
>>>> import statement), it'll stop immediately.
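
The strategy quoted above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposal's reference implementation: the class and method names are invented, the value pattern is tightened to exclude ';' so that a statement with no trailing whitespace is captured cleanly, UTF-32 is only tried if the running JVM happens to support it, and the later stages (falling back to -encoding or the platform default, and checking that the declared name agrees with the charset that found it) are left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingSniffer {
    // Patterns from the quoted text, with ';' excluded from the value (assumption).
    private static final Pattern SOURCE = Pattern.compile("source\\s+[^\\s;]+\\s*;");
    private static final Pattern ENCODING = Pattern.compile("encoding\\s+([^\\s;]+)\\s*;");

    // Skip whitespace, line comments and block comments starting at index i.
    static int skipTrivia(String s, int i) {
        while (i < s.length()) {
            if (Character.isWhitespace(s.charAt(i))) {
                i++;
            } else if (s.startsWith("//", i)) {
                int nl = s.indexOf('\n', i);
                i = (nl < 0) ? s.length() : nl + 1;
            } else if (s.startsWith("/*", i)) {
                int end = s.indexOf("*/", i + 2);
                if (end < 0) return s.length(); // unterminated comment
                i = end + 2;
            } else {
                break;
            }
        }
        return i;
    }

    // One parser run over already-decoded text: returns the declared
    // encoding name, or null as soon as anything unexpected shows up.
    static String findEncoding(String text) {
        int i = skipTrivia(text, 0);
        Matcher src = SOURCE.matcher(text).region(i, text.length());
        if (src.lookingAt()) {
            i = skipTrivia(text, src.end()); // ignore a complete source statement
        }
        Matcher enc = ENCODING.matcher(text).region(i, text.length());
        return enc.lookingAt() ? enc.group(1) : null;
    }

    // Trial charsets in the quoted order. Java's "UTF-16" decoder honours a
    // byte order mark and assumes big-endian when there is none.
    static List<Charset> candidates() {
        List<Charset> list = new ArrayList<>();
        list.add(StandardCharsets.ISO_8859_1);
        list.add(StandardCharsets.UTF_16);
        if (Charset.isSupported("UTF-32")) {
            list.add(Charset.forName("UTF-32"));
        }
        return list;
    }

    // Run the parser once per trial charset until one run finds a statement.
    static String detect(byte[] fileBytes) {
        for (Charset cs : candidates()) {
            String declared = findEncoding(new String(fileBytes, cs));
            if (declared != null) return declared;
        }
        return null; // caller falls back to -encoding or the platform default
    }
}
```

In this sketch a UTF-16 file fails the ISO-8859-1 run on its very first character (a BOM byte or a NUL), which is the "stops quickly" behaviour described above.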
>>>>
>>>> If an encoding is encountered that was not found during the  
>>>> standard
>>>> decoding strategy (ISO-8859-1, UTF-16, UTF-32), but worked only  
>>>> due to
>>>> a platform default/command line encoding param, (e.g. a platform  
>>>> that
>>>> defaults to UTF-16LE without a byte order mark) a warning  
>>>> explaining
>>>> that the encoding statement isn't doing anything is generated. Of
>>>> course, if the encoding doesn't match itself, you get an error
>>>> (putting "encoding UTF-16;" into a UTF-8 encoded file for  
>>>> example). If
>>>> there is no encoding statement, the 'real' java parser does what it
>>>> does now: Use the -encoding parameter of javac, and if that wasn't
>>>> present, the platform default.
>>>>
>>>> However, there is 1 major and 1 minor problem with this approach:
>>>>
>>>> B) This means javac will need to read every source file many  
>>>> times to
>>>> compile it.
>>>>
>>>> Worst case (no encoding keyword): 5 times.
>>>> Standard case if an encoding keyword: 2 times (3 times if UTF-16).
>>>>
>>>> Fortunately all runs should stop quickly, due to the encoding  
>>>> parser's
>>>> penchant to quit very early. Javacs out there will either stuff the
>>>> entire source file into memory, or if not, disk cache should take  
>>>> care
>>>> of it, but we can't prove beyond a doubt that this repeated parsing
>>>> will have no significant impact on compile time. Is this a
>>>> showstopper? Is the need to include a new (but small) parser into
>>>> javac a showstopper?
>>>>
>>>> C) Certain character sets, such as ISO-2022, can make the encoding
>>>> statement unreadable with the standard strategy if a comment  
>>>> including
>>>> non-ASCII characters precedes the encoding statement. These  
>>>> situations
>>>> are very rare (in fact, I haven't managed to find an example), so  
>>>> is
>>>> it okay to just ignore this issue? If you add the encoding  
>>>> statement
>>>> after a bunch of comments that make it invisible, and then  
>>>> compile it
>>>> with the right -encoding parameter, you WILL get a warning that the
>>>> encoding statement isn't going to help a javac on another  
>>>> platform /
>>>> without that encoding parameter to figure it out, so you just get  
>>>> the
>>>> current status quo: your source file won't compile without an  
>>>> explicit
>>>> -encoding parameter (or if that happens to be the platform  
>>>> default).
>>>> Should this be mentioned in the proposal? Should the compiler  
>>>> (and the
>>>> proposal) put effort into generating a useful warning message,  
>>>> such as
>>>> figuring out if it WOULD parse correctly if the encoding  
>>>> statement is
>>>> at the very top of the source file, vs. suggesting to recode in  
>>>> UTF-8?
>>>>
>>>> and a final dilemma:
>>>>
>>>> D) Should we separate the proposals for source and encoding  
>>>> keywords?
>>>> The source keyword is more useful and a lot simpler overall than  
>>>> the
>>>> encoding keyword, but they do sort of go together.
>>>
>>> Separate. Another reason is: the argument of applying different  
>>> settings
>>> to
>>> different parts of the project is much less valid with encoding than
>>> with source.
>>>
>>>>
>>>> --Reinier Zwitserloot and Roel Spilker
>>>>
>>>>
>>> Overall: I would prefer command line options enhanced to handle the
>>> situation
>>> rather than language change.
>>>
>>> Igor Karp
>>
>>