PRE-PROPOSAL: Source and Encoding keyword
Reinier Zwitserloot
reinier at zwitserloot.com
Sat Mar 7 12:13:34 PST 2009
Replies to Neal Gafter, Jeremy Manson, Igor Karp, Vilya Harvey, and
Stefan Schultz.
Jeremy Manson:
We don't discuss the API compatibility because that's not what source
does. That's what jigsaw might help with at some point. I don't really
understand your complaint - it applies today as well - the "-source"
parameter isn't a new invention, it's been part of javac for years.
-source won't even set the class file format properly, that's what
-target does. The "-source" parameter just doesn't guarantee producing
a class file that works on old systems, and it also doesn't guarantee
that the source file will compile on old systems. It's never done
that. This proposal does not aim to solve that problem. After all,
this is project coin. Not project Franklin, as Neal is so apt to
remind us.
Igor, and to a lesser extent Stefan:
I get the feeling we're not understanding each other. I dismantled just
about every argument you replied with.
An 'argfile' is obviously a 'build script'. See my previous post about
why moving file-level meta-data into a build script is utterly
unacceptable and doesn't have nearly the same advantages.
A major point you seem to ignore completely:
Meta-data (at least the source and encoding keyword) is part of
*EVERY* java file already. If you give me a UTF-16 encoded file, that
fact is part of that file, period. I can't compile it without knowing.
Same with source compatibility. Making it an explicit part of the file
(with our proposal) is therefore only good news.
To illustrate the problem I have with your argument: I could argue
with equal fervour that import statements ought to be part of a
compiler switch or a separate meta-data file using your arguments.
Which is clearly a bad idea.
Stefan:
Magic Comments is a very bad idea because it is first of all obviously
a kludge, but it also sets a bad precedent: All java parsers out there
now need to parse comments instead of skipping them, which causes very
big differences in tokenizer architecture. The tokenizer is also not
allowed to quit with an error when it can't parse a magic comment,
because there's no telling if it might be some future magic comment
syntax that was also introduced to avoid backward incompatibilities.
It's technically not backwards compatible, and finally: it's ugly.
Note that the proposal has already stated that encoding and source
keywords must precede everything else except comments. You can't
really demand that these precede all comments, because many tools add
a comment at the top of every file (e.g. a licence).
Vilya:
javac doesn't parse javadoc tags. It used to go in just to find any
@deprecated, but java is moving away from that via the @Deprecated
annotation. If anything, these could be annotations on class/interface
declarations, but not everything in a java file IS a member. By that
time we've passed the package statement, import statements, module-
info and package-info directives, and other annotations on the first
member (if any) in the source file. It's not flexible enough.
We've also just gone, on this mailing list, through a big argument
where many contributors jumped on the notion of an annotation changing
the way the compiler works (the ARM discussion). Clearly the tone has
been set: Don't use annotations (and I think javadoc tags fall under
the same distinction) to fundamentally change how the compiler works.
Neal:
I don't think I understand. All javacs have shipped with the -source
parameter, and undoubtedly javac7 will be no exception. Adding the
'source' keyword doesn't mean the JLS all of a sudden needs to embody
every version of java that's ever existed. The existence of the
-source parameter doesn't create that requirement either, and there's
no X in front of source, so the -source parameter is part of the java
specification. There is going to be no rule in the JLS that states
that a compatible compiler -must- support an enumerated list of
versions. It just says: This is an alternative to the -source
parameter. That's it. That's all. That's a tiny change. I just don't
understand why you think this proposal requires a massive change to
the JLS.
Even if a single javac invocation hits multiple files that have
different source keywords in them, I also see no issue. Javac
separately stubs out and compiles each file in the end; the only
change between compiling every file one-by-one and compiling many at
once is for co-dependencies.
The groovy/javac mixed compiler, which creates java stubs for groovy
files first so a mixed groovy/java source base can be compiled even
with many co-dependencies, proves this is an easy problem to solve.
Even if that is a bridge too far for project coin, the compiler always
has an alternative available: Just Quit. If the compiler can't make a
mixed source tree work in one go, just quit. With the appropriate
error. This is still an improvement over a vomitous stream of warnings
and errors. Tools will be built that use the stubbing approach to get
the job done. Javac itself can integrate one of them in a minor point
release.
--Reinier Zwitserloot
On Mar 7, 2009, at 18:27, Igor Karp wrote:
> Reinier,
>
> please see the comments inline.
>
> On Fri, Mar 6, 2009 at 11:39 PM, Reinier Zwitserloot
> <reinier at zwitserloot.com> wrote:
>> Igor,
>>
>> how could the command line options be expanded? Allow -encoding to
>> specify a separate encoding for each file? I don't see how that can
>> work.
> For example: allow multiple -encoding options and add an optional path
> to encoding: -encoding <encoding>[,<path>]
> Where path can be either a package (settings applied to the package
> and every package under it) or a single file for maximum precision.
> So one can have:
> -encoding X -encoding Y,a.b -encoding Z,a.b.c -encoding
> X,a.b.c.d.IAmSpecial
> IAmSpecial.java will get encoding X,
> everything else under a.b.c will get encoding Z,
> everything else under a.b will get encoding Y
> and the rest will get encoding X.
> Same approach can be applied to -source.
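
For illustration only: the precedence rule in the example above (the most specific path wins, and an option given without a path is the default) amounts to a longest-prefix lookup. The sketch below assumes exactly that rule. `PathScopedOption`, `add` and `resolve` are made-up names, and javac has no such option syntax today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical resolver for the "-encoding <encoding>[,<path>]" idea; not a javac API. */
final class PathScopedOption {
    private final Map<String, String> byPrefix = new LinkedHashMap<>();
    private String defaultValue; // value given without a path, if any

    /** Records one option occurrence, e.g. add("Y", "a.b") or add("X", null). */
    void add(String value, String path) {
        if (path == null || path.isEmpty()) {
            defaultValue = value;
        } else {
            byPrefix.put(path, value);
        }
    }

    /** Returns the value for a fully qualified name; the longest matching prefix wins. */
    String resolve(String qualifiedName) {
        String best = defaultValue;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : byPrefix.entrySet()) {
            String prefix = entry.getKey();
            // Match whole name segments only, so "a.b" does not match "a.bc.Foo".
            boolean matches = qualifiedName.equals(prefix)
                    || qualifiedName.startsWith(prefix + ".");
            if (matches && prefix.length() > bestLength) {
                best = entry.getValue();
                bestLength = prefix.length();
            }
        }
        return best;
    }
}
```

With the four options from the example registered, `resolve("a.b.c.d.IAmSpecial")` yields X, any other name under a.b.c yields Z, any other name under a.b yields Y, and everything else falls back to X.
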
>
>> There's no way I or anyone else is going to edit a build script (be it
>> just javac, a home-rolled thing, ant, rake, make, maven, ivy, etcetera)
>> to carefully enumerate every file's source compatibility level.
> Sure, that's what argfiles are for: store the settings in a file and
> use javac @argfile.
>
> And doing it as proposed above on a package level would make it more
> manageable.
> Remember in your proposal the only option is to specify it on a file
> level (this is fixable i guess).
>
>> Changing the command line options also incurs the necessary wrath of
>> all those build tool developers as they'd have to update their
>> software to handle the new option (adding an option is a change too!)
> Not more than changing the language itself.
>
>>
>> Could you also elaborate on why you don't like it? For example, how
>> can the benefits of having (more) portable source files, easier
>> migration, and a much cleaner solution to e.g. the assert-in-javac1.4
>> problem be achieved with e.g. command line options, or do you not
>> consider any of those worthwhile?
> I fully support the goal. I even see it as a bit too narrow (see
> below). But I do not see a need to change the language to achieve that
> goal.
>
> On a conceptual level I see these options as metadata of the source
> files and I don't like the idea of coupling it with the file.
> One can avoid all this complexity of extra parsing by specifying the
> encoding in an external file. This external file does not itself have
> to be in that encoding. In fact it can be restricted to always be
> in ASCII.
>
> I think the addition of an optional path and allowing multiple use of
> the same option approach is much more scalable: it could be extended
> to the other existing options (like -deprecation, -Xlint, etc.) and to
> the options that might appear in the future.
>
> I wish I could concentrate on deprecations in a certain package and
> ignore them everywhere else for now:
> javac -deprecation,really.rusty.one ...
> Finished with (or gave up on ;) that one and want to switch to the
> next one:
> javac -deprecation,another.old.one
>
> Igor Karp
>
>>
>> As an aside, how do people approach project coin submissions? I tend
>> to look at a proposal's value, which is its benefit divided by the
>> disadvantages (end-programmer complexity to learn, amount of changes
>> needed to javac and/or JVM, and restrictions on potential future
>> expansions). One of the reasons I'm writing this up with Roel is
>> because the disadvantages seemed to be almost nonexistent at the
>> outset (the encoding stuff made it more complicated, but at least the
>> complication is entirely hidden from java developers' eyes, so its
>> value proposition is still aces in my book). If there's a goal to keep
>> the total language changes, no matter how simple they are, down to a
>> small set, then benefit regardless of disadvantages is the better
>> yardstick.
>>
>> --Reinier Zwitserloot
>>
>>
>>
>> On Mar 7, 2009, at 08:15, Igor Karp wrote:
>>
>>> On Fri, Mar 6, 2009 at 10:03 PM, Reinier Zwitserloot
>>> <reinier at zwitserloot.com> wrote:
>>>>
>>>> We have written up a proposal for adding a 'source' and 'encoding'
>>>> keyword (alternatives to the -source and -encoding options on the
>>>> command line; they work pretty much just as you expect). The
>>>> keywords are context sensitive and must both appear before anything
>>>> else other than comments to be parsed. In case the benefit isn't
>>>> obvious: It is a great help when you are trying to port a big
>>>> project to a new source language compatibility level. Leaving half
>>>> your sourcebase in v1.6 and the other half in v1.7 is pretty much
>>>> impossible today; it's all-or-nothing. It should also be a much
>>>> nicer solution to the 'assert in v1.4' dilemma, which I guess is
>>>> going to happen to v1.7 as well, given that 'module' is most likely
>>>> going to become a keyword. Finally, it makes java files a lot more
>>>> portable; you no longer run into your strings looking weird when you
>>>> move your Windows-1252 encoded java source file to a mac, for
>>>> example.
>>>>
>>>> Before we finish it though, some open questions we'd like some
>>>> feedback on:
>>>>
>>>> A) Technically, starting a file with "source 1.4" is obviously
>>>> silly; javac v1.4 doesn't know about the source keyword and would
>>>> thus fail immediately. However, practically, it's still useful.
>>>> Example: if you've mostly converted a GWT project to GWT 1.5 (which
>>>> uses java 1.5 syntax), but have a few files remaining on GWT v1.4
>>>> (which uses java 1.4 syntax), then tossing a "source 1.4;" in those
>>>> older files eliminates all the generics warnings and serves as a
>>>> reminder that you should still convert those at some point. However,
>>>> it isn't -actually- compatible with a real javac 1.4. We're leaning
>>>> towards making "source 1.6;" (and below) legal even when using a
>>>> javac v1.7 or above, but perhaps that's a bridge too far? We could
>>>> go with magic comments but that seems like a very bad solution.
>>>>
>>>> also:
>>>>
>>>> Encoding is rather a hairy issue; javac will need to read the file
>>>> to find the encoding, but to read a file, it needs to know about
>>>> encoding! Fortunately, *every single* popular encoding on
>>>> wikipedia's popular encoding list at:
>>>>
>>>>
>>>> http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings
>>>>
>>>> will encode "encoding own-name-in-that-encoding;" the same as ASCII
>>>> would, except for KOI-7 and UTF-7 (both 7 bit encodings that I
>>>> doubt anyone ever uses to program java).
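
The ASCII-compatibility claim can be spot-checked with a few lines of Java. This is only a sketch: `AsciiCompatibilityCheck` and `declarationLooksLikeAscii` are hypothetical names, and only charsets known to the running JVM are tried. UTF-16BE is included as a charset that is not ASCII-compatible, which is consistent with the proposal trying UTF-16 and UTF-32 as separate decoding attempts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiCompatibilityCheck {
    /** True if the charset encodes "encoding <its own name>;" to the same bytes as US-ASCII. */
    static boolean declarationLooksLikeAscii(Charset charset) {
        String declaration = "encoding " + charset.name() + ";";
        byte[] ascii = declaration.getBytes(StandardCharsets.US_ASCII);
        byte[] actual = declaration.getBytes(charset);
        return Arrays.equals(ascii, actual);
    }

    public static void main(String[] args) {
        String[] names = {"ISO-8859-1", "UTF-8", "windows-1252", "UTF-16BE"};
        for (String name : names) {
            // Skip charsets this JVM does not ship; only a few are guaranteed to exist.
            if (Charset.isSupported(name)) {
                Charset charset = Charset.forName(name);
                System.out.println(name + " -> " + declarationLooksLikeAscii(charset));
            }
        }
    }
}
```
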
>>>>
>>>> Therefore, the proposal includes the following strategy to find the
>>>> encoding statement in a java source file without knowing the
>>>> encoding beforehand:
>>>>
>>>> An entirely separate parser (the encoding parser) is run repeatedly
>>>> until the right encoding is found. First it'll decode the input with
>>>> ISO-8859-1. If that doesn't work, UTF-16 (assume BE if no BOM, as
>>>> per the java standard), then UTF-32 (BE if no BOM), then the
>>>> current behaviour (-encoding parameter's value if any, otherwise
>>>> platform default encoding). This separate parser works as follows:
>>>>
>>>> 1. Ignore any comments and whitespace.
>>>> 2. Ignore the pattern (regexp-like syntax): source\s+[^\s]+\s*; - if
>>>> that pattern matches partially but is not correctly completed, that
>>>> parser run exits without finding an encoding, immediately.
>>>> 3. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern
>>>> matches partially but is not correctly completed, that parser run
>>>> exits without finding an encoding, immediately. If it does complete,
>>>> the parser also exits immediately and returns the captured value.
>>>> 4. If it finds anything else, stop immediately, returning no encoding
>>>> found.
>>>>
>>>> Once it's found something, the 'real' java parser will run using the
>>>> found encoding (this overrides any -encoding on the command line).
>>>> Note that the encoding parser stops quickly; for example, if it
>>>> finds a stray \0 or e.g. the letter 'i' (perhaps the first letter of
>>>> an import statement), it'll stop immediately.
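
The steps above can be sketched as a small hand-written scanner. This is a minimal illustration under stated assumptions, not the proposal's implementation: `EncodingSniffer` and `sniff` are hypothetical names, the value is read up to whitespace or ';' instead of with the exact regexps, and the caller is assumed to supply the trial charset for each run.

```java
import java.nio.charset.Charset;
import java.util.Optional;

/** Hypothetical sketch of one "encoding parser" run over a single trial decoding. */
final class EncodingSniffer {
    private final String text;
    private int pos;

    private EncodingSniffer(String text) {
        this.text = text;
    }

    /** Decodes the bytes with the trial charset and looks for a leading "encoding NAME;". */
    static Optional<String> sniff(byte[] bytes, Charset trial) {
        return new EncodingSniffer(new String(bytes, trial)).run();
    }

    private Optional<String> run() {
        skipTrivia();
        if (keyword("source")) {
            // An incomplete source statement ends this run without a result.
            if (argument() == null) {
                return Optional.empty();
            }
            skipTrivia();
        }
        // Anything other than an encoding statement ends the run immediately.
        if (!keyword("encoding")) {
            return Optional.empty();
        }
        return Optional.ofNullable(argument());
    }

    /** Skips whitespace, line comments and block comments. */
    private void skipTrivia() {
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
            } else if (text.startsWith("//", pos)) {
                int eol = text.indexOf('\n', pos);
                pos = (eol < 0) ? text.length() : eol + 1;
            } else if (text.startsWith("/*", pos)) {
                int end = text.indexOf("*/", pos + 2);
                pos = (end < 0) ? text.length() : end + 2;
            } else {
                return;
            }
        }
    }

    /** Consumes the word if it is present and followed by whitespace. */
    private boolean keyword(String word) {
        int end = pos + word.length();
        if (text.startsWith(word, pos) && end < text.length()
                && Character.isWhitespace(text.charAt(end))) {
            pos = end;
            return true;
        }
        return false;
    }

    /** Reads "VALUE ;" after a keyword; returns null if the statement is not completed. */
    private String argument() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < text.length() && text.charAt(pos) != ';'
                && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        String value = text.substring(start, pos);
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (value.isEmpty() || pos >= text.length() || text.charAt(pos) != ';') {
            return null;
        }
        pos++;
        return value;
    }
}
```

A caller following the strategy described earlier would invoke `sniff` first with ISO-8859-1, then UTF-16, then UTF-32, and fall back to the -encoding value or the platform default if every run comes back empty. A UTF-16 file decoded as ISO-8859-1 begins with a \0 character, so that run stops at once.
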
>>>>
>>>> If an encoding is encountered that was not found during the standard
>>>> decoding strategy (ISO-8859-1, UTF-16, UTF-32), but worked only due
>>>> to a platform default/command line encoding param (e.g. a platform
>>>> that defaults to UTF-16LE without a byte order mark), a warning
>>>> explaining that the encoding statement isn't doing anything is
>>>> generated. Of course, if the encoding doesn't match itself, you get
>>>> an error (putting "encoding UTF-16;" into a UTF-8 encoded file, for
>>>> example). If there is no encoding statement, the 'real' java parser
>>>> does what it does now: use the -encoding parameter of javac, and if
>>>> that wasn't present, the platform default.
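
One way to read "the encoding doesn't match itself" is: decode the file again with the charset it declares and see whether the same declaration is still readable. The sketch below assumes that reading. `DeclaredEncodingCheck` and `matchesItself` are hypothetical names, and the check searches the whole decoded text instead of only the file header, which is a simplification.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

final class DeclaredEncodingCheck {
    /**
     * Re-decodes the file with the charset it declares and reports whether the
     * same declaration can still be read. A false result is the error case.
     */
    static boolean matchesItself(byte[] fileBytes, String declaredName) {
        // Charset.forName throws if the JVM does not know the declared name.
        Charset declared = Charset.forName(declaredName);
        String decoded = new String(fileBytes, declared);
        Pattern statement = Pattern.compile(
                "encoding\\s+" + Pattern.quote(declaredName) + "\\s*;");
        return statement.matcher(decoded).find();
    }
}
```

For a UTF-8 file that claims "encoding UTF-16;", decoding the bytes as UTF-16 pairs them up into unrelated characters, the declaration disappears, and the check fails.
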
>>>>
>>>> However, there is 1 major and 1 minor problem with this approach:
>>>>
>>>> B) This means javac will need to read every source file many times
>>>> to compile it.
>>>>
>>>> Worst case (no encoding keyword): 5 times.
>>>> Standard case if an encoding keyword: 2 times (3 times if UTF-16).
>>>>
>>>> Fortunately all runs should stop quickly, due to the encoding
>>>> parser's penchant to quit very early. Javacs out there will either
>>>> stuff the entire source file into memory, or if not, disk cache
>>>> should take care of it, but we can't prove beyond a doubt that this
>>>> repeated parsing will have no significant impact on compile time. Is
>>>> this a showstopper? Is the need to include a new (but small) parser
>>>> into javac a showstopper?
>>>>
>>>> C) Certain character sets, such as ISO-2022, can make the encoding
>>>> statement unreadable with the standard strategy if a comment
>>>> including non-ASCII characters precedes the encoding statement.
>>>> These situations are very rare (in fact, I haven't managed to find
>>>> an example), so is it okay to just ignore this issue? If you add the
>>>> encoding statement after a bunch of comments that make it invisible,
>>>> and then compile it with the right -encoding parameter, you WILL get
>>>> a warning that the encoding statement isn't going to help a javac on
>>>> another platform / without that encoding parameter to figure it out,
>>>> so you just get the current status quo: your source file won't
>>>> compile without an explicit -encoding parameter (or if that happens
>>>> to be the platform default). Should this be mentioned in the
>>>> proposal? Should the compiler (and the proposal) put effort into
>>>> generating a useful warning message, such as figuring out if it
>>>> WOULD parse correctly if the encoding statement is at the very top
>>>> of the source file, vs. suggesting to recode in UTF-8?
>>>>
>>>> and a final dilemma:
>>>>
>>>> D) Should we separate the proposals for source and encoding
>>>> keywords? The source keyword is more useful and a lot simpler
>>>> overall than the encoding keyword, but they do sort of go together.
>>>
>>> Separate. Another reason is: the argument of applying different
>>> settings to different parts of the project is much less valid with
>>> encoding than with source.
>>>
>>>>
>>>> --Reinier Zwitserloot and Roel Spilker
>>>>
>>>>
>>> Overall: I would prefer command line options enhanced to handle the
>>> situation rather than a language change.
>>>
>>> Igor Karp
>>
>>