PRE-PROPOSAL: Source and Encoding keyword

Sat Mar 7 09:55:57 PST 2009

In the proposal, there is a paragraph about having these keywords in the
package-info.java (and in the module-info.java) as well to allow for a more
coarse-grained declaration. We didn't put this in the PRE-PROPOSAL email
since it was already so long. Please keep to the questions asked in the
PRE-PROPOSAL, and wait for a discussion until the PROPOSAL is sent.

Roel

On Sat, Mar 7, 2009 at 6:27 PM, Igor Karp <igor.v.karp at gmail.com> wrote:

> Reinier,
>
> please see the comments inline.
>
> On Fri, Mar 6, 2009 at 11:39 PM, Reinier Zwitserloot
> <reinier at zwitserloot.com> wrote:
> > Igor,
> >
> > how could the command line options be expanded? Allow -encoding to specify a
> > separate encoding for each file? I don't see how that can work.
> For example: allow multiple -encoding options and add optional path to
> encoding -encoding <encoding>[,<path>]
> Where path can be either a package (settings applied to the package
> and every package under it) or a single file for maximum precision.
> So one can have:
> -encoding X -encoding Y,a.b -encoding Z,a.b.c -encoding
> X,a.b.c.d.IAmSpecial
> IAmSpecial.java will get encoding X,
> everything else under a.b.c will get encoding Z,
> everything else under a.b will get encoding Y
> and the rest will get encoding X.
> Same approach can be applied to -source.
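The lookup described above amounts to a longest-prefix match over dotted package paths. The following is only a minimal sketch of that idea: the class `ScopedEncodingResolver` and its methods are hypothetical, the `<encoding>[,<path>]` syntax is the one suggested in this message and is not an existing javac option, and X, Y and Z are placeholder encoding names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScopedEncodingResolver {
    // Hypothetical table: dotted path prefix ("" means project-wide default) -> encoding name.
    private final Map<String, String> byPrefix = new LinkedHashMap<>();

    /** Registers one option value of the suggested form "<encoding>[,<path>]". */
    public void add(String optionValue) {
        int comma = optionValue.indexOf(',');
        String encoding = comma < 0 ? optionValue : optionValue.substring(0, comma);
        String path = comma < 0 ? "" : optionValue.substring(comma + 1);
        byPrefix.put(path, encoding);
    }

    /** Returns the encoding of the longest registered prefix covering the name, or null. */
    public String resolve(String qualifiedName) {
        String best = null;
        int bestLength = -1;
        for (Map.Entry<String, String> e : byPrefix.entrySet()) {
            String prefix = e.getKey();
            // A prefix covers a name if it is the default, the name itself,
            // or an enclosing package (so "a.b" covers "a.b.Foo" but not "a.bc.Foo").
            boolean covers = prefix.isEmpty()
                    || qualifiedName.equals(prefix)
                    || qualifiedName.startsWith(prefix + ".");
            if (covers && prefix.length() > bestLength) {
                best = e.getValue();
                bestLength = prefix.length();
            }
        }
        return best;
    }
}
```

With the four option values from the example registered, `resolve("a.b.c.d.IAmSpecial")` picks the file-level entry, while other names fall back to the nearest enclosing package and finally to the default.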
>
> > There's no
> > way I or anyone else is going to edit a build script (be it just javac, a
> > home-rolled thing, ant, rake, make, maven, ivy, etcetera) to carefully
> > enumerate every file's source compatibility level.
> Sure, that's what argfiles are for: store the settings in a file and
> use javac @argfile.
>
> And doing it as proposed above on a package level would make it more
> manageable.
> Remember in your proposal the only option is to specify it on a file
> level (this is fixable I guess).
>
> > Changing the command line
> > options also incurs the necessary wrath of all those build tool developers
> > as they'd have to update their software to handle the new option (adding an
> > option is a change too!)
> Not more than changing the language itself.
>
> >
> > Could you also elaborate on why you don't like it? For example, how can the
> > benefits of having (more) portable source files, easier migration, and a
> > much cleaner solution to e.g. the assert-in-javac1.4 be achieved with e.g.
> > command line options, or do you not consider any of those worthwhile?
> I fully support the goal. I even see it as a bit too narrow (see
> below). But I do not see a need to change the language to achieve that
> goal.
>
> On a conceptual level I see these options as a metadata of the source
> files and I don't like the idea of coupling it with the file.
> One can avoid all this complexity of extra parsing by specifying the
> encoding in an external file. This external file does not itself
> have to be in that encoding. In fact it can be restricted to be
> always in ASCII.
>
> I think the addition of an optional path and allowing multiple use of
> the same option approach is much more scalable: it could be extended
> to the other existing options (like -deprecation, -Xlint, etc.) and to
> the options that might appear in the future.
>
> I wish I could concentrate on deprecations in a certain package and
> ignore them everywhere else for now:
> javac -deprecation,really.rusty.one ...
> Finished with (or gave up on ;) that one and want to switch to the next
> one:
> javac -deprecation,another.old.one
>
> Igor Karp
>
> >
> > As an aside, how do people approach Project Coin submissions? I tend to look
> > at a proposal's value, which is its benefit divided by the disadvantages
> > (end-programmer complexity to learn, amount of changes needed to javac
> > and/or JVM, and restrictions on potential future expansions). One of the
> > reasons I'm writing this up with Roel is because the disadvantages seemed to
> > be almost nonexistent at the outset (the encoding stuff made it more
> > complicated, but at least the complication is entirely hidden from java
> > developers' eyes, so its value proposition is still aces in my book). If there's
> > a goal to keep the total language changes, no matter how simple they are,
> > down to a small set, then benefit regardless of disadvantages is the better
> > yardstick.
> >
> >  --Reinier Zwitserloot
> >
> >
> >
> > On Mar 7, 2009, at 08:15, Igor Karp wrote:
> >
> >> On Fri, Mar 6, 2009 at 10:03 PM, Reinier Zwitserloot
> >> <reinier at zwitserloot.com> wrote:
> >>>
> >>> We have written up a proposal for adding a 'source' and 'encoding'
> >>> keyword (alternatives to the -source and -encoding options on the
> >>> command line; they work pretty much just as you expect). The keywords
> >>> are context sensitive and must both appear before anything else other
> >>> than comments to be parsed. In case the benefit isn't obvious: It is a
> >>> great help when you are trying to port a big project to a new source
> >>> language compatibility. Leaving half your sourcebase in v1.6 and the
> >>> other half in v1.7 is pretty much impossible today, it's all-or-
> >>> nothing. It should also be a much nicer solution to the 'assert in
> >>> v1.4' dilemma, which I guess is going to happen to v1.7 as well, given
> >>> that 'module' is most likely going to become a keyword. Finally, it
> >>> makes java files a lot more portable; you no longer run into your
> >>> strings looking weird when you move your Windows-1252 encoded java
> >>> source file to a mac, for example.
> >>>
> >>> Before we finish it though, some open questions we'd like some
> >>> feedback on:
> >>>
> >>> A) Technically, starting a file with "source 1.4" is obviously silly;
> >>> javac v1.4 doesn't know about the source keyword and would thus fail
> >>> immediately. However, practically, it's still useful. Example: if
> >>> you've mostly converted a GWT project to GWT 1.5 (which uses java 1.5
> >>> syntax), but have a few files remaining on GWT v1.4 (which uses java
> >>> 1.4 syntax), then tossing a "source 1.4;" in those older files
> >>> eliminates all the generics warnings and serves as a reminder that you
> >>> should still convert those at some point. However, it isn't -actually-
> >>> compatible with a real javac 1.4. We're leaning to making "source
> >>> 1.6;"  (and below) legal even when using a javac v1.7 or above, but
> >>> perhaps that's a bridge too far? We could go with magic comments but
> >>> that seems like a very bad solution.
> >>>
> >>> also:
> >>>
> >>> Encoding is rather a hairy issue; javac will need to read the file to
> >>> find the encoding, but to read a file, it needs to know about
> >>> encoding! Fortunately, *every single* popular encoding on wikipedia's
> >>> popular encoding list at:
> >>>
> >>> http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings
> >>>
> >>> will encode "encoding own-name-in-that-encoding;" the same as ASCII
> >>> would, except for KOI-7 and UTF-7, (both 7 bit encodings that I doubt
> >>> anyone ever uses to program java).
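The ASCII-compatibility claim can be spot-checked for any charset the running JVM supports. This is a minimal sketch, not part of the proposal: the class and method names are hypothetical, and it only compares the bytes of the declaration string itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatibilityCheck {
    /**
     * Returns true if the given charset encodes the statement
     * "encoding <its own canonical name>;" to exactly the bytes US-ASCII would produce.
     */
    public static boolean encodesOwnDeclarationAsAscii(Charset charset) {
        String declaration = "encoding " + charset.name() + ";";
        byte[] ascii = declaration.getBytes(StandardCharsets.US_ASCII);
        byte[] own = declaration.getBytes(charset);
        return Arrays.equals(ascii, own);
    }
}
```

ASCII-compatible charsets such as UTF-8 and ISO-8859-1 pass this check. A multi-byte encoding such as UTF-16BE does not, since it emits two bytes per character.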
> >>>
> >>> Therefore, the proposal includes the following strategy to find the
> >>> encoding statement in a java source file without knowing the encoding
> >>> beforehand:
> >>>
> >>> An entirely separate parser (the encoding parser) is run repeatedly
> >>> until the right encoding is found. First it'll decode the input with
> >>> ISO-8859-1. If that doesn't work, UTF-16 (assume BE if no BOM, as per
> >>> the java standard), then as UTF-32 (BE if no BOM), then the current
> >>> behaviour (-encoding parameter's value if any, otherwise platform
> >>> default encoding). This separate parser works as follows:
> >>>
> >>> 1. Ignore any comments and whitespace.
> >>> 3. Ignore the pattern (regexp-like syntax): source\s+[^\s]+\s*; - if
> >>> that pattern matches partially but is not correctly completed, that
> >>> parser run exits without finding an encoding, immediately.
> >>> 4. Find the pattern: encoding\s+([^\s]+)\s*; - if that pattern matches
> >>> partially but is not correctly completed, that parser run exits
> >>> without finding an encoding, immediately. If it does complete, the
> >>> parser also exits immediately and returns the captured value.
> >>> 5. If it finds anything else, stop immediately, returning no encoding
> >>> found.
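One run of the encoding parser, as listed above, could look roughly like the sketch below. It is an illustration only, under the assumption that the file bytes have already been decoded to a String with the candidate encoding for this run. The class `EncodingPrescan` and its method names are hypothetical; the two regular expressions are taken from the steps above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingPrescan {
    private static final Pattern SOURCE = Pattern.compile("source\\s+[^\\s]+\\s*;");
    private static final Pattern ENCODING = Pattern.compile("encoding\\s+([^\\s]+)\\s*;");

    /** Returns the declared encoding name, or null if this run finds none. */
    public static String findEncoding(String text) {
        int pos = skipTrivia(text, 0);
        // An optional, complete source statement is skipped.
        Matcher source = SOURCE.matcher(text).region(pos, text.length());
        if (source.lookingAt()) {
            pos = skipTrivia(text, source.end());
        }
        // Anything other than a complete encoding statement ends the run,
        // including a partially matched source or encoding statement.
        Matcher encoding = ENCODING.matcher(text).region(pos, text.length());
        return encoding.lookingAt() ? encoding.group(1) : null;
    }

    /** Advances past whitespace, line comments and block comments. */
    private static int skipTrivia(String s, int pos) {
        while (pos < s.length()) {
            if (Character.isWhitespace(s.charAt(pos))) {
                pos++;
            } else if (s.startsWith("//", pos)) {
                int newline = s.indexOf('\n', pos);
                pos = newline < 0 ? s.length() : newline + 1;
            } else if (s.startsWith("/*", pos)) {
                int end = s.indexOf("*/", pos + 2);
                if (end < 0) {
                    return s.length(); // unterminated comment: nothing left to scan
                }
                pos = end + 2;
            } else {
                break;
            }
        }
        return pos;
    }
}
```

Because the scan gives up at the first token that is neither trivia nor one of the two statements, a file that starts with an import or a class declaration costs only a few character comparisons per run.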
> >>>
> >>> Once it's found something, the 'real' java parser will run using the
> >>> found encoding (this overrides any -encoding on the command line).
> >>> Note that the encoding parser stops quickly; for example, if it finds
> >>> a stray \0 or e.g. the letter 'i' (perhaps the first letter of an
> >>> import statement), it'll stop immediately.
> >>>
> >>> If an encoding is encountered that was not found during the standard
> >>> decoding strategy (ISO-8859-1, UTF-16, UTF-32), but worked only due to
> >>> a platform default/command line encoding param, (e.g. a platform that
> >>> defaults to UTF-16LE without a byte order mark) a warning explaining
> >>> that the encoding statement isn't doing anything is generated. Of
> >>> course, if the encoding doesn't match itself, you get an error
> >>> (putting "encoding UTF-16;" into a UTF-8 encoded file for example). If
> >>> there is no encoding statement, the 'real' java parser does what it
> >>> does now: Use the -encoding parameter of javac, and if that wasn't
> >>> present, the platform default.
> >>>
> >>> However, there is 1 major and 1 minor problem with this approach:
> >>>
> >>> B) This means javac will need to read every source file many times to
> >>> compile it.
> >>>
> >>> Worst case (no encoding keyword): 5 times.
> >>> Standard case with an encoding keyword: 2 times (3 times if UTF-16).
> >>>
> >>> Fortunately all runs should stop quickly, due to the encoding parser's
> >>> penchant to quit very early. Javacs out there will either stuff the
> >>> entire source file into memory, or if not, disk cache should take care
> >>> of it, but we can't prove beyond a doubt that this repeated parsing
> >>> will have no significant impact on compile time. Is this a
> >>> showstopper? Is the need to include a new (but small) parser into
> >>> javac a showstopper?
> >>>
> >>> C) Certain character sets, such as ISO-2022, can make the encoding
> >>> statement unreadable with the standard strategy if a comment including
> >>> non-ASCII characters precedes the encoding statement. These situations
> >>> are very rare (in fact, I haven't managed to find an example), so is
> >>> it okay to just ignore this issue? If you add the encoding statement
> >>> after a bunch of comments that make it invisible, and then compile it
> >>> with the right -encoding parameter, you WILL get a warning that the
> >>> encoding statement isn't going to help a javac on another platform /
> >>> without that encoding parameter to figure it out, so you just get the
> >>> current status quo: your source file won't compile without an explicit
> >>> -encoding parameter (or if that happens to be the platform default).
> >>> Should this be mentioned in the proposal? Should the compiler (and the
> >>> proposal) put effort into generating a useful warning message, such as
> >>> figuring out if it WOULD parse correctly if the encoding statement is
> >>> at the very top of the source file, vs. suggesting to recode in UTF-8?
> >>>
> >>> and a final dilemma:
> >>>
> >>> D) Should we separate the proposals for source and encoding keywords?
> >>> The source keyword is more useful and a lot simpler overall than the
> >>> encoding keyword, but they do sort of go together.
> >>
> >> Separate. Another reason is: the argument of applying different settings to
> >> different parts of the project is much less valid with encoding than
> >> with source.
> >>
> >>>
> >>> --Reinier Zwitserloot and Roel Spilker
> >>>
> >>>
> >> Overall: I would prefer command line options enhanced to handle the situation
> >> rather than a language change.
> >>
> >> Igor Karp
> >
> >
>
>