-encoding and -source in module-info.java

Tue Jan 24 17:08:21 PST 2012

Compilation units in the Java programming language consist of UTF-16 
code units, and there is no desire to change that. A compiler may support 
other encodings, but that is an implementation detail and so does not 
belong in the Java SE platform's idea of a module declaration.

Documenting and/or enforcing the language level is a much trickier topic 
that I doubt will be addressed by the module system. The API level 
available to a program can theoretically be "configured" by depending on 
a given version of java.base, but it's still up to a compiler to a) 
switch its language level to match the desired API and b) switch its own 
use of platform classes to match the desired API 
(http://blogs.oracle.com/darcy/entry/bootclasspath_older_source).

Alex

On 1/24/2012 4:07 AM, Jesse Glick wrote:
> The encoding and source level of a module are fundamental attributes of
> its sources, without which you cannot reliably even parse a syntax tree,
> so I think they should be declared in module-info.java. Otherwise it is
> left up to someone calling javac by hand, or a build script, to specify
> these options; that is potentially error-prone, and means that tools
> which inspect sources (including but not limited to IDEs) need to have
> some separate mechanism for configuration of these attributes: you
> cannot just hand them the sourcepath and let them run.
>
> I am assuming that all files in the sourcepath use the same encoding and
> source level, which seems a reasonable restriction.
>
>
> As to the source level, obviously given that JDK 8 will introduce
> module-info.java, "8" (or "1.8") seems like the right default value; but
> a syntax ought to be defined for specifying a newer level, e.g.
>
> source 1.9; // or 9?
>
> Furthermore I think that JDK 9+ versions of javac should keep the same
> default source level - you should need to explicitly mark what version
> of the Java language your module expects. Otherwise a module might
> compile differently according to which version of javac was used, which
> is undesirable, and tools cannot guess what version you meant. A little
> more verbosity here seems to be justified.
>
> Whether the bytecode target (-target) should be specified in
> module-info.java is another question. I have seen projects built using
> -target 5 for JDK 5 compatibility but also in a separate artifact using
> -target 6 for speed on JDK 6+ (split verifier). Probably the target
> level should default to the source level, and in the rare case that you
> need to override this, you can do so using a javac command option - this
> has no impact on tools which just need to parse and analyze source files.
>
>
> As to the encoding, something like
>
> encoding ISO-8859-2;
>
> would suffice. The obvious problems for encoding are
>
> 1. What should the default value be? javac currently uses the platform
> default encoding, which IMHO is a horrible choice because it means that
> two people running javac with the same parameters on the same files may
> be producing different classes and/or warning messages. I would suggest
> making UTF-8 be the default when compiling in module mode (leaving the
> old behavior intact for legacy mode). For developers who want to keep
> sources in a different character set, adding one line per
> module-info.java does not seem like much of a burden.
>
> 2. What is module-info.java itself encoded in? If not UTF-8, then you
> need to be able to reliably find the encoding declaration and then
> rescan the file in that encoding. That is easy for most encodings (just
> do an initial scan in ISO-8859-1), including everything commonly used by
> developers AFAIK; a little trickier for UTF-16/32-type encodings but
> possible by ignoring 0x00/0xFE/0xFF; and only fails on some mainframe
> charsets, old JIS variants, and dingbats (*). Even those rare cases are
> probably guessable. [1]
>
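To make the two-pass idea in item 2 concrete, here is a minimal sketch of how a tool might locate the declaration and then rescan. It is only an illustration under stated assumptions: the "encoding <name>;" syntax is the one proposed above and is not part of any released module-info.java grammar, the class and method names (EncodingSniffer, sniff, decode) are hypothetical, and falling back to UTF-8 follows the default suggested in item 1. It does not try to skip comments or handle the charsets the demo program reports as garbled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of javac or any existing tool.
class EncodingSniffer {
    // Assumed declaration syntax from the proposal: "encoding <charset-name>;".
    // The name pattern admits only characters that are legal in charset names.
    private static final Pattern DECL =
            Pattern.compile("\\bencoding\\s+([A-Za-z0-9][A-Za-z0-9._:+-]*)\\s*;");

    /** Pass 1: find the declared charset; fall back to UTF-8 if none is found. */
    static Charset sniff(byte[] raw) {
        // ISO-8859-1 maps every byte to exactly one char, so this never fails.
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);
        // Ignore 0x00/0xFE/0xFF so UTF-16/32 text collapses to plain ASCII.
        String stripped = latin1.replaceAll("[\\u0000\\u00FE\\u00FF]", "");
        Matcher m = DECL.matcher(stripped);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8; // assumed default, per item 1
    }

    /** Pass 2: rescan the same bytes in the charset found by pass 1. */
    static String decode(byte[] raw) {
        return new String(raw, sniff(raw));
    }

    public static void main(String[] args) {
        String src = "/* leading comment */\nmodule test {\n encoding UTF-16;\n}\n";
        byte[] bytes = src.getBytes(StandardCharsets.UTF_16);
        System.out.println("declared: " + sniff(bytes).name());
        System.out.println("round trip ok: " + decode(bytes).equals(src));
    }
}
```

A real implementation would have to run before the parser and would need a rule for a declaration that names an unsupported charset; this sketch silently falls back to UTF-8 in that case.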
>
> (*) Demo program:
>
> import java.nio.charset.Charset;
> import java.util.Arrays;
>
> public class CharsetTest {
>     public static void main(String[] args) {
>         // Reading the bytes as ISO-8859-1 simulates the initial scan described above.
>         Charset raw = Charset.forName("ISO-8859-1");
>         for (Charset c : Charset.availableCharsets().values()) {
>             String decl = " encoding " + c.name() + ";";
>             String text = "/* leading comment */\nmodule test {\n" + decl + "\n}\n";
>             byte[] encoded;
>             try {
>                 encoded = text.getBytes(c);
>             } catch (UnsupportedOperationException x) {
>                 // Some charsets are decode-only.
>                 System.out.println("cannot encode using " + c.name());
>                 continue;
>             }
>             String scanned = new String(encoded, raw);
>             if (Arrays.equals(encoded, text.getBytes(raw))) {
>                 System.out.println("OK in " + c.name());
>             } else if (scanned.contains(decl)) {
>                 System.out.println("substring match in " + c.name());
>                 dump(encoded);
>             } else if (scanned.replace("\u0000", "").contains(decl)) {
>                 System.out.println("NUL-stripped match in " + c.name());
>                 dump(encoded);
>             } else {
>                 System.out.println("garbled in " + c.name());
>                 dump(encoded);
>             }
>         }
>     }
>
>     // Prints printable ASCII as-is, NUL as '@', and any other byte as \XX hex.
>     private static void dump(byte[] encoded) {
>         for (byte b : encoded) {
>             if (b >= 32 && b <= 126 || b == '\n' || b == '\r') {
>                 System.out.write(b);
>             } else if (b == 0) {
>                 System.out.print('@');
>             } else {
>                 System.out.printf("\\%02X", b);
>             }
>         }
>         System.out.println();
>     }
>
>     private CharsetTest() {}
> }
>
>
> [1] http://jchardet.sourceforge.net/