-encoding and -source in module-info.java

Tue Jan 24 04:07:29 PST 2012

The encoding and source level of a module are fundamental attributes of its sources: without them you cannot even reliably parse the files into a syntax tree. So I think they should be 
declared in module-info.java. Otherwise it is left up to whoever calls javac by hand, or to a build script, to specify these options. That is error-prone, and it 
means that tools which inspect sources (including but not limited to IDEs) need some separate mechanism for configuring these attributes: you cannot just 
hand them the sourcepath and let them run.

I am assuming that all files in the sourcepath use the same encoding and source level, which seems a reasonable restriction.

As to the source level, obviously given that JDK 8 will introduce module-info.java, "8" (or "1.8") seems like the right default value; but a syntax ought to be defined 
for specifying a newer level, e.g.

   source 1.9; // or 9?

Furthermore I think that JDK 9+ versions of javac should keep the same default source level: if your module expects a newer version of the Java language, you should 
have to say so explicitly. Otherwise a module might compile differently according to which version of javac was used, which is undesirable, and tools cannot guess what version you 
meant. A little more verbosity here seems to be justified.

Whether the bytecode target (-target) should be specified in module-info.java is another question. I have seen projects built with -target 5 for JDK 5 compatibility and 
also, as a separate artifact, with -target 6 for speed on JDK 6+ (split verifier). Probably the target level should default to the source level, and in the rare case that 
you need to override this, you can do so using a javac command option; that has no impact on tools which just need to parse and analyze source files.
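
To make the proposed defaulting rules concrete, here is a minimal sketch. The class and method names are made up for illustration, and "8" as the baseline is the assumption stated above; nothing here reflects an actual javac implementation.

```java
public class LevelDefaults {
    // Assumed baseline: the release that introduces module-info.java in this proposal.
    static final String BASELINE_SOURCE = "8";

    /** declaredSource is the value of a hypothetical "source N;" declaration, or null if absent. */
    public static String effectiveSource(String declaredSource) {
        // Deliberately independent of the version of the compiler that is running.
        return declaredSource != null ? declaredSource : BASELINE_SOURCE;
    }

    /** targetOption is the value of a -target command-line option, or null if not given. */
    public static String effectiveTarget(String declaredSource, String targetOption) {
        // The target follows the source level unless it is overridden on the command line.
        return targetOption != null ? targetOption : effectiveSource(declaredSource);
    }

    public static void main(String[] args) {
        System.out.println(effectiveSource(null));
        System.out.println(effectiveTarget("9", null));
        System.out.println(effectiveTarget("8", "9"));
    }
}
```

The point of the sketch is that a source-inspecting tool only ever needs effectiveSource, which depends on nothing but module-info.java.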

As to the encoding, something like

   encoding ISO-8859-2;

would suffice. The obvious problems for encoding are

1. What should the default value be? javac currently uses the platform default encoding, which IMHO is a horrible choice because it means that two people running javac 
with the same parameters on the same files may produce different classes and/or warning messages. I would suggest making UTF-8 the default when compiling in 
module mode (leaving the old behavior intact for legacy mode). For developers who want to keep sources in a different character set, adding one line per module-info.java 
does not seem like much of a burden.

2. What is module-info.java itself encoded in? If not UTF-8, then you need to be able to reliably find the encoding declaration and then rescan the file in that encoding. 
That is easy for most encodings (just do an initial scan in ISO-8859-1), including everything commonly used by developers AFAIK; a little trickier for UTF-16/32-type 
encodings, but possible by ignoring the bytes 0x00, 0xFE, and 0xFF (NUL padding and byte-order marks); and it only fails on some mainframe charsets, old JIS variants, and dingbats (*). 
Even those rare cases are probably guessable. [1]
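
The two-pass scan described in point 2 could look roughly like the sketch below. It assumes the hypothetical "encoding NAME;" syntax proposed above and falls back to UTF-8 when no declaration is found; the class name and the regular expression are my own, and a real implementation would also have to skip comments and string literals and handle unknown charset names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingSniffer {
    // Hypothetical declaration syntax from this proposal: "encoding <charset-name>;"
    private static final Pattern DECL =
            Pattern.compile("\\bencoding\\s+([A-Za-z0-9][A-Za-z0-9._:+-]*)\\s*;");

    /** First pass: find the declared charset, or UTF-8 if there is no declaration. */
    public static Charset sniff(byte[] raw) {
        StringBuilder ascii = new StringBuilder(raw.length);
        for (byte b : raw) {
            int u = b & 0xFF;
            // Skip NUL padding and BOM bytes so UTF-16/32 text collapses to plain ASCII.
            if (u != 0x00 && u != 0xFE && u != 0xFF) {
                ascii.append((char) u); // byte value == code point, i.e. an ISO-8859-1 decode
            }
        }
        Matcher m = DECL.matcher(ascii);
        // Charset.forName throws an unchecked exception for illegal or unsupported names.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    /** Second pass: decode the whole file in the charset that was found. */
    public static String decode(byte[] raw) {
        return new String(raw, sniff(raw));
    }

    public static void main(String[] args) {
        String text = "module test {\n  encoding UTF-16LE;\n}\n";
        byte[] raw = text.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(sniff(raw));
        System.out.println(decode(raw).equals(text));
    }
}
```

This is the same idea the demo program below tests across all available charsets, just applied to a single file.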

(*) Demo program:

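import java.nio.charset.Charset;
import java.util.Arrays;
// For every charset the JVM knows, encodes a sample module-info.java and checks whether
// the encoding declaration can still be found by scanning the bytes as ISO-8859-1.
public class CharsetTest {
     public static void main(String[] args) {
         // ISO-8859-1 maps each byte to the same code point, so it gives a "raw bytes" view.
         Charset raw = Charset.forName("ISO-8859-1");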
         for (Charset c : Charset.availableCharsets().values()) {
             String text = "/* leading comment */\nmodule test {\n  encoding " + c.name() + ";\n}\n";
             byte[] encoded;
             try {
                 encoded = text.getBytes(c);
             } catch (UnsupportedOperationException x) {
                 // Some charsets are decode-only and cannot produce an encoder.
                 System.out.println("cannot encode using " + c.name());
                 continue;
             }
             if (Arrays.equals(encoded, text.getBytes(raw))) {
                 System.out.println("OK in " + c.name());
             } else if (new String(encoded, raw).contains("  encoding " + c.name() + ";")) {
                 System.out.println("substring match in " + c.name());
                 dump(encoded);
             } else if (new String(encoded, raw).replace("\u0000", "").contains("  encoding " + c.name() + ";")) {
                 System.out.println("NUL-stripped match in " + c.name());
                 dump(encoded);
             } else {
                 System.out.println("garbled in " + c.name());
                 dump(encoded);
             }
         }
     }
     // Prints printable ASCII as-is, NUL as '@', and every other byte as a \XX hex escape.
     private static void dump(byte[] encoded) {
         for (byte b : encoded) {
             if (b >= 32 && b <= 126 || b == '\n' || b == '\r') {
                 System.out.write(b);
             } else if (b == 0) {
                 System.out.print('@');
             } else {
                 System.out.printf("\\%02X", b);
             }
         }
         System.out.println();
     }
     private CharsetTest() {}
}

[1] http://jchardet.sourceforge.net/