-encoding and -source in module-info.java
Jesse Glick
jesse.glick at oracle.com
Tue Jan 24 04:07:29 PST 2012
The encoding and source level of a module are fundamental attributes of its sources, without which you cannot even reliably parse them into a syntax tree, so I think they should be
declared in module-info.java. Otherwise it is left up to someone calling javac by hand, or a build script, to specify these options; that is potentially error-prone, and
means that tools which inspect sources (including but not limited to IDEs) need to have some separate mechanism for configuration of these attributes: you cannot just
hand them the sourcepath and let them run.
I am assuming that all files in the sourcepath use the same encoding and source level, which seems a reasonable restriction.
As to the source level, obviously given that JDK 8 will introduce module-info.java, "8" (or "1.8") seems like the right default value; but a syntax ought to be defined
for specifying a newer level, e.g.
source 1.9; // or 9?
Furthermore I think that JDK 9+ versions of javac should keep the same default source level - you should need to explicitly mark what version of the Java language your
module expects. Otherwise a module might compile differently according to which version of javac was used, which is undesirable, and tools cannot guess what version you
meant. A little more verbosity here seems to be justified.
Whether the bytecode target (-target) should be specified in module-info.java is another question. I have seen projects built with -target 5 for JDK 5 compatibility and
also published as a separate artifact built with -target 6 for speed on JDK 6+ (split verifier). Probably the target level should default to the source level, and in the rare case that
you need to override this, you can do so using a javac command option - this has no impact on tools which just need to parse and analyze source files.
As to the encoding, something like
encoding ISO-8859-2;
would suffice. The obvious problems for encoding are
1. What should the default value be? javac currently uses the platform default encoding, which IMHO is a horrible choice because it means that two people running javac
with the same parameters on the same files may produce different classes and/or warning messages. I would suggest making UTF-8 the default when compiling in
module mode (leaving the old behavior intact for legacy mode). For developers who want to keep sources in a different character set, adding one line per module-info.java
does not seem like much of a burden.
2. What is module-info.java itself encoded in? If not UTF-8, then you need to be able to reliably find the encoding declaration and then rescan the file in that encoding.
That is easy for most encodings (just do an initial scan in ISO-8859-1), including everything commonly used by developers AFAIK; a little trickier for UTF-16/32-type
encodings but possible by ignoring 0x00/0xFE/0xFF; and only fails on some mainframe charsets, old JIS variants, and dingbats (*). Even those rare cases are probably
guessable. [1]
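To make point 2 concrete, here is a rough sketch of the two-pass read, not a definitive implementation. It assumes the "encoding NAME;" syntax proposed above (which is not an existing Java feature), and the names EncodingSniffer, sniff, read and DECL are made up for illustration. The first pass drops 0x00/0xFE/0xFF bytes and decodes the remainder as ISO-8859-1 to find the declaration; the second pass decodes the original bytes in the declared charset, falling back to UTF-8 as suggested in point 1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingSniffer {
    // Hypothetical declaration syntax from this proposal: "encoding <charset-name>;"
    private static final Pattern DECL =
            Pattern.compile("\\bencoding\\s+([A-Za-z0-9][A-Za-z0-9._:+-]*)\\s*;");

    /** Returns the declared charset, or UTF-8 (the proposed default) if no declaration is found. */
    static Charset sniff(byte[] raw) {
        // Pass 1: ignore 0x00/0xFE/0xFF so that ASCII text stored as UTF-16/32
        // collapses to plain ASCII bytes, then decode byte-for-byte as ISO-8859-1.
        byte[] filtered = new byte[raw.length];
        int n = 0;
        for (byte b : raw) {
            int u = b & 0xFF;
            if (u != 0x00 && u != 0xFE && u != 0xFF) {
                filtered[n++] = b;
            }
        }
        String ascii = new String(filtered, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(ascii);
        // Charset.forName throws if the declared name is illegal or unsupported.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    /** Pass 2: decode the original, unfiltered bytes in the sniffed charset. */
    static String read(byte[] raw) {
        return new String(raw, sniff(raw));
    }

    public static void main(String[] args) {
        String text = "module test {\n    encoding UTF-16LE;\n}\n";
        byte[] raw = text.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(sniff(raw));
        System.out.println(read(raw).equals(text));
    }
}
```

A real implementation would have to skip comments and string literals rather than use a bare regex, and would still need a guessing fallback for the charsets that come out garbled in the first pass.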
(*) Demo program:
import java.nio.charset.Charset;
import java.util.Arrays;

public class CharsetTest {
    public static void main(String[] args) {
        // ISO-8859-1 maps every byte to exactly one char, so it serves as the "raw" first-pass decoding.
        Charset raw = Charset.forName("ISO-8859-1");
        for (Charset c : Charset.availableCharsets().values()) {
            String text = "/* leading comment */\nmodule test {\n encoding " + c.name() + ";\n}\n";
            byte[] encoded;
            try {
                encoded = text.getBytes(c);
            } catch (UnsupportedOperationException x) {
                // Decode-only charsets cannot create an encoder.
                System.out.println("cannot encode using " + c.name());
                continue;
            }
            if (Arrays.equals(encoded, text.getBytes(raw))) {
                // Byte-identical to ISO-8859-1: the declaration is trivially readable.
                System.out.println("OK in " + c.name());
            } else if (new String(encoded, raw).contains(" encoding " + c.name() + ";")) {
                // Extra bytes (e.g. a BOM) are present, but the declaration survives intact.
                System.out.println("substring match in " + c.name());
                dump(encoded);
            } else if (new String(encoded, raw).replace("\u0000", "").contains(" encoding " + c.name() + ";")) {
                // UTF-16/32 style: the declaration is readable once NUL bytes are ignored.
                System.out.println("NUL-stripped match in " + c.name());
                dump(encoded);
            } else {
                System.out.println("garbled in " + c.name());
                dump(encoded);
            }
        }
    }

    // Prints printable ASCII as-is, NUL as '@', and any other byte as a \XX hex escape.
    private static void dump(byte[] encoded) {
        for (byte b : encoded) {
            if (b >= 32 && b <= 126 || b == '\n' || b == '\r') {
                System.out.write(b);
            } else if (b == 0) {
                System.out.print('@');
            } else {
                System.out.printf("\\%02X", b);
            }
        }
        System.out.println();
    }

    private CharsetTest() {}
}
[1] http://jchardet.sourceforge.net/