Module-file format (DRAFT)

Tue Jan 19 12:09:32 PST 2010

* Should SectionSize.usize/csize and SectionFileHeader.usize/csize follow
ModuleFileHeader.usize/csize and become u8 instead of the current u4?

* Should SectionHeader include a new u2 'version' field? Analysis for this
issue:

If an individual section type gets an update, one option is to rev the
version of the entire module file format, which means all dependent tools
immediately stop working with it until they also upgrade their code. This is
a bad thing if the section's type identifies it as something that is
irrelevant for a certain module file processor. Alternatively, the section
can rev itself by picking a new FileConstants.ModuleFile.SectionType, but
this will surely lead to haphazard type ids, or a risk of running out of
type ids if each id claims, let's say, 100 slots for potential future
versioning. Even if this is not a problem, a dependent tool that is looking
for, say, an "Author Information" section will get confused and fail with an
error 'module file is missing author information', even though the actual
problem is that the author information block is present in a version too new
for the tool to understand. If the version of a section were stored
separately, the tool could correctly report that the relevant section is
stored in a version that's too new for it to understand.
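
A minimal sketch of the lookup a tool could do if SectionHeader carried the
proposed version field. Every name here (SectionLookup, Section, Result) is
hypothetical and not part of the draft; the u2 version is modelled as a
plain int.

```java
import java.util.List;

// Hypothetical model of a section header with the proposed u2 'version'
// field. None of these names come from the draft format.
final class SectionLookup {
    enum Result { OK, MISSING, TOO_NEW }

    static final class Section {
        final int type;     // section type id
        final int version;  // proposed u2 version of this section's layout

        Section(int type, int version) {
            this.type = type;
            this.version = version;
        }
    }

    /**
     * Looks for the first section of wantedType and reports whether the
     * tool, which understands versions up to maxKnownVersion, can read it.
     */
    static Result check(List<Section> sections, int wantedType,
                        int maxKnownVersion) {
        for (Section s : sections) {
            if (s.type != wantedType) {
                continue; // irrelevant section: its version does not matter
            }
            return s.version > maxKnownVersion ? Result.TOO_NEW : Result.OK;
        }
        return Result.MISSING;
    }
}
```

With this in place, TOO_NEW and MISSING are distinct outcomes, so the tool
can print an accurate error instead of claiming the section is absent.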

* What's the point of having multiple different hash algorithms? I presume
the hash is simply a mechanism to assert that the file has not been tampered
with or corrupted during transit, and possibly as a quick way to identify a
given module file as having remained unchanged, and that it isn't intended
to fulfill the role of a signature like signed jars (if that does get added
to the module file format, I'd presume this will occur via a new section,
and not via the hash section). Analysis for this issue:

Every so often a hash algorithm gets cryptographically compromised. SHA-1 is
basically in the process of getting compromised in this fashion, and I would
certainly suggest that SHA256 or SHA512 be used for the module file format;
neither has so far been compromised in any practical manner. It is of course
conceivable that SHA512 gets compromised at some point. There's a fix for
that, though: rev the version of the module file format itself, and dictate
a new hash algorithm in the new version.

The hash is something everyone basically needs to adhere to; if it is to
have any significant security impact, _ALL_ tools that read module files
should, with strong preference, check it and refuse to process module files
with a bad hash. However, if there's a smattering of different hash
algorithms available, this is going to make writing these tools far more
difficult. After all, you'd have to carefully write, test, and maintain each
and every legal hash algorithm. I can easily imagine this scenario
occurring:

Java itself understands all of, let's say, 5 legal hash algorithms. However,
all tools that ship with the JDK in practice only ever generate 4 of those.
Some fairly obscure tool for some reason decides to hash with rarely used
hash algorithm 5, and it is released to the world because it gets tested
with the JVM itself, which understands it. Some other fairly obscure tool
has a bug in its hasher for hash type 5, but it, too, passes internal
testing and is released to the world because it gets tested with the output
from the JDK tools, which never generate this hash. Then some poor soul gets
confronted with the fact that the output of obscure tool number one (which
generates hash #5) crashes obscure tool number two (which has a bug in its
hash #5 reader).

Seems simpler to me to just dictate a single hash algorithm, with the
understanding that if it is ever compromised, the module file format itself
will rev up a version to fix it.
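
A rough sketch of what "one algorithm per format version" could look like in
a reader. The ModuleHash class and the mapping of format version 1 to
SHA-256 are assumptions made for illustration; the draft does not fix
either. Only java.security.MessageDigest from the JDK is used.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: the format version alone decides the hash algorithm,
// so a reader only ever has to implement and test one algorithm per version.
final class ModuleHash {
    /** Assumed mapping; version 1 -> SHA-256 is an illustrative choice. */
    static String algorithmFor(int formatVersion) {
        switch (formatVersion) {
            case 1:
                return "SHA-256";
            default:
                throw new IllegalArgumentException(
                        "unknown module file format version " + formatVersion);
        }
    }

    /** Hashes content with the algorithm mandated for formatVersion. */
    static byte[] digest(int formatVersion, byte[] content) {
        try {
            return MessageDigest.getInstance(algorithmFor(formatVersion))
                    .digest(content);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is a bug.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if storedHash matches the hash of content. */
    static boolean verify(int formatVersion, byte[] content,
                          byte[] storedHash) {
        return MessageDigest.isEqual(digest(formatVersion, content),
                storedHash);
    }
}
```

If the mandated algorithm is ever compromised, a new format version adds one
case to algorithmFor and readers of the old version stay unchanged.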

* Without knowing every possible section type, how can a tool know if a
certain section is based around SectionFileHeaders, or SectionSize? It
cannot check the type of the next chunk - if it is
indeed FileConstants.ModuleFile.SectionType.FILE it's most likely
SectionFileHeader based, but it could just be a coincidence and part of the
SectionSize.csize field. What happens if a SectionFileHeader based section
has 0 files in it? As it stands, HashSection's type numbers cannot conflict
with FileConstants.ModuleFile.SectionType.FILE, lest tools be forced to keep
a running count of the files' csize to realize they must have reached the
end, which sounds rather cumbersome.

This can be fixed relatively cheaply: reserve 1 bit in the
SectionHeader.type field to indicate whether the section is single-entity
or FileHeaders based, then let a FileHeaders-based section start with a u4
holding the number of files that follow. A tool that neither knows nor cares
about a FileHeaders-based section can now easily skip it by looping through
each file, reading the csize and pathLength from it, and skipping the
appropriate number of bytes. Alternatively, move the SectionSize block into
the SectionHeader block (in other words, make it mandatory even for
files-based sections).
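
The first variant can be sketched as follows. The byte layout is a
deliberately simplified assumption, not the draft's actual layout: a u2 type
whose top bit is the proposed flag, then either a u4 csize (single-entity)
or a u4 file count followed by per-file records of u4 csize, u2 pathLength,
path bytes and content bytes. SectionSkipper and FILES_BASED_BIT are
hypothetical names, and u4 values are read as signed ints, so sizes above
2^31-1 are not handled.

```java
import java.nio.ByteBuffer;

// Hypothetical skipper for sections a tool does not know or care about.
final class SectionSkipper {
    // Assumed flag: top bit of the u2 type marks a FileHeaders-based section.
    static final int FILES_BASED_BIT = 0x8000;

    /**
     * Advances buf past one whole section and returns the section's type id
     * with the flag bit cleared.
     */
    static int skipSection(ByteBuffer buf) {
        int type = buf.getShort() & 0xFFFF;           // u2 SectionHeader.type
        if ((type & FILES_BASED_BIT) == 0) {
            int csize = buf.getInt();                 // u4 SectionSize.csize
            buf.position(buf.position() + csize);     // skip the payload
        } else {
            int fileCount = buf.getInt();             // proposed u4 count
            for (int i = 0; i < fileCount; i++) {
                int csize = buf.getInt();             // u4 csize of this file
                int pathLength = buf.getShort() & 0xFFFF; // u2 pathLength
                // skip the path bytes and the compressed content
                buf.position(buf.position() + pathLength + csize);
            }
        }
        return type & ~FILES_BASED_BIT;
    }
}
```

A files-based section with 0 files is simply a type followed by a count of
0, so the loop body never runs and the next read starts at the next section.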

--Reinier Zwitserloot