Module-file format (DRAFT)

Tue Jan 19 21:57:27 PST 2010

> Date: Tue, 19 Jan 2010 21:09:32 +0100
> From: Reinier Zwitserloot <reinier at zwitserloot.com>

Thanks for your thorough comments -- replies below.

> * Should SectionSize/SectionFileHeader.usize/csize also join
> ModuleFileHeader.u/csize and become u8 instead of the current u4?

It's difficult to imagine a module-file section being larger than 4GB.
I upgraded the ModuleFileHeader size fields in order to accommodate the
theoretical maximum, not because I think that anyone will actually
create multi-gigabyte module files in practice.
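
To make the u4/u8 distinction concrete, here is a minimal sketch of how a
reader might decode the two field widths.  The SizeFields class and its
method names are hypothetical, and big-endian byte order is an assumption
(it is what DataInputStream provides); neither is taken from the draft.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper, not part of the module-file implementation.
final class SizeFields {
    // Largest value a u4 field can hold: 2^32 - 1 bytes, just under 4GB.
    static final long U4_MAX = 0xFFFFFFFFL;

    // Reads a big-endian u4 and widens it to a non-negative long.
    static long readU4(DataInputStream in) throws IOException {
        return in.readInt() & U4_MAX;
    }

    // Reads a big-endian u8.  Java's long is signed, so values above
    // Long.MAX_VALUE would appear negative; this sketch rejects them.
    static long readU8(DataInputStream in) throws IOException {
        long v = in.readLong();
        if (v < 0) throw new IOException("u8 size exceeds Long.MAX_VALUE");
        return v;
    }

    public static void main(String[] args) throws IOException {
        byte[] allOnes = { (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF };
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(allOnes));
        System.out.println(readU4(in)); // 4294967295
    }
}
```

A u4 section size therefore tops out just below 4GB, while a u8 header
field can describe any file size a signed long can represent.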

> * Should SectionHeader include a new u2 'version' field? analysis for this
> issue:
> 
> If an individual section type gets an update, then either the entire module
> file format needs a version rev, which means all dependent tools immediately
> stop working with it until they also upgrade their code. This is a bad thing
> if the section's type identifies it as something that is irrelevant for a
> certain module file processor. Alternatively, the section can rev itself by
> picking a new FileConstants.ModuleFile.SectionType, but this will surely
> lead to haphazard type ids or a risk of running out of type ids if each id
> claims, let's say, 100 slots for potential future versioning. Even if this is
> not a problem, a dependent tool that is looking for, say, an "Author
> Information" section will get confused and fail with an error 'module file
> is missing author information', even though the actual problem is that the
> author information block is present as a version too new for the tool to
> understand. If the version of a section was separate, then the tool could
> correctly report that the relevant section is stored in a version that's too
> new for the tool to understand.

I'd like to see more use cases before adding this level of complexity.

A better place for the data in an "author information" section is likely
to be the module-info file itself, in the form of annotations.

> * What's the point of having multiple different hash algorithms? I presume
> the hash is simply a mechanism to assert that the file has not been tampered
> with or corrupted during transit, and possibly as a quick way to identify a
> given module file as having remained unchanged, and that it isn't intended
> to fulfill the role of a signature like signed jars (if that does get added
> to the module file format, I'd presume this will occur via a new section,
> and not via the hash section).

Most digital-signature information would go into a new section, assuming
we store it in the file.  Signatures could, however, leverage the hashes
already specified.

>                                analysis for this issue:
> 
> Every so often a hash algorithm gets cryptographically compromised. ...
>                                                        ... There's a fix for
> that, though: Rev the version of the module file format itself, and dictate
> a new hash algorithm in the new version.
> 
> The hash is something everyone basically needs to adhere to; if it is to
> have any significant security impact, _ALL_ tools that read module files
> should, with strong preference, check it and refuse to process module files
> with a bad hash. However, if there's a smattering of different hash
> algorithms available, this is going to make writing these tools far more
> difficult. ...
> 
> Seems simpler to me to just dictate hash algorithm with the understanding
> that if it is ever compromised, the module file format itself will rev up a
> version to fix it.

It's already the case that the module-file format version must be bumped
if a new HashType is defined, since existing readers won't understand the
new type.  In other words, the HashType enum in the FileConstants class
already constrains the supported hash algorithms.

The benefit of allowing multiple hash algorithms is that some
applications (e.g., embedded devices) might prefer a smaller hash due to
computational constraints, which is acceptable when access to the device
is suitably constrained.  I don't think that supporting a few different
hash algorithms is all that problematic, assuming that any such algorithm
is part of the standard JRE.
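
As a rough illustration of why a small, fixed set of algorithms costs a
reader very little, the sketch below maps each hash type to a standard
java.security.MessageDigest algorithm name.  The HashTypeSketch enum and
its two constants are hypothetical; the actual contents of the HashType
enum in FileConstants are not reproduced here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the HashType enum; the constants are
// illustrative assumptions, not the draft's actual values.
enum HashTypeSketch {
    SHA1("SHA-1"),
    SHA256("SHA-256");

    private final String algorithm;

    HashTypeSketch(String algorithm) {
        this.algorithm = algorithm;
    }

    // Computes the digest of the given bytes with this type's algorithm.
    byte[] hash(byte[] data) {
        try {
            return MessageDigest.getInstance(algorithm).digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Both names above are required of every Java platform
            // implementation, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "module".getBytes();
        for (HashTypeSketch t : values()) {
            System.out.println(t + ": " + t.hash(data).length + " bytes");
        }
    }
}
```

A reader that meets a hash type outside its enum cannot verify the file,
which is why defining a new type already implies a format-version bump.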

> * Without knowing every possible section type, how can a tool know if a
> certain section is based around SectionFileHeaders, or SectionSize?

The FileConstants.ModuleFile.SectionType.hasFiles() method answers this
question, though it may be better for the format itself to record it.
I'll consider that.

>                                                                     It
> cannot check the type of the next chunk - if it is
> indeed FileConstants.ModuleFile.SectionType.FILE it's most likely
> SectionFileHeader based, but it could just be a coincidence and part of the
> SectionSize.csize field.

That could be, but invoking the hasFiles() method on the section's type
field will tell you which interpretation to take.
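
The idea can be sketched as follows.  The enum below is a hypothetical
stand-in for FileConstants.ModuleFile.SectionType: the constant names and
the choice of which types contain files are assumptions made purely for
illustration.  Only the hasFiles() method name comes from the discussion
above.

```java
// Hypothetical stand-in for SectionType; constants and their hasFiles
// values are illustrative assumptions only.
enum SectionTypeSketch {
    MODULE_INFO(false),
    CLASSES(false),
    RESOURCES(true);

    private final boolean hasFiles;

    SectionTypeSketch(boolean hasFiles) {
        this.hasFiles = hasFiles;
    }

    // True if the section body is a series of per-file headers rather
    // than a single sized blob.
    boolean hasFiles() {
        return hasFiles;
    }
}

final class SectionLayout {
    // Chooses how to interpret the bytes after a section header based
    // only on the section's type, never on the bytes themselves.
    static String describe(SectionTypeSketch type) {
        return type.hasFiles()
            ? "sequence of SectionFileHeader entries"
            : "single SectionSize followed by content";
    }

    public static void main(String[] args) {
        for (SectionTypeSketch t : SectionTypeSketch.values()) {
            System.out.println(t + ": " + describe(t));
        }
    }
}
```

Because the decision depends on the type alone, a byte pattern inside a
csize field that happens to resemble a type id is never misread.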

>                          What happens if a SectionFileHeader based section
> has 0 files in it? As it stands, HashSection's type numbers cannot conflict
> with FileConstants.ModuleFile.SectionType.FILE lest tools are forced to keep
> a running count of the file's csize to realize it must have reached the end,
> which sounds rather cumbersome.

No, a HashSection is preceded by its own SectionHeader and SectionSize
structures.  (Hmm, HashSection should really be named HashContent; I'll
fix that.)

> This can be fixed relatively cheaply: Reserve 1 bit in the
> SectionHeader.type field as indicating whether the section is single-entity
> or FileHeaders based, then let a FileHeaders-based section be followed by a
> u4: number of files that follow. A tool that neither knows nor cares about a
> FileHeaders based section can now easily skip it by looping through each
> file, reading the csize and pathLength from it, and skipping the appropriate
> number of bytes. Alternatively, move the SectionSize block into the
> SectionHeader block (in other words, make it mandatory even for files-based
> sections).

Moving the section size up into the overall section header does make
sense for the purpose of skipping whole sections; I'll do that.
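
With the size in the section header, a tool can step over any section it
does not care about without understanding its contents.  The following
is a minimal sketch under an assumed layout of a u2 type followed by a u4
compressed size and then that many content bytes; the real header has
other fields, and the SectionSkipper class is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical reader; assumes each section is: u2 type, u4 csize, content.
final class SectionSkipper {
    // Returns the content of the first section of the wanted type, or
    // null if the stream ends first.  All other sections are skipped.
    static byte[] findSection(DataInputStream in, int wantedType)
            throws IOException {
        while (true) {
            int type;
            try {
                type = in.readUnsignedShort();
            } catch (EOFException e) {
                return null; // no more sections
            }
            long csize = in.readInt() & 0xFFFFFFFFL; // unsigned u4
            if (type == wantedType) {
                if (csize > Integer.MAX_VALUE)
                    throw new IOException("section too large for this sketch");
                byte[] content = new byte[(int) csize];
                in.readFully(content);
                return content;
            }
            skipFully(in, csize);
        }
    }

    // skip() may skip fewer bytes than requested, so loop until done.
    static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("truncated section");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(1); out.writeInt(3); out.write(new byte[] {10, 11, 12});
        out.writeShort(2); out.writeInt(2); out.write(new byte[] {20, 21});
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(findSection(in, 2).length); // 2
    }
}
```

The reader never needs to know whether a skipped section is files-based
or single-entity, which is the point of putting the size in the header.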

- Mark