Missing issue: Version string format

Mon Mar 21 13:49:22 UTC 2016

On 03/11/2016 04:13 PM, David M. Lloyd wrote:
> The current java.lang.module.ModuleDescriptor.Version class contains the
> comment:
>
> "Vaguely Debian-like version strings, for now.
> "This will, eventually, change."
>
> At some point the syntax and semantics of version designators has to be
> worked out and agreed upon.  Ideally the scheme would be compatible with
> as many existing widely deployed schemes as possible in terms of allowed
> syntax, and as much as possible, collation order (at least within the
> context of other modules from the same versioning scheme).

Judging from the lack of response, I assume that nobody has done any 
work on this, so I have a proposal.

*** PLEASE review this in detail and post responses and criticisms ASAP. 
  I am interpreting silence as agreement/approval! ***

In particular, the syntax and collation rules could use some discussion!

•Version Requirements•

Versions must abide by a consistent, easily describable syntax.

Versions must support as many widely-used versioning schemes as 
possible, in a manner which is as interoperable as possible.

Versions must collate in a manner consistent with expectations in terms 
of existing systems, to the maximum extent possible.

•Version Syntax•

I propose that a version conform to the following EBNF syntax:

    alpha = ? all Unicode letters (open for discussion) ?
    number = ? all Unicode digits (open for discussion) ?
    separator = "-" | "+" | "_" | "."
              | ? alpha-to-number transition ?
              | ? number-to-alpha transition ?
    part = number+ | alpha+
    version = part { separator part }

The special transitions mean that strings such as "8u12" will count as a 
three-part version "8" (sep) "u" (sep) "12" and would collate as such.
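
To make the transition rule concrete, here is a minimal Java sketch of a
tokenizer for the syntax above.  The class and method names are
hypothetical and not part of any JDK API, and the sketch assumes that
"letter" and "digit" mean Character.isLetter and Character.isDigit, which
the proposal leaves open for discussion.  A transition is represented as an
empty-string separator token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of any JDK API. */
final class VersionTokens {
    private static final String SEPARATORS = "-+_.";

    private VersionTokens() {}

    /**
     * Splits a version into alternating part and separator tokens.
     * An empty string stands for a digit/letter transition separator.
     */
    static List<String> tokenize(String version) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        boolean expectPart = true; // true at the start and right after an explicit separator
        while (i < version.length()) {
            int cp = version.codePointAt(i);
            if (SEPARATORS.indexOf(cp) >= 0) {
                if (expectPart) {
                    throw new IllegalArgumentException("Empty part at index " + i);
                }
                tokens.add(String.valueOf((char) cp));
                i++;
                expectPart = true;
                continue;
            }
            // Assumption: Character.isDigit / isLetter define "number" and "alpha".
            boolean digits = Character.isDigit(cp);
            if (!digits && !Character.isLetter(cp)) {
                throw new IllegalArgumentException("Illegal character at index " + i);
            }
            if (!expectPart) {
                tokens.add(""); // two parts met with no explicit separator: a transition
            }
            int start = i;
            while (i < version.length()) {
                cp = version.codePointAt(i);
                if (digits ? !Character.isDigit(cp) : !Character.isLetter(cp)) {
                    break;
                }
                i += Character.charCount(cp);
            }
            tokens.add(version.substring(start, i));
            expectPart = false;
        }
        if (expectPart) {
            throw new IllegalArgumentException("Version is empty or ends with a separator");
        }
        return tokens;
    }
}
```

With this sketch, VersionTokens.tokenize("8u12") yields the five tokens
"8", "", "u", "", "12": three parts joined by two transition separators.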

•Unicode considerations•

All version components would be normalized in NFKC form, in order to 
ensure consistent collation.
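
As an illustration, NFKC normalization is available in the JDK through
java.text.Normalizer.  The wrapper class below is a hypothetical name used
only for this sketch; where exactly normalization would happen (at parse
time or at comparison time) is an assumption, not something the proposal
specifies.

```java
import java.text.Normalizer;

/** Hypothetical wrapper for illustration; only Normalizer is a real JDK API here. */
final class VersionNormalizer {
    private VersionNormalizer() {}

    /** Returns the NFKC form of the raw version text. */
    static String normalize(String rawVersion) {
        return Normalizer.normalize(rawVersion, Normalizer.Form.NFKC);
    }
}
```

For example, the fullwidth digits in "\uFF11.\uFF18.0" are folded to their
ASCII compatibility equivalents, giving "1.8.0".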

•Collation•

Versions shall abide by the following collation rules.

Each part and separator in the version contributes to collation order.
Because a version consists of strictly alternating parts and separators,
a part is only ever compared with a part and a separator with a
separator, so no collation order between parts and separators needs to
be defined.

Number parts shall sort before alpha parts.

The sort order for separators should be as follows:
   • transitions sort highest (first)
   • underscores "_" sort next
   • pluses "+" sort next
   • hyphens "-" sort next
   • dots "." sort lowest (last)
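
The rules above could be sketched in Java roughly as follows.  All names
are hypothetical.  The sketch makes three assumptions that the text does
not settle: "first" is read as "earlier in ascending order" (reverse the
enum order for the opposite reading), two number parts compare by numeric
value, and two alpha parts compare with String.compareTo.

```java
import java.math.BigInteger;

/** Hypothetical illustration of the collation rules; not an existing API. */
final class CollationSketch {
    private CollationSketch() {}

    /** Declared in the order listed above, so ordinal() serves as the rank. */
    enum Separator {
        TRANSITION, UNDERSCORE, PLUS, HYPHEN, DOT;

        /** Maps a separator token to its constant; "" denotes a transition. */
        static Separator of(String token) {
            switch (token) {
                case "":  return TRANSITION;
                case "_": return UNDERSCORE;
                case "+": return PLUS;
                case "-": return HYPHEN;
                case ".": return DOT;
                default:  throw new IllegalArgumentException("Not a separator: " + token);
            }
        }
    }

    /** Assumption: a separator listed earlier compares as smaller. */
    static int compareSeparators(String a, String b) {
        return Separator.of(a).compareTo(Separator.of(b));
    }

    /** Compares two non-empty parts; number parts sort before alpha parts. */
    static int compareParts(String a, String b) {
        boolean aNumber = Character.isDigit(a.codePointAt(0));
        boolean bNumber = Character.isDigit(b.codePointAt(0));
        if (aNumber != bNumber) {
            return aNumber ? -1 : 1;
        }
        if (aNumber) {
            // Assumption: numeric value decides, so "9" sorts before "10".
            return new BigInteger(a).compareTo(new BigInteger(b));
        }
        // Assumption: plain String.compareTo (UTF-16 code unit order) for alpha parts.
        return a.compareTo(b);
    }
}
```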

•Compatibility•

OpenJDK and Oracle JDK versions follow a few different mildly complex 
schemes but can be more simply characterized by a few examples which are 
valid in different contexts:
  • 1.3.0
  • 1.3.1-beta
  • 1.3.1_05-ea
  • 1.8.0_66-b17
  • 8u66
  • 9-ea

All of these examples will parse and collate in a manner that seems 
consistent with expectations.

OSGi versions are in the form: number "." number "." number [ "." ? any 
string ? ].  Due to the arbitrary nature of the optional final 
(qualifier) segment, there exists a set of OSGi versions which are not
strictly compatible with this scheme, and a set of OSGi versions which 
are compatible but whose collation order might be affected by this scheme.

Maven versions are highly under-specified, but using the 
org.sonatype.aether.util.version.GenericVersion class as a reference 
indicates that Maven is employing a similar scheme, including empty 
"transition" separators, with the exception that all separators appear 
to be considered equal.  This may cause certain projects to collate 
differently, for example in the event that the separator was switched 
from "-" to "." along a branch's development lifecycle.  In addition, 
certain strings such as "alpha", "beta" etc. are specially detected and 
ordered.  However, other than the "ga" or "final" string, these strings 
already collate naturally, and it is a fairly common practice to rely on 
natural collation regardless, which may mitigate interoperability issues.

Debian versions allow ":" and "~" characters, and also allow parts to be 
empty, both of which are extensions that could be applied to this scheme 
if desired, as long as collation rules could be worked out for them.

•Implementation•

Two implementation approaches seem obvious.

The first approach uses an internal linked list composed of alternating
segments of parts and separators.  Parts and separators have collation
methods which consider the current part or separator, then fall back to
the next link (if any).  This approach is simple and elegant, but it
has substantial memory overhead due to the number of objects required
(for example, the string "1.8.0_66-b17" requires six parts and five 
separators for a total of 11 objects, which seems excessive).
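
A minimal sketch of what such a linked list might look like is shown here.
The Segment type and its methods are hypothetical.  The local comparison
assumes numeric ordering for number parts, String.compareTo for alpha
parts, and the separator list order as ascending rank; it also assumes
that a version which is a prefix of another sorts first, which the
proposal does not specify.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

/** Hypothetical node type for the linked-list approach. */
final class Segment implements Comparable<Segment> {
    // Assumption: separators rank in the listed order; "" is a transition.
    private static final List<String> RANK = Arrays.asList("", "_", "+", "-", ".");

    final String text;
    final boolean separator;
    final Segment next;

    Segment(String text, boolean separator, Segment next) {
        this.text = text;
        this.separator = separator;
        this.next = next;
    }

    /** Builds a chain from alternating tokens: part, separator, part, ... */
    static Segment chain(String... tokens) {
        Segment head = null;
        for (int i = tokens.length - 1; i >= 0; i--) {
            head = new Segment(tokens[i], i % 2 == 1, head);
        }
        return head;
    }

    /** Number of node objects in the chain starting at this node. */
    int length() {
        int n = 0;
        for (Segment s = this; s != null; s = s.next) {
            n++;
        }
        return n;
    }

    /** Considers this link first, then falls back to the next link. */
    @Override
    public int compareTo(Segment other) {
        int c = compareLocal(other);
        if (c != 0) {
            return c;
        }
        if (next == null) {
            return other.next == null ? 0 : -1; // assumption: the shorter chain sorts first
        }
        return other.next == null ? 1 : next.compareTo(other.next);
    }

    private int compareLocal(Segment other) {
        if (separator) {
            return Integer.compare(RANK.indexOf(text), RANK.indexOf(other.text));
        }
        boolean number = Character.isDigit(text.codePointAt(0));
        boolean otherNumber = Character.isDigit(other.text.codePointAt(0));
        if (number != otherNumber) {
            return number ? -1 : 1; // number parts sort before alpha parts
        }
        return number
                ? new BigInteger(text).compareTo(new BigInteger(other.text))
                : text.compareTo(other.text);
    }
}
```

Building the chain for "1.8.0_66-b17" by hand, as
Segment.chain("1", ".", "8", ".", "0", "_", "66", "-", "b", "", "17"),
produces the 11 node objects mentioned above.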

The second approach simply stores the content as a string, and uses an 
internal tokenizer to parse, validate, and collate.  This approach may 
be slightly more verbose in implementation but should be far more 
memory-efficient, generally requiring one or two temporary object 
allocations per parse/collate operation, and otherwise only requiring 
the memory necessary to hold the String object of the version plus the 
memory requirements of the Version object itself.
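
The following is a rough Java sketch of the String-based approach, under
stated assumptions rather than a finished design.  StringVersion and its
Cursor are hypothetical names.  The sketch validates with a regular
expression, allocates substrings for each part for brevity (a tuned
implementation would compare in place), ranks separators in their listed
order, compares number parts numerically, and treats a version that is a
prefix of another as sorting first.  None of those choices is fixed by the
proposal.

```java
import java.math.BigInteger;
import java.util.regex.Pattern;

/** Hypothetical string-backed version; not the real ModuleDescriptor.Version. */
final class StringVersion implements Comparable<StringVersion> {
    // Explicit separators in their listed order; a transition gets rank 0.
    private static final String SEPARATORS = "_+-.";
    // Non-empty runs of letters/digits, joined by single explicit separators.
    private static final Pattern SYNTAX =
            Pattern.compile("[\\p{L}\\p{Nd}]+(?:[-+_.][\\p{L}\\p{Nd}]+)*");

    private final String value;

    StringVersion(String value) {
        if (!SYNTAX.matcher(value).matches()) {
            throw new IllegalArgumentException("Malformed version: " + value);
        }
        this.value = value;
    }

    @Override
    public int compareTo(StringVersion other) {
        Cursor a = new Cursor(value);
        Cursor b = new Cursor(other.value);
        while (true) {
            int c = compareParts(a.nextPart(), b.nextPart());
            if (c != 0) {
                return c;
            }
            if (a.atEnd() || b.atEnd()) {
                // Assumption: a version that is a prefix of the other sorts first.
                return Boolean.compare(!a.atEnd(), !b.atEnd());
            }
            c = Integer.compare(a.nextSeparatorRank(), b.nextSeparatorRank());
            if (c != 0) {
                return c;
            }
        }
    }

    private static int compareParts(String x, String y) {
        boolean xNumber = Character.isDigit(x.codePointAt(0));
        boolean yNumber = Character.isDigit(y.codePointAt(0));
        if (xNumber != yNumber) {
            return xNumber ? -1 : 1; // number parts sort before alpha parts
        }
        return xNumber ? new BigInteger(x).compareTo(new BigInteger(y)) : x.compareTo(y);
    }

    /** Walks one validated version string, yielding parts and separator ranks. */
    private static final class Cursor {
        private final String s;
        private int pos;

        Cursor(String s) {
            this.s = s;
        }

        boolean atEnd() {
            return pos >= s.length();
        }

        /** Reads the run of digits or letters starting at the current position. */
        String nextPart() {
            int start = pos;
            boolean digits = Character.isDigit(s.codePointAt(pos));
            while (pos < s.length()) {
                int cp = s.codePointAt(pos);
                if (digits ? !Character.isDigit(cp) : !Character.isLetter(cp)) {
                    break;
                }
                pos += Character.charCount(cp);
            }
            return s.substring(start, pos);
        }

        /** Returns 0 for a transition, otherwise consumes the explicit separator. */
        int nextSeparatorRank() {
            int index = SEPARATORS.indexOf(s.charAt(pos));
            if (index < 0) {
                return 0;
            }
            pos++;
            return index + 1;
        }
    }
}
```

The only per-instance state is the String itself; everything else is
created temporarily inside compareTo.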

The existing Version class is designed more for simplicity than 
efficiency, using Lists of boxed objects internally and so forth.  While 
this is adequate for prototyping, I think the latter design, based on a
String and a tokenizer, is a better long-term solution, and that is what
I will pursue unless there is a strong argument otherwise.

Looking forward to discussion,
-- 
- DML