Valid characters in a module name

David M. Lloyd david.lloyd at redhat.com
Tue Jan 3 22:13:19 UTC 2017


On 01/03/2017 03:38 PM, Ess Kay wrote:
>> Java EE and JBoss Modules both allow this
> Two points. Firstly there is a big difference between a character being
> allowed and the character actually being used in practice. Are you
> saying that in practice anyone anywhere is putting spaces, single quotes
> or double quotes within a Java EE and JBoss module name?  If the answer
> is yes then how common would that be?

I don't have metrics, of course; one only needs to know the API contract 
to know that this is allowed.  But it is completely irrelevant to the 
discussion of the requirement at any rate because you're confusing a few 
things which I'll outline down below.

> Also, do Java EE and JBoss module names currently allow the characters
> in the range 0x00 to 0x1F?  If the answer is yes then 100% compatibility
> with Java 9 module names is already gone.

Yes, they are allowed (all except 0x00), but I don't think there 
is any practical way to actually accomplish injecting most of those 
values (at least in a Java EE situation), nor have I ever seen it happen 
in practice.

> The second point is that we are now talking about Java 9 module names
> being embedded as identifiers within Java class files where they will
> directly affect downstream users.  This was not the case with Java EE
> and JBoss Module names.  This is a much, much bigger deal.

No, that's not how it works at all.  Java source modules can only 
reference other Java source modules; nobody is going to be shocked that 
you can't reference a manually-built module from a Java source module. 
There's nothing to affect downstream users.  Only container code will be 
creating or referencing such modules.

>> There has been plenty of serious thought
> In the jpms-spec-experts list it is suggested that "for sanity" Java 9
> module names should not contain "any character whose Unicode code point
> is less than 0x20".  Yet the DEL (0x7F) character is allowed?
>
> Taking a step backwards, it would appear that it was never considered
> that Java 9 module names might need to be specified as identifiers in
> existing bytecode processing scripting languages.  That is 100%
> understandable.  However, now that it is known that that is the case,
> doesn't it make sense "for sanity" to not allow a Java 9 module name to
> contain spaces, single quotes, double quotes, semicolons or asterisks?

From the perspective of container code, those "sanity" characters are 
selected completely arbitrarily.  They have various usages in various 
contexts, but any given module layer implementation doesn't necessarily 
align with any of those contexts.  For example there may be different 
characters which don't make sense, while some of the examples you listed 
do.  So it makes the most sense to allow anything at the bytecode level, 
and rely on the module layer implementation to apply the appropriate policy.
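
As a rough illustration of that split (a sketch only, not anything from the
JPMS API): the class name, the two layer policies and the accepts() helper
below are all hypothetical.  The baseline check rejects only empty names and
ISO control characters, and each layer adds its own rule on top.

```java
import java.util.function.Predicate;

public class LayerPolicySketch {
    // Permissive baseline (assumption): non-empty, no ISO control characters
    // (U+0000..U+001F and U+007F..U+009F).
    public static final Predicate<String> BASELINE = name ->
        !name.isEmpty() && name.codePoints().noneMatch(Character::isISOControl);

    // Hypothetical layer with source-style names: dot-separated Java identifiers.
    // Simplified: reserved words are not checked.
    public static final Predicate<String> SOURCE_STYLE = name -> {
        for (String part : name.split("\\.", -1)) {
            if (part.isEmpty()
                    || !Character.isJavaIdentifierStart(part.codePointAt(0))
                    || !part.codePoints().skip(1).allMatch(Character::isJavaIdentifierPart)) {
                return false;
            }
        }
        return true;
    };

    // Hypothetical layer where a deployment file name is the module name:
    // anything goes except path separators.
    public static final Predicate<String> FILE_NAME_STYLE = name ->
        name.indexOf('/') < 0 && name.indexOf('\\') < 0;

    // A name is usable in a layer if it passes the baseline and that layer's policy.
    public static boolean accepts(Predicate<String> layerPolicy, String name) {
        return BASELINE.and(layerPolicy).test(name);
    }

    public static void main(String[] args) {
        String name = "my app.war";
        System.out.println("source-style layer: " + accepts(SOURCE_STYLE, name));
        System.out.println("file-name layer:    " + accepts(FILE_NAME_STYLE, name));
    }
}
```

The same name can be fine in one layer and rejected in another, and neither
layer needs the bytecode-level rule to know about its syntax.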

> The problem with spaces, single quotes, double quotes in an identifier
> that needs to be parsed from a text file is obvious.

Sure, but the lenient rules only apply to class files that were manually 
generated.  There are zero cases where one would have to parse an 
arbitrary module identifier from a text file; every module layer 
provider is going to have their own syntax and naming policy.

> As mentioned
> earlier, the problem with semicolons is that they are commonly used as
> terminators in the scripting languages which nearly always use a Java
> style syntax.  The problem with asterisks is that they are commonly used
> as a wildcard character in identifiers in the languages.

In the language, the module identifier is bounded by quite strict 
syntax.  Modules which are distributed for downstream consumption will 
likewise adhere to these criteria (they have no way to avoid it short of 
weird bytecode hacking).  The only time the more general rules come into 
play is when modules are being generated at run time from other module 
systems and setups.

Let me put it another way.

Every module system has its own rules and restrictions for how a module 
can be named.  Those restrictions do not all exactly align.  If you ban 
the union of all the disallowed characters in all module systems, then 
all module systems will break.  However, if you ban only the intersection 
of those sets (i.e. control characters), and allow each layer to 
enforce its own policy, then every system will work and interoperate as 
expected.  There is no downside because anyone who "cleverly" hacks 
bytecode to produce a vanilla module with an invalid name will soon 
realize that their module can never be found.  A user has to go to 
extraordinary measures to do so, so there is very little risk of such a 
thing happening nor is there a risk of it impacting users in any 
relevant way.
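
To make the union/intersection point concrete, here is a small sketch.  The
three "systems" and their banned character sets are invented for the
example; they do not describe any real module system.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BannedCharsSketch {
    // Hypothetical banned-character sets; the only character all three share
    // is a control character.
    public static final Set<Character> SYSTEM_A = Set.of('\u0001', ' ', ';');
    public static final Set<Character> SYSTEM_B = Set.of('\u0001', '*', '"');
    public static final Set<Character> SYSTEM_C = Set.of('\u0001', '/', ':');

    // Characters banned by at least one system.
    public static Set<Character> union(List<Set<Character>> systems) {
        Set<Character> result = new HashSet<>();
        for (Set<Character> s : systems) {
            result.addAll(s);
        }
        return result;
    }

    // Characters banned by every system.
    public static Set<Character> intersection(List<Set<Character>> systems) {
        Set<Character> result = new HashSet<>(systems.get(0));
        for (Set<Character> s : systems) {
            result.retainAll(s);
        }
        return result;
    }

    // True if the name contains none of the banned characters.
    public static boolean allowed(String name, Set<Character> banned) {
        for (int i = 0; i < name.length(); i++) {
            if (banned.contains(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<Character>> systems = List.of(SYSTEM_A, SYSTEM_B, SYSTEM_C);
        String name = "my app.war"; // legal in hypothetical system B
        System.out.println("under union ban:        " + allowed(name, union(systems)));
        System.out.println("under intersection ban: " + allowed(name, intersection(systems)));
    }
}
```

A name that system B accepts is rejected by the union ban, because system A
dislikes spaces; under the intersection ban it goes through, and system A
remains free to reject it in its own layer.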

Everyone has their own notion of what "offensive" characters would be. 
But enforcing these rules is and can only be the job of the layer provider.

> On Wed, Jan 4, 2017 at 2:04 AM, David M. Lloyd <david.lloyd at redhat.com
> <mailto:david.lloyd at redhat.com>> wrote:
>
>     On 01/01/2017 07:44 PM, Ess Kay wrote:
>
>         Hi Rémi,
>
>             You can update your tool to use an escape character
>
>         Sure. However, can you imagine how much work it would be to
>         update a Java
>         source parser to allow identifiers like package and class names
>         to contain
>         escaped semi-colons, single quotes or double quotes? My scenario
>         and that
>         of many others is the same.  It can be done but the result will
>         be ugly.
>
>         I repeat my earlier question, are there existing module systems
>         out there
>         that allow spaces, quotes and semicolons to appear in a module
>         name?
>
>
>     Yes.  Java EE and JBoss Modules both allow this, as do systems where
>     a file name is a module name.
>
>         All I ask is that serious thought be given to how much
>         flexibility is
>         really needed in a module name.  There are signs that there has
>         not yet
>         been much serious thought.  For example backspace is not allowed
>         but DEL
>         (0x7F) is allowed.
>
>
>     There has been plenty of serious thought, and I agree that we should
>     be disallowing all Unicode controls of any kind, but my
>     understanding is that there are implementation complexities involved
>     which make this somehow impractical.  However UTF-8 parsing is not
>     difficult so hopefully this can be revisited at some point.
>
>     --
>     - DML
>
>

-- 
- DML


More information about the jigsaw-dev mailing list