[icedtea-web] RFC: use tagsoup to try and parse malformed JNLP files

Omair Majid omajid at redhat.com
Mon Jan 10 16:10:16 PST 2011


On 01/10/2011 05:33 PM, Dr Andrew John Hughes wrote:
> On 16:39 Mon 10 Jan     , Omair Majid wrote:
>> Hi,
>>
>> I have come across a number of JNLP files that are not valid xml. Netx
>> can not parse these files using a xml parser, and fails to run them. I
>> spent some time looking for a solution and came across TagSoup[1]. The
>> TagSoup library parses a malformed HTML document into a well-formed
>> xml-like HTML document, but it works almost perfectly for our purposes too.
>>
>> The attached patch makes use of TagSoup for parsing input jnlp files.
>>
>> Parsing is currently implemented in two passes. In the first pass,
>> TagSoup reads the "xml" (which can be malformed and hence not really
>> xml), and outputs valid XML. Netx then uses this valid XML and uses it's
>> own XML parser to parse the file.
>>
>> The patch requires TagSoup as an optional dependency. To use TagSoup,
>> run configure (--with-tagsoup can be used to point to a TagSoup jar). To
>> not use TagSoup (even if it installed), use --with-tagsoup=no
>>
>> The patch also adds an additional command line option, -xml ,to the
>> javaws binary. This option can be used to force Netx to use the normal
>> xml parser instead of TagSoup to parse the jnlp file.
>>
>> Any thoughts or comments?
>>
>> ChangeLog:
>> 2011-01-10  Omair Majid<omajid at redhat.com>
>>
>>       * Makefile.am: Add NETX_EXCLUDE_SRCS, NETX_DUMMY_CLASSPATH
>>       (netx-source-files.txt): Selectively exclude some sources from
>>       compilation.
>>       (stamps/netx.stamp): Depend on netx-dummy.jar
>>       (netx-dummy.jar): New target. Empty jar. Used so there is always at
>>       least one class on the classpath.
>>       ($(NETX_DIR)/launcher/%.o): Add classpath.
>>       * NEWS: Update with fix.
>>       * acinclude.m4: Add IT_CHECK_FOR_TAGSOUP.
>>       * configure.ac: Call IT_CHECK_FOR_TAGSOUP.
>>       * netx/net/sourceforge/jnlp/JNLPFile.java: Add new member
>>       parserSettings.
>>       (JNLPFile(URL)): Pass a ParserSettings object.
>>       (JNLPFile(URL,boolean)): Refactored into...
>>       (JNLPFile(URL,ParserSettings)): New method.
>>       (JNLPFile(URL,Version,boolean)): Refactored into...
>>       (JNLPFile(URL,Version,ParserSettings)): New method.
>>       (JNLPFile(URL,Version,boolean,UpdatePolicy)): Refactored into...
>>       (JNLPFile(URL,Version,ParserSettings,UpdatePolicy)): New method.
>>       (JNLPFile(URL,String,Version,boolean,UpdatePolicy)): Refactored
>>       into...
>>       (JNLPFile(URL,String,Version,ParserSettings,UpdatePolicy)): New
>>       method.
>>       (JNLPFile(InputStream,boolean)): Refactored into...
>>       (JNLPFile(InputStream,ParserSettings)): New method.
>>       (getParserSettings): New method.
>>       (parse(Node,boolean,URL)): Refactored into...
>>       (parse(InputStream,URL)): New method. Invoke parser to get the root
>>       node and then parse it.
>>       * netx/net/sourceforge/jnlp/Launcher.java
>>       (toFile): Use new ParserSettings object.
>>       * netx/net/sourceforge/jnlp/Parser.java
>>       (Parser(JNLPFile,URL,Node,boolean,boolean)): Refactored into...
>>       (Parser(JNLPFile,URL,Node,ParserSettings)): New method.
>>       (getRootNode): Implementation moved into XMLParser.getRootNode.
>>       Selects the right subclass of XMLParser to use.
>>       (getEncoding): Moved to XMLParser.
>>       * netx/net/sourceforge/jnlp/ParserSettings.java: New file.
>>       (ParserSettings): New method.
>>       (ParserSettings(boolean,boolean,boolean)): New method.
>>       (isExtensionAllowed): New method.
>>       (isMalfromedXmlAllowed): New method.
>>       (isStrict): New method.
>>       * netx/net/sourceforge/jnlp/XMLParser.java
>>       (getRootNode): New method. Contains implementation from
>>       Parser.getRootNode.
>>       (getEncoding): New method. Moved from Parser.
>>       * netx/net/sourceforge/jnlp/MalformedXMLParser.java: New file.
>>       (getRootNode): New method. Transform input into valid xml and
>>       delegate to parent to parse it.
>>       (xmlizeInputStream): New method. Read contents from an input stream
>>       and transform it into valid xml.
>>       * netx/net/sourceforge/jnlp/resources/Messages.properties: Add
>>       BOXml.
>>       * netx/net/sourceforge/jnlp/runtime/Boot.java: Add -xml option.
>>       (getFile): Parse -xml option and create a new ParserSettings object
>>       based on it.
>>       * netx/net/sourceforge/jnlp/runtime/JNLPClassLoader.java
>>       (getInstance(URL,String,Version,UpdatePolicy)): Refactored into...
>>       (getInstance(URL,String,Version,ParserSettings,UpdatePolicy): New
>>       method.
>>       (initializeExtensions): Use the same parser settings to parse the
>>       extension as used in the original file.
>>
>> Cheers,
>> Omair
>>
>> [1] http://home.ccil.org/~cowan/XML/tagsoup/
>
> I've just looked at the build changes.  I'll leave someone with better knowledge
> of the source code to look at those changes.
>

Thanks for looking over the changes so quickly!

> With Makefile.am, I don't see why NETX_DUMMY_CLASSPATH is needed or the additional
> rule that creates a JAR file.  Neither do you need to set NETX_EXCLUDE_SRCS to empty;
> this is the default.
>

Automake complains if a variable is not set before using += :

NETX_EXCLUDE_SRCS must be set with `=' before using `+='

> if HAVE_TAGSOUP
> NETX_CLASSPATH_ARG=-classpath $(TAGSOUP_JAR)
> NETX_LAUNCHER_ARG="-Xbootclasspath/a:$(TAGSOUP_JAR)"
> else
> NETX_EXCLUDE_SRCS+=net.sourceforge.jnlp.MalformedXMLParser.java
> endif
>
> would work fine and you can drop the netx-dummy.jar rule.
>

Thanks for the idea. What I wanted to do (a little prematurely, I 
suppose) was to make sure that more dependencies could be added in the 
future (with their own configure flags, if necessary) without changing 
the code too much. I also wanted all build code-paths to be as close as 
possible. Which is why I wanted to always have a classpath for 
netx-building (even if it was effectively blank using netx-dummy.jar) 
But my approach just makes the Makefile look like a mess.

> Should we really be putting tagsoup on the bootclasspath? What's wrong with the classpath?
>

I have tested it out now with classpath and it looks like the javaws 
launcher does not like it:
$ javaws XEtchedButtonDemo.jnlp
Unrecognized option: -classpath /usr/share/java/tagsoup.jar
Could not create the Java virtual machine.

There is probably a way around this, I will see if I can find it.

> As to excluding the file, have you tested this?  Are you sure no other Java files pull
> that class in?
>

Yup. This is one code path I made sure to test. MalformedXMLParser is a 
new file I added in this patch. The class is never used directly. Only 
net.sourceforge.jnlp.Parser uses it, and that too through reflection. 
Building (and running) without tagsoup works just fine.

> For configure, the argument should be the path to the jar file.  Otherwise, the JAR file
> always has to be 'tagsoup.jar' which may not be the case.
>

Isnt this already the case? Perhaps I missed something, but the code 
does this: if --with-tagsoup=no then HAVE_TAGSOUP is set to false. if 
--with-tagsoup=somevar then somevar is used as the location of the 
tagsoup.jar. If --with-tagsoup is not used, then /usr/share/java (and 
other locations) are searched for a tagsoup.jar.

> You should also check /usr/share/tagsoup/lib/tagsoup.jar which is the Gentoo installation path.
> Debian uses /usr/share/java/tagsoup.jar as already checked.
>

Ah, thanks. Updated patch attached.

Cheers,
Omair
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tagsoup-02.patch
Type: text/x-patch
Size: 35339 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/distro-pkg-dev/attachments/20110110/87bbd05a/tagsoup-02.patch 


More information about the distro-pkg-dev mailing list