[icedtea-web] RFC: use tagsoup to try and parse malformed JNLP files
Omair Majid
omajid at redhat.com
Mon Jan 10 16:10:16 PST 2011
On 01/10/2011 05:33 PM, Dr Andrew John Hughes wrote:
> On 16:39 Mon 10 Jan , Omair Majid wrote:
>> Hi,
>>
>> I have come across a number of JNLP files that are not valid xml. Netx
>> can not parse these files using a xml parser, and fails to run them. I
>> spent some time looking for a solution and came across TagSoup[1]. The
>> TagSoup library parses a malformed HTML document into a well-formed
>> xml-like HTML document, but it works almost perfectly for our purposes too.
>>
>> The attached patch makes use of TagSoup for parsing input jnlp files.
>>
>> Parsing is currently implemented in two passes. In the first pass,
>> TagSoup reads the "xml" (which can be malformed and hence not really
>> xml), and outputs valid XML. Netx then uses this valid XML and uses it's
>> own XML parser to parse the file.
>>
>> The patch requires TagSoup as an optional dependency. To use TagSoup,
>> run configure (--with-tagsoup can be used to point to a TagSoup jar). To
>> not use TagSoup (even if it installed), use --with-tagsoup=no
>>
>> The patch also adds an additional command line option, -xml ,to the
>> javaws binary. This option can be used to force Netx to use the normal
>> xml parser instead of TagSoup to parse the jnlp file.
>>
>> Any thoughts or comments?
>>
>> ChangeLog:
>> 2011-01-10 Omair Majid<omajid at redhat.com>
>>
>> * Makefile.am: Add NETX_EXCLUDE_SRCS, NETX_DUMMY_CLASSPATH
>> (netx-source-files.txt): Selectively exclude some sources from
>> compilation.
>> (stamps/netx.stamp): Depend on netx-dummy.jar
>> (netx-dummy.jar): New target. Empty jar. Used so there is always at
>> least one class on the classpath.
>> ($(NETX_DIR)/launcher/%.o): Add classpath.
>> * NEWS: Update with fix.
>> * acinclude.m4: Add IT_CHECK_FOR_TAGSOUP.
>> * configure.ac: Call IT_CHECK_FOR_TAGSOUP.
>> * netx/net/sourceforge/jnlp/JNLPFile.java: Add new member
>> parserSettings.
>> (JNLPFile(URL)): Pass a ParserSettings object.
>> (JNLPFile(URL,boolean)): Refactored into...
>> (JNLPFile(URL,ParserSettings)): New method.
>> (JNLPFile(URL,Version,boolean)): Refactored into...
>> (JNLPFile(URL,Version,ParserSettings)): New method.
>> (JNLPFile(URL,Version,boolean,UpdatePolicy)): Refactored into...
>> (JNLPFile(URL,Version,ParserSettings,UpdatePolicy)): New method.
>> (JNLPFile(URL,String,Version,boolean,UpdatePolicy)): Refactored
>> into...
>> (JNLPFile(URL,String,Version,ParserSettings,UpdatePolicy)): New
>> method.
>> (JNLPFile(InputStream,boolean)): Refactored into...
>> (JNLPFile(InputStream,ParserSettings)): New method.
>> (getParserSettings): New method.
>> (parse(Node,boolean,URL)): Refactored into...
>> (parse(InputStream,URL)): New method. Invoke parser to get the root
>> node and then parse it.
>> * netx/net/sourceforge/jnlp/Launcher.java
>> (toFile): Use new ParserSettings object.
>> * netx/net/sourceforge/jnlp/Parser.java
>> (Parser(JNLPFile,URL,Node,boolean,boolean)): Refactored into...
>> (Parser(JNLPFile,URL,Node,ParserSettings)): New method.
>> (getRootNode): Implementation moved into XMLParser.getRootNode.
>> Selects the right subclass of XMLParser to use.
>> (getEncoding): Moved to XMLParser.
>> * netx/net/sourceforge/jnlp/ParserSettings.java: New file.
>> (ParserSettings): New method.
>> (ParserSettings(boolean,boolean,boolean)): New method.
>> (isExtensionAllowed): New method.
>> (isMalfromedXmlAllowed): New method.
>> (isStrict): New method.
>> * netx/net/sourceforge/jnlp/XMLParser.java
>> (getRootNode): New method. Contains implementation from
>> Parser.getRootNode.
>> (getEncoding): New method. Moved from Parser.
>> * netx/net/sourceforge/jnlp/MalformedXMLParser.java: New file.
>> (getRootNode): New method. Transform input into valid xml and
>> delegate to parent to parse it.
>> (xmlizeInputStream): New method. Read contents from an input stream
>> and transform it into valid xml.
>> * netx/net/sourceforge/jnlp/resources/Messages.properties: Add
>> BOXml.
>> * netx/net/sourceforge/jnlp/runtime/Boot.java: Add -xml option.
>> (getFile): Parse -xml option and create a new ParserSettings object
>> based on it.
>> * netx/net/sourceforge/jnlp/runtime/JNLPClassLoader.java
>> (getInstance(URL,String,Version,UpdatePolicy)): Refactored into...
>> (getInstance(URL,String,Version,ParserSettings,UpdatePolicy): New
>> method.
>> (initializeExtensions): Use the same parser settings to parse the
>> extension as used in the original file.
>>
>> Cheers,
>> Omair
>>
>> [1] http://home.ccil.org/~cowan/XML/tagsoup/
>
> I've just looked at the build changes. I'll leave someone with better knowledge
> of the source code to look at those changes.
>
Thanks for looking over the changes so quickly!
> With Makefile.am, I don't see why NETX_DUMMY_CLASSPATH is needed or the additional
> rule that creates a JAR file. Neither do you need to set NETX_EXCLUDE_SRCS to empty;
> this is the default.
>
Automake complains if a variable is not set before using += :
NETX_EXCLUDE_SRCS must be set with `=' before using `+='
> if HAVE_TAGSOUP
> NETX_CLASSPATH_ARG=-classpath $(TAGSOUP_JAR)
> NETX_LAUNCHER_ARG="-Xbootclasspath/a:$(TAGSOUP_JAR)"
> else
> NETX_EXCLUDE_SRCS+=net.sourceforge.jnlp.MalformedXMLParser.java
> endif
>
> would work fine and you can drop the netx-dummy.jar rule.
>
Thanks for the idea. What I wanted to do (a little prematurely, I
suppose) was to make sure that more dependencies could be added in the
future (with their own configure flags, if necessary) without changing
the code too much. I also wanted all build code-paths to be as close as
possible. Which is why I wanted to always have a classpath for
netx-building (even if it was effectively blank using netx-dummy.jar)
But my approach just makes the Makefile look like a mess.
> Should we really be putting tagsoup on the bootclasspath? What's wrong with the classpath?
>
I have tested it out now with classpath and it looks like the javaws
launcher does not like it:
$ javaws XEtchedButtonDemo.jnlp
Unrecognized option: -classpath /usr/share/java/tagsoup.jar
Could not create the Java virtual machine.
There is probably a way around this, I will see if I can find it.
> As to excluding the file, have you tested this? Are you sure no other Java files pull
> that class in?
>
Yup. This is one code path I made sure to test. MalformedXMLParser is a
new file I added in this patch. The class is never used directly. Only
net.sourceforge.jnlp.Parser uses it, and that too through reflection.
Building (and running) without tagsoup works just fine.
> For configure, the argument should be the path to the jar file. Otherwise, the JAR file
> always has to be 'tagsoup.jar' which may not be the case.
>
Isnt this already the case? Perhaps I missed something, but the code
does this: if --with-tagsoup=no then HAVE_TAGSOUP is set to false. if
--with-tagsoup=somevar then somevar is used as the location of the
tagsoup.jar. If --with-tagsoup is not used, then /usr/share/java (and
other locations) are searched for a tagsoup.jar.
> You should also check /usr/share/tagsoup/lib/tagsoup.jar which is the Gentoo installation path.
> Debian uses /usr/share/java/tagsoup.jar as already checked.
>
Ah, thanks. Updated patch attached.
Cheers,
Omair
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tagsoup-02.patch
Type: text/x-patch
Size: 35339 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/distro-pkg-dev/attachments/20110110/87bbd05a/tagsoup-02.patch
More information about the distro-pkg-dev
mailing list