Wrong encoding after XML identity transformation

Nico R. n-roeser at gmx.net
Sat Mar 29 16:01:12 UTC 2014


Hello!

I’m using IcedTea 2.4.5 for OpenJDK 7 on Gentoo. This includes JAXP
revision 8fe156ad49e2 (in the IcedTea repos), which seems to
contain jdk7u51-b31[0], which in turn is revision 626e76f127a4 in the
OpenJDK repo jdk7u, as far as I can see. You’ll probably know
better about all these version numbers than I do.

Anyway, after a painful debugging session I found that the default XML
transformer implementation (via XSLTC) handles encodings improperly when
writing in-memory DOM Documents (which had an encoding other than UTF-8
specified when being parsed) to a stream.


I’m attaching my test code, which I hope is correct and readable. What
it does:

• Read a document with encoding="ISO-8859-1" from an input stream into a
DOM Document. The input document itself does not contain any characters
outside US-ASCII, which is a subset of ISO-8859-1.

• Add a text node with the text “schön” (German for “nice”) to the
document. The “ö” in “schön” is LATIN SMALL LETTER O WITH DIAERESIS
(U+00F6). This can, of course, be stored in the in-memory document tree,
but may need character conversion when the document is serialized later.

• Use a Transformer with output properties set to XML in UTF-8 for
writing the document into a stream using an identity transformation.
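
The three steps above can be sketched roughly as follows. This is not the
attached test program, only a minimal reconstruction; the class name
IdentityEncodingSketch, the method name roundTrip and the <root/> input
document are made up for illustration. The “ö” is written as \u00f6 so
that the source file encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical sketch, not the attached test code.
final class IdentityEncodingSketch {

    // Parses an ISO-8859-1 document, adds "schön", and serializes it
    // through an identity transformation that asks for UTF-8 output.
    static byte[] roundTrip() {
        try {
            // Step 1: pure US-ASCII input that declares ISO-8859-1.
            String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>";
            byte[] input = xml.getBytes(StandardCharsets.ISO_8859_1);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(input));

            // Step 2: add a text node containing U+00F6.
            doc.getDocumentElement().appendChild(doc.createTextNode("sch\u00f6n"));

            // Step 3: identity transformation with XML/UTF-8 requested.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        // Decode as UTF-8 only for display; check the encoding="..." value
        // in the XML declaration of the printed output.
        System.out.println(new String(roundTrip(), StandardCharsets.UTF_8));
    }
}
```

The interesting part of the output is the encoding pseudo-attribute in the
XML declaration: it should read UTF-8, matching the bytes actually written.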


I compared Xalan-J 2.7.1 and the internal implementation (older Xalan?)
in the JRE installed with my version of OpenJDK (see above). External
Xalan produces documents with XML encoding="UTF-8", while the
JRE-internal Xalan keeps encoding="ISO-8859-1" in the XML declaration,
*but writes the “ö” encoded in UTF-8*! The declaration and the actual
bytes then disagree, so an XML parser that processes the document later
decodes the “ö” incorrectly.
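
To illustrate what a later reader of such a file sees, here is a small
sketch that encodes “schön” as UTF-8 and then decodes the bytes as
ISO-8859-1, which is what a parser trusting the declaration would do. The
class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the declaration/bytes mismatch.
final class MisreadSketch {

    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Print code points instead of characters so that the console
        // encoding does not distort the result.
        for (char c : misread("sch\u00f6n").toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The single “ö” (U+00F6) becomes the two bytes C3 B6 in UTF-8, and each of
those bytes is a character of its own in ISO-8859-1 (U+00C3 and U+00B6),
so the text reads as “schÃ¶n”.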

The transformer should use UTF-8, as I requested in the code. If I had
not specifically requested anything, it might also have used ISO-8859-1,
provided it then transcoded all characters into that encoding.

In order to use the attached test program, put xalan.jar and xsltc.jar
from Xalan-J into your classpath. Even XSLTC from Xalan-J 2.7.1 works,
just not the JRE-internal one.

My default locale has UTF-8 encoding, in case that matters.
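
For anyone reproducing this, a quick way to see which implementation and
default charset are in effect is sketched below. The class name
EnvironmentSketch is made up. I am assuming that the JRE-internal
implementation shows up with a class name under
com.sun.org.apache.xalan.internal and Xalan-J with one under
org.apache.xalan; please check this against your own output.

```java
import java.nio.charset.Charset;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper for reporting the environment of a test run.
final class EnvironmentSketch {

    // Name of the TransformerFactory class that JAXP picked.
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("TransformerFactory: " + factoryClassName());
        System.out.println("Default charset:    " + Charset.defaultCharset());
    }
}
```

Running it once with and once without the Xalan-J jars on the classpath
should show whether the external implementation is actually picked up.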


[0] http://icedtea.classpath.org/hg/release/icedtea7-forest-2.4/jaxp/rev/8fe156ad49e2
[1] http://hg.openjdk.java.net/jdk7u/jdk7u/rev/a831c212ee26
-- 
Nico



More information about the core-libs-dev mailing list