RFR (JAXP): 8035469 : Xerces Update: EncodingMap does not recognise Java-style encodings Cp1141-Cp1149

Mon Mar 3 16:58:21 UTC 2014

Hi Alan and all,

Thanks for bringing up the testing issue.

For the Xerces Update project, I'd like to share our process for adding 
or creating tests. The basis of the process is that, given that the 
Xerces release has been out for over 3 years and its patches have 
generally been verified by users, we trust the quality of the source. We 
will carefully verify the changes, run regression tests, and add tests 
where necessary.

1) If tests were attached to a fixed issue, they should be brought in 
and converted to OpenJDK format;

2) If tests were described in the original bug report, create a test 
based upon it;

3) Create tests, or verify existing tests, if there is a conflict or an 
overlapping changeset while merging the sources;

4) Create tests if compatibility is a concern;

5) If there was no test in the Xerces report, we may skip adding new 
tests if the changes were minor or obvious (e.g. a typo).

I would think JDK-8035469 falls into the last category. In terms of 
improving the coverage of encoding mapping, we can bring it up with the 
SQE team.
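
For reference, the kind of test David describes below could be sketched 
roughly as follows. This is only an illustration, not part of the webrev: 
the class and method names are hypothetical, and it assumes the JDK at 
hand ships the IBM0114x charsets (they are extended charsets and may be 
absent from a reduced runtime). Rather than hand-crafting a sample file, 
it encodes the whole document, XML declaration included, in the target 
charset and then hands the bytes to the parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only; not part of the webrev.
final class EbcdicXmlSketch {

    // Builds a one-line XML document and encodes all of it, including the
    // XML declaration, in the charset named by ianaName.
    static byte[] encode(String ianaName, String text) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + ianaName + "\"?>"
                + "<root>" + text + "</root>";
        return xml.getBytes(Charset.forName(ianaName));
    }

    // Parses the bytes with the default JAXP parser and returns the text
    // content of the root element.
    static String parse(byte[] bytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Walk IBM01140 through IBM01149, skipping any charset this
        // runtime does not provide.
        for (int i = 0; i <= 9; i++) {
            String name = "IBM0114" + i;
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": charset not available, skipped");
                continue;
            }
            try {
                String result = parse(encode(name, "hello"));
                System.out.println(name + ": "
                        + ("hello".equals(result) ? "ok" : "mismatch: " + result));
            } catch (RuntimeException e) {
                System.out.println(name + ": parse failed: " + e.getCause());
            }
        }
    }
}
```

The document is kept on a single line on purpose, to stay clear of 
EBCDIC line-end handling. A fuller test would probably also include 
characters that differ between the code pages, such as the euro sign.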

Thanks,
Joe

On 3/1/2014 10:12 AM, David Li wrote:
> Joe probably knows more about this, but we did some preliminary 
> investigation summarized below.
>
> One test that was considered was creating an XML file encoded in one 
> of the formats and then seeing if the parser would process the file 
> after our updates were added.  This looked like it requires generating 
> sample XML files with characters from the actual encoding, which we 
> could not figure out in a reasonable amount of time.  It's not 
> sufficient to specify the encoding in the XML header (<?xml 
> version="1.0" encoding="CP1140"?>, also tried IBM01140) if all the 
> text in the file is UTF-8, since the parser complains.  It was decided 
> that since the changes were minor, and the original Xerces bug did not 
> include any tests or any way of reproducing the error, we would not 
> spend too much time on the issue.  For reference, the 
> IBM01140-IBM01149 encodings appear to cover various European languages; 
> see http://www.iana.org/assignments/character-sets/character-sets.xhtml
>
> - David
>
> On 3/1/2014 1:06 AM, Alan Bateman wrote:
>> On 28/02/2014 22:11, David Li wrote:
>>> Hi,
>>>
>>> This is an update from Xerces for a fixed encoding map entry in file 
>>> EncodingMap.java.  For details, please refer to: 
>>> https://bugs.openjdk.java.net/browse/JDK-8035469
>>>
>>> Webrevs: http://cr.openjdk.java.net/~joehw/jdk9/8035469/webrev/
>>> (I don't have an OpenJDK username yet, so Joe Wang uploaded it)
>>>
>>> No new tests since the change is minor.  There were no tests from 
>>> Apache fixes.
>> Maybe this is a question for Joe but I wonder if it would be possible 
>> to create a test that exercises these encodings? I realize the change 
>> is minor but it is also subtle and this may be an area where we 
>> should have better tests.
>>
>> -Alan
>