RFR: 6928542: Chinese characters in RTF are not decoded [v7]
Prasanta Sadhukhan
psadhukhan at openjdk.org
Thu Oct 26 12:47:35 UTC 2023
On Thu, 21 Sep 2023 16:21:05 GMT, Ichiroh Takiguchi <itakiguchi at openjdk.org> wrote:
>> The "character set of font" (font charset) table was created from "Rich Text Format Specification 1.9.1":
>> https://interoperability.blob.core.windows.net/files/Archive_References/[MSFT-RTF].pdf
>> It refers to wingdi.h:
>> https://learn.microsoft.com/en-us/windows/win32/api/wingdi/ns-wingdi-textmetrica
>>
>> Test files and testcase are in bugid [JDK-6928542](https://bugs.openjdk.org/browse/JDK-6928542)
>>
>> Additional change:
>> The special character `\line` should be decoded as `\n`.
>>
>> Additional information:
>>
>> Add 2 hash tables:
>> - fcharsetToCP: predefined conversion table that maps the number of an `fcharset` control word to a Java charset name; for example, `fcharset0` maps to the Java charset name `windows-1252`
>> - fcharsetTable: per-RTF-file conversion table that maps the integer font number of an `f` control word to the font's Java Charset; for example, with `{\f0\fnil\fcharset0 Segoe UI;}`, `0` maps to the Java Charset `windows-1252`
>>
>> When an RTF character set control word (like `\mac`) is used, an unmappable character returns \u0000 and is not written into the RTF text.
>> When an `fcharset` control word is used, an unmappable character returns \uFFFD (the same as the decoder's replacement character); \u0000 is used for DBCS lead byte detection.
>> If an `f` or `par` control word appears while a lead byte remains in the decoder's byte buffer, that byte is treated as an invalid character and \uFFFD is written into the RTF text.
>>
>> If the `f` control word is used without `fcharset`, the `translationTable` char array is used.
>> If the `f` control word is used with `fcharset`, the predefined Java Charset name is used (if it is missing, ISO8859_1 is used as a fallback).
>>
>> **Note:** The following GitHub action failed:
>> linux-cross-compile / build (riscv64). I opened the following JBS issue:
>>> [JDK-8314624](https://bugs.openjdk.org/browse/JDK-8314624) GHA: RISC-V cross-build was failed
>
> Ichiroh Takiguchi has updated the pull request incrementally with one additional commit since the last revision:
>
> 6928542: Chinese characters in RTF are not decoded
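
The two tables described in the quoted PR description could be sketched roughly as follows. This is only an illustration under stated assumptions: the class, field, and method names are hypothetical and not taken from the patch, and the predefined table is a small subset (0, 161, and 204 mapped to the Windows code pages that the RTF specification lists for them).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the code from the pull request.
class FontCharsetSketch {
    // Predefined table: \fcharsetN number -> Java charset name (illustrative subset).
    static final Map<Integer, String> FCHARSET_TO_CP = Map.of(
            0, "windows-1252",
            161, "windows-1253",
            204, "windows-1251");

    // Per-document table: font number from \fN -> Charset used to decode its text.
    final Map<Integer, Charset> fontCharsets = new HashMap<>();

    // Record the charset for a font table entry such as {\f0\fnil\fcharset0 Segoe UI;}.
    void defineFont(int fontNumber, int fcharset) {
        String name = FCHARSET_TO_CP.get(fcharset);
        // Fall back to ISO-8859-1 when the name is unknown or unsupported.
        Charset cs = StandardCharsets.ISO_8859_1;
        if (name != null && Charset.isSupported(name)) {
            cs = Charset.forName(name);
        }
        fontCharsets.put(fontNumber, cs);
    }

    public static void main(String[] args) {
        FontCharsetSketch sketch = new FontCharsetSketch();
        sketch.defineFont(0, 0);    // known fcharset
        sketch.defineFont(1, 999);  // unknown fcharset, uses the fallback
        System.out.println(sketch.fontCharsets);
    }
}
```
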
For me, the added regression test still fails with the fix on Windows 10. Is there anything more I need to do as a prerequisite?
Read data
=========
Gr\\u00fcezi - Switzerland 0
\\u0082\\u00b1\\u0082\\u00f1\\u0082\\u00c9\\u0082\\u00bf\\u0082\\u00cd - Japanese 128
\\u00be\\u00c8\\u00b3\\u00e7\\u00c7\\u00cf\\u00bc\\u00bc\\u00bf\\u00e4 - Korean 129
\\u00c4\\u00e3\\u00ba\\u00c3 - China 134
\\u00bbO\\u00c6W - Traditional Chinese - Taiwan 136
\\u00e3\\u00e5\\u00e9\\u00e1 \\u00f3\\u00ef\\u00f5 - Greek 161
A\\u00f0a\\u00e7 - Turkish (Tree) 162
\\u00fe - Vietnam currency 163
\\u00f9\\u00c8\\u00d1\\u00ec\\u00e5\\u00c9\\u00ed - Hebrew 177
\\u00e3\\u00d1\\u00cd\\u00c8\\u00c7 - Arabic 178
A\\u00e8i\\u00fb - Lithuanian (Thank you) 186
\\u00c7\\u00e4\\u00f0\\u00e0\\u00e2\\u00f1\\u00f2\\u00e2\\u00f3\\u00e9\\u00f2\\u00e5 - Russian 204
\\u00ca\\u00c7\\u00d1\\u00ca\\u00b4\\u00d5 - Thailand 222
cze\\uc48f - Polish 238

Expected data
=============
Gr\\u00fcezi - Switzerland 0
\\u3053\\u3093\\u306b\\u3061\\u306f - Japanese 128
\\uc548\\ub155\\ud558\\uc138\\uc694 - Korean 129
\\u4f60\\u597d - China 134
\\u81fa\\u7063 - Traditional Chinese - Taiwan 136
\\u03b3\\u03b5\\u03b9\\u03b1 \\u03c3\\u03bf\\u03c5 - Greek 161
A\\u011fa\\u00e7 - Turkish (Tree) 162
\\u20ab - Vietnam currency 163
\\u05e9\\u05b8\\u05c1\\u05dc\\u05d5\\u05b9\\u05dd - Hebrew 177
\\u0645\\u0631\\u062d\\u0628\\u0627 - Arabic 178
A\\u010di\\u016b - Lithuanian (Thank you) 186
\\u0417\\u0434\\u0440\\u0430\\u0432\\u0441\\u0442\\u0432\\u0443\\u0439\\u0442\\u0435 - Russian 204
\\u0e2a\\u0e27\\u0e31\\u0e2a\\u0e14\\u0e35 - Thailand 222
cze\\u015b\\u0107 - Polish 238

java.lang.RuntimeException: Test failed
at RTFReadFontCharsetTest.main(RTFReadFontCharsetTest.java:114)
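
One way to read the output above: in most rows, each char of the read data looks like a single raw byte passed through unchanged, as if no font charset had been applied. The sketch below checks that reading for the Japanese and Greek rows by turning the chars back into bytes and decoding them again. The class and method names are hypothetical, and the choice of `Shift_JIS` for fcharset 128 and `windows-1253` for fcharset 161 is an assumption (the patch may use different Java charset names); it also assumes the runtime ships both charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic sketch, not part of the regression test.
class RereadAsCharset {
    // Treat each char of the misread string as one raw byte (ISO-8859-1 is a
    // one-to-one mapping for U+0000..U+00FF), then decode those bytes again
    // with the named charset.
    static String redecode(String misread, String charsetName) {
        byte[] raw = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Read data" rows for Japanese 128 and (first word of) Greek 161.
        String japanese = redecode(
                "\u0082\u00b1\u0082\u00f1\u0082\u00c9\u0082\u00bf\u0082\u00cd", "Shift_JIS");
        String greek = redecode("\u00e3\u00e5\u00e9\u00e1", "windows-1253");

        // Compare against the matching "Expected data" rows.
        System.out.println(japanese.equals("\u3053\u3093\u306b\u3061\u306f"));
        System.out.println(greek.equals("\u03b3\u03b5\u03b9\u03b1"));
    }
}
```

If both comparisons hold, the bytes in the test file are fine and only the charset selection is missing at read time, which would be consistent with the test running against a build that does not contain the fix. That is only one possible explanation.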
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13553#issuecomment-1781050285