RFR: 8354266: Fix non-UTF-8 text encoding

Fri Apr 11 10:27:40 UTC 2025

On Fri, 11 Apr 2025 03:35:11 GMT, Sergey Bylokhov <serb at openjdk.org> wrote:

>> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. 
>> 
>> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed.
>> 
>> Methodology used: 
>> 
>> I have run four different tools, each using different heuristics to determine the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file --mime-encoding`
>> 
>> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further:
>> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>> 
>> From the remaining list of non-ASCII, non-known-binary files I selected two overlapping and exhaustive subsets:
>> * All files where at least one tool claimed it to be UTF-8
>> * All files where at least one tool claimed it to be *not* UTF-8
>> 
>> For the first subset, I checked every non-ASCII character (by running `LC_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)` and visually examining the results). At this stage, I found several files where Unicode was unnecessarily used instead of pure ASCII, and I treated those files separately. Other than that, my inspection revealed no obvious encoding errors. This list comprised about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay.
>> 
>> For the second subset, I checked every non-ASCII character (using the same method). This list contained 300+ files. Most of them were okay as far as I can tell; I can confirm encodings for European languages 100%, but CJK encodings could theoretically be wrong; they looked sane but I cannot read them and confirm fully. Several were in fact pure...
>
> src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11:
> 
>> 9: ØªØ®ØµØµ Ø§Ù„Ø´ÙØ±Ø© Ø§Ù„Ù…ÙˆØØ¯Ø© "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" Ø±Ù‚Ù…Ø§ ÙˆØÙŠØ¯Ø§ Ù„ÙƒÙ„ Ù…ØØ±Ù ÙÙŠ Ø¬Ù…ÙŠØ¹ Ø§Ù„Ù„ØºØ§Øª Ø§Ù„Ø¹Ø§Ù„Ù…ÙŠØ©ØŒ ÙˆØ°Ù„Ùƒ Ø¨ØºØ¶ Ø§Ù„Ù†Ø¸Ø± Ø¹Ù† Ù†ÙˆØ¹ Ø§Ù„ØØ§Ø³ÙˆØ¨ Ø£Ùˆ Ø§Ù„Ø¨Ø±Ø§Ù…Ø¬ Ø§Ù„Ù…Ø³ØªØ®Ø¯Ù…Ø©. ÙˆÙ‚Ø¯ ØªÙ€Ù… ØªØ¨Ù†ÙŠ Ù…ÙˆØ§ØµÙØ© "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" Ù…Ù€Ù€Ù† Ù‚Ø¨Ù€Ù„ Ù‚Ø§Ø¯Ø© Ø§Ù„ØµØ§Ù†Ø¹ÙŠÙ† Ù„Ø£Ù†Ø¸Ù…Ø© Ø§Ù„ØÙˆØ§Ø³ÙŠØ¨ ÙÙ€ÙŠ Ø§Ù„Ø¹Ø§Ù„Ù…ØŒ Ù…Ø«Ù„ Ø´Ø±ÙƒØ§Øª Ø¢ÙŠ.Ø¨ÙŠ.Ø¥Ù…. (IBM)ØŒ Ø£Ø¨Ù€Ù„ (APPLE)ØŒ Ù‡ÙÙŠÙ€Ù’ÙˆÙ„ÙÙ€Øª Ø¨Ù€Ø§ÙƒÙ€Ø±Ø¯ (Hewlett-Packard) ØŒ Ù…Ø§ÙŠÙƒØ±ÙˆØ³ÙˆÙØª (Microsoft)ØŒ Ø£ÙˆØ±Ø§ÙƒÙÙ€Ù„ (Oracle) ØŒ ØµÙ† (Sun) ÙˆØºÙŠØ±Ù‡Ø§. ÙƒÙ…Ø§ Ø£Ù† Ø§Ù„Ù…ÙˆØ§ØµÙØ§Øª ÙˆØ§Ù„Ù…Ù‚Ø§ÙŠÙŠØ³ Ø§Ù„ØØ¯ÙŠØ«Ø© (Ù…Ø«Ù„ Ù„ØºØ© Ø§Ù„Ø¨Ø±Ù…Ø¬Ø© "Ø¬Ø§ÙØ§" "JAVA" ÙˆÙ„ØºØ© "Ø¥ÙƒØ³ Ø¥Ù… Ø¥Ù„" "XML" Ø§Ù„ØªÙŠ ØªØ³ØªØ®Ø¯Ù… Ù„Ø¨Ø±Ù…Ø¬Ø© Ø§Ù„Ø§Ù†ØªØ±Ù†ÙŠØª) ØªØªØ·Ù„Ø¨ Ø§Ø³ØªØ®Ø¯Ø§Ù… "ÙŠÙˆÙ†ÙÙƒÙˆØ¯". Ø¹Ù„Ø§ÙˆØ© Ø¹Ù„Ù‰ Ø°Ù„Ùƒ ØŒ ÙØ¥Ù† "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" Ù‡ÙŠ Ø§Ù„Ø·Ù€Ø±ÙŠÙ€Ù‚Ù€Ø© Ø§Ù„Ø±Ø³Ù€Ù…ÙŠØ© Ù„ØªØ·Ø¨ÙŠÙ‚ Ø§Ù„Ù…Ù‚ÙŠÙ€Ø§Ø³ Ø§Ù„Ù€Ø¹Ù€Ø§Ù„Ù€Ù…ÙŠ Ø¥ÙŠØ²Ùˆ Ù¡Ù Ù
 ¦Ù¤Ù¦  (ISO 10646) .
>> 10: 
>> 11: Ø¥Ù† Ø¨Ø²ÙˆØº Ù…ÙˆØ§ØµÙØ© "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" ÙˆØªÙˆÙÙ‘ÙØ± Ø§Ù„Ø£Ù†Ø¸Ù…Ø© Ø§Ù„ØªÙŠ ØªØ³ØªØ®Ø¯Ù…Ù‡ ÙˆØªØ¯Ø¹Ù…Ù‡ØŒ ÙŠØ¹ØªØ¨Ø± Ù…Ù† Ø£Ù‡Ù… Ø§Ù„Ø§Ø®ØªØ±Ø§Ø¹Ø§Øª Ø§Ù„ØØ¯ÙŠØ«Ø© ÙÙŠ Ø¹ÙˆÙ„Ù…Ø© Ø§Ù„Ø¨Ø±Ù…Ø¬ÙŠØ§Øª Ù„Ø¬Ù…ÙŠØ¹ Ø§Ù„Ù„ØºØ§Øª ÙÙŠ Ø§Ù„Ø¹Ø§Ù„Ù…. ÙˆØ¥Ù† Ø§Ø³ØªØ®Ø¯Ø§Ù… "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" ÙÙŠ Ø¹Ø§Ù„Ù… Ø§Ù„Ø§Ù†ØªØ±Ù†ÙŠØª Ø³ÙŠØ¤Ø¯ÙŠ Ø¥Ù„Ù‰ ØªÙˆÙÙŠØ± ÙƒØ¨ÙŠØ± Ù…Ù‚Ø§Ø±Ù†Ø© Ù…Ø¹ Ø§Ø³ØªØ®Ø¯Ø§Ù… Ø§Ù„Ù…Ø¬Ù…ÙˆØ¹Ø§Øª Ø§Ù„ØªÙ‚Ù„ÙŠØ¯ÙŠØ© Ù„Ù„Ù…ØØ§Ø±Ù Ø§Ù„Ù…Ø´ÙØ±Ø©. ÙƒÙ…Ø§ Ø£Ù† Ø§Ø³ØªØ®Ø¯Ø§Ù… "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" Ø³ÙŠÙÙ…ÙƒÙ‘ÙÙ† Ø§Ù„Ù…Ø¨Ø±Ù…Ø¬ Ù…Ù† ÙƒØªØ§Ø¨Ø© Ø§Ù„Ø¨Ø±Ù†Ø§Ù…Ø¬ Ù…Ø±Ø© ÙˆØ§ØØ¯Ø©ØŒ ÙˆØ§Ø³ØªØ®Ø¯Ø§Ù…Ù‡ Ø¹Ù„Ù‰ Ø£ÙŠ Ù†ÙˆØ¹ Ù…Ù† Ø§Ù„Ø£Ø¬Ù‡Ø²Ø© Ø£Ùˆ Ø§Ù„Ø£Ù†Ø¸Ù…Ø©ØŒ ÙˆÙ„Ø£ÙŠ Ù„ØºØ© Ø£Ùˆ Ø¯ÙˆÙ„Ø© ÙÙŠ Ø§Ù„Ø¹Ø§Ù„Ù… Ø£ÙŠÙ†Ù…Ø§ ÙƒØ§Ù†ØªØŒ Ø¯ÙˆÙ† Ø§Ù„ØØ§Ø¬Ø© Ù„Ø¥Ø¹Ø§Ø¯Ø© Ø§Ù„Ø¨Ø±Ù…Ø¬Ø© Ø£Ùˆ Ø¥Ø¬Ø±Ø§Ø¡ Ø£ÙŠ ØªØ¹Ø¯ÙŠÙ„. ÙˆØ£Ø®ÙŠØ±Ø§ØŒ ÙØ¥Ù† Ø§Ø³ØªØ®Ø¯Ø§Ù… "ÙŠÙˆÙ†ÙÙƒÙˆØ¯" Ø³ÙŠÙ…ÙƒÙ† Ø§Ù„Ø¨ÙŠØ§Ù†Ø§Øª Ù…Ù† Ø§Ù„Ø§Ù†ØªÙ‚Ø§Ù„ Ø¹Ø¨Ø± Ø§Ù„Ø£Ù†Ø¸Ù…Ø© ÙˆØ§Ù„Ø£Ø¬Ù‡Ø²Ø© Ø§Ù„Ù…Ø®ØªÙ„ÙØ© Ø¯ÙˆÙ† Ø£
 ÙŠ Ø®Ø·ÙˆØ±Ø© Ù„ØªØØ±ÙŠÙÙ‡Ø§ØŒ Ù…Ù‡Ù…Ø§ ØªØ¹Ø¯Ø¯Øª Ø§Ù„Ø´Ø±ÙƒØ§Øª Ø§Ù„ØµØ§Ù†Ø¹Ø© Ù„Ù„Ø£Ù†Ø¸Ù…Ø© ÙˆØ§Ù„Ù„ØºØ§ØªØŒ ÙˆØ§Ù„Ø¯ÙˆÙ„ Ø§Ù„ØªÙŠ ØªÙ…Ø± Ù…Ù† Ø®Ù„Ø§Ù„Ù‡Ø§ Ù‡Ø°Ù‡ Ø§Ù„Ø¨ÙŠØ§Ù†Ø§Øª.
> 
> Looks like most of the changes in java2d/* are related to spaces at the end of the line?

No, those are just incidental changes (see https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The actual change for the java2d files is the removal of the initial UTF-8 BOM. GitHub has a hard time showing this, though, since the BOM is not visible.
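
Since the BOM does not show up in a rendered diff, one way to confirm the change locally is to look at the first three bytes of the file: a UTF-8 BOM is the byte sequence `EF BB BF`. The following is a minimal sketch of such a check in Python, not the method used in the PR; the function names are hypothetical and chosen only for illustration.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def strip_utf8_bom(data):
    """Return `data` (bytes) without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Running `has_utf8_bom` on a java2d text data file before and after the patch should show the difference that the diff view hides, assuming the BOM removal is the only change to the start of the file.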

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039258980