RFR: 8354266: Fix non-UTF-8 text encoding
Sergey Bylokhov
serb at openjdk.org
Fri Apr 11 03:37:29 UTC 2025
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie <ihse at openjdk.org> wrote:
> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found.
>
> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed.
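
A minimal sketch of how a UTF-8 BOM can be detected and removed, for illustration only. The helper names `has_utf8_bom` and `strip_utf8_bom` are hypothetical and are not taken from the PR; the sketch assumes the file content is already available as bytes.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other data is unchanged."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Working on bytes rather than decoded text keeps the check independent of whatever encoding the rest of the file turns out to use.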
>
> Methodology used:
>
> I have run four different tools that use different heuristics to determine the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file --mime-encoding`
>
> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further:
> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>
> From the remaining list of non-ASCII, non-known-binary files I selected two overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
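
The selection above can be sketched as two set comprehensions. This is an illustration under assumptions: the `verdicts` mapping (file name to the list of encodings reported by each tool) and the function name `split_by_verdict` are hypothetical, and the real tools report encoding names in differing formats that would need normalizing first.

```python
def split_by_verdict(verdicts):
    """Split files into 'some tool said UTF-8' and 'some tool said not UTF-8'.

    verdicts maps a file name to the encoding names reported by each tool.
    The two sets overlap when the tools disagree about a file.
    """
    claimed_utf8 = {
        name for name, encodings in verdicts.items()
        if any(enc.lower() == "utf-8" for enc in encodings)
    }
    claimed_other = {
        name for name, encodings in verdicts.items()
        if any(enc.lower() != "utf-8" for enc in encodings)
    }
    return claimed_utf8, claimed_other
```

Every file with at least one verdict lands in at least one of the two sets, which is what makes the subsets exhaustive; files the tools disagree on land in both.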
>
> For the first subset, I checked every non-ASCII character (using `LC_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where Unicode was unnecessarily used instead of pure ASCII, and I treated those files separately. Other than that, my inspection revealed no obvious encoding errors. This list comprised about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay.
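
As a rough Python equivalent of the grep-based check quoted above, the following sketch lists the lines of a file that contain any byte outside the ASCII range, and separately tests whether the content decodes as strict UTF-8. The function names are hypothetical; this is not the tooling used in the PR, and it does not replace the visual inspection described there.

```python
def non_ascii_lines(data: bytes):
    """Return (line_number, line) pairs for lines containing a byte > 0x7F.

    Lines are split on b"\\n" only and numbered from 1, as grep -n does.
    """
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        if any(byte > 0x7F for byte in line):
            hits.append((lineno, line))
    return hits


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Note that passing the strict decode only shows the bytes are well-formed UTF-8; text that was mis-converted earlier can still be valid UTF-8, which is why the flagged lines were also read by eye.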
>
> For the second subset, I checked every non-ASCII character (using the same method). This list contained more than 300 files. Most of them were okay as far as I can tell; I can confirm encodings for European languages 100%, but CJK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten...
src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11:
> 9: [Arabic sample text, mis-decoded in the archive]
> 10:
> 11: [Arabic sample text, mis-decoded in the archive]
Looks like most of the changes in java2d/* are related to spaces at the end of the line?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038746193