RFR: 8354266: Fix non-UTF-8 text encoding

Magnus Ihse Bursie ihse at openjdk.org
Fri Apr 11 10:27:40 UTC 2025


On Fri, 11 Apr 2025 03:35:11 GMT, Sergey Bylokhov <serb at openjdk.org> wrote:

>> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. 
>> 
>> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed.
>> 
>> Methodology used: 
>> 
>> I have run four different tools for using different heuristics for determining the encoding of a file:
>> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5)
>> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
>> * enca (targeted towards obscure code pages)
>> * libmagic / `file  --mime-encoding`
>> 
>> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further:
>> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>> 
>> From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets:
>> * All files where at least one tool claimed it to be UTF-8
>> * All files where at least one tool claimed it to be *not* UTF-8
>> 
>> For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay.
>> 
>> For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure...
>
> src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11:
> 
>> 9: تخصص الشفرة الموحدة "يونِكود" رقما وحيدا لكل محرف في جميع اللغات العالمية، وذلك بغض النظر عن نوع الحاسوب أو البرامج المستخدمة. وقد تـم تبني مواصفة "يونِكود" مــن قبـل قادة الصانعين لأنظمة الحواسيب فـي العالم، مثل شركات آي.بي.إم. (IBM)ØŒ أبـل (APPLE)ØŒ هِيـْولِـت بـاكـرد (Hewlett-Packard) ØŒ مايكروسوفت (Microsoft)ØŒ أوراكِـل (Oracle) ØŒ صن (Sun) وغيرها. كما أن المواصفات والمقاييس الحديثة (مثل لغة البرمجة "جافا" "JAVA" ولغة "إكس إم إل" "XML" التي تستخدم لبرمجة الانترنيت) تتطلب استخدام "يونِكود". علاوة على ذلك ØŒ فإن "يونِكود" هي الطـريـقـة الرسـمية لتطبيق المقيـاس الـعـالـمي إيزو ١٠Ù
 ¦Ù¤Ù¦  (ISO 10646) .
>> 10: 
>> 11: إن بزوغ مواصفة "يونِكود" وتوفُّر الأنظمة التي تستخدمه وتدعمه، يعتبر من أهم الاختراعات الحديثة في عولمة البرمجيات لجميع اللغات في العالم. وإن استخدام "يونِكود" في عالم الانترنيت سيؤدي إلى توفير كبير مقارنة مع استخدام المجموعات التقليدية للمحارف المشفرة. كما أن استخدام "يونِكود" سيُمكِّن المبرمج من كتابة البرنامج مرة واحدة، واستخدامه على أي نوع من الأجهزة أو الأنظمة، ولأي لغة أو دولة في العالم أينما كانت، دون الحاجة لإعادة البرمجة أو إجراء أي تعديل. وأخيرا، فإن استخدام "يونِكود" سيمكن البيانات من الانتقال عبر الأنظمة والأجهزة المختلفة دون أ
 ي خطورة لتحريفها، مهما تعددت الشركات الصانعة للأنظمة واللغات، والدول التي تمر من خلالها هذه البيانات.
> 
> Looks like most of the changes in java2d/* are related to spaces at the end of the line?

No, that are just incidental changes (see https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The actual change for the java2d files is the removal of the initial UTF-8 BOM. Github has a hard time showing this though, since the BOM is not visible.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039258980


More information about the hotspot-dev mailing list