RFR: 8354266: Fix non-UTF-8 text encoding
Eirik Bjørsnøs
eirbjo at openjdk.org
Thu Apr 10 19:09:35 UTC 2025
On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie <ihse at openjdk.org> wrote:
> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found.
>
> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed.
>
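As an aside: a UTF-8 BOM is easy to detect and strip mechanically. A rough, untested Java sketch, purely for illustration (the class name is made up, and this is not how the PR was prepared), that removes a leading BOM from a single file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    // Hypothetical helper (illustration only): removes a leading UTF-8 BOM
    // (the bytes EF BB BF) from the file given as the first argument.
    public class StripUtf8Bom {
        public static void main(String[] args) throws IOException {
            Path path = Path.of(args[0]);
            byte[] bytes = Files.readAllBytes(path);
            boolean hasBom = bytes.length >= 3
                    && bytes[0] == (byte) 0xEF
                    && bytes[1] == (byte) 0xBB
                    && bytes[2] == (byte) 0xBF;
            if (hasBom) {
                // Rewrite the file without the three BOM bytes; the rest of
                // the content is left untouched.
                Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
                System.out.println("Stripped UTF-8 BOM from " + path);
            }
        }
    }
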
> Methodology used:
>
> I have run four different tools that use different heuristics for determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file --mime-encoding`
>
> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further:
> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
>
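For illustration only, a small untested Java sketch of that kind of extension filter (class name made up, not part of the PR); it prints only the arguments that are not on the known-binary list:

    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.PathMatcher;

    // Hypothetical filter (illustration only) mirroring the extension list above.
    public class KnownBinaryFilter {
        private static final PathMatcher KNOWN_BINARY =
                FileSystems.getDefault().getPathMatcher("glob:**.{gif,png,ico,jpg,icns,"
                        + "tiff,wav,woff,woff2,jar,ttf,bmp,class,crt,jks,keystore,ks,db}");

        public static void main(String[] args) {
            for (String arg : args) {
                // Keep only files that still need an encoding check.
                if (!KNOWN_BINARY.matches(Path.of(arg))) {
                    System.out.println(arg);
                }
            }
        }
    }
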
> From the remaining list of non-ASCII, non-known-binary files I selected two overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
>
> For the first subset, I checked every non-ASCII character (using `LC_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where Unicode was unnecessarily used instead of pure ASCII, and I treated those files separately. Other than that, my inspection revealed no obvious encoding errors. This list comprised about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay.
>
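As a side note, the "is this valid UTF-8 at all" part of such a check can also be done programmatically. A rough, untested Java sketch, for illustration only (class name made up; not how this PR was verified), that flags files which do not decode cleanly as UTF-8:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical checker (illustration only): reports files containing byte
    // sequences that are not valid UTF-8.
    public class Utf8Check {
        public static void main(String[] args) throws IOException {
            for (String arg : args) {
                Path path = Path.of(arg);
                CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
                try {
                    decoder.decode(ByteBuffer.wrap(Files.readAllBytes(path)));
                } catch (CharacterCodingException e) {
                    System.out.println("Not valid UTF-8: " + path);
                }
            }
        }
    }
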
> For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay as far as I can tell; I can confirm encodings for European languages 100%, but CJK encodings could theoretically be wrong; they looked sane but I cannot read and confirm them fully. Several were in fact pure binary files, but without any telling exten...
LGTM.
There are some whitespace-related changes in this PR which seem okay, but have no mention in either the JBS issue or the PR description.
Perhaps a short mention of this intention in either place would be good for future historians.
(BTW, I enjoyed seeing separate commits for the encoding and BOM changes, which makes it easier to verify each!)
-------------
Marked as reviewed by eirbjo (Committer).
PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2758055634