RFR: 8354266: Fix non-UTF-8 text encoding

Naoto Sato naoto at openjdk.org
Thu Apr 10 17:41:25 UTC 2025


On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie <ihse at openjdk.org> wrote:

> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. 
> 
> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed.
> 
> Methodology used: 
> 
> I have run four different tools for using different heuristics for determining the encoding of a file:
> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5)
> * uchardet (a modern version by freedesktop, used by e.g. Firefox)
> * enca (targeted towards obscure code pages)
> * libmagic / `file  --mime-encoding`
> 
> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further:
> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db`
> 
> From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets:
> * All files where at least one tool claimed it to be UTF-8
> * All files where at least one tool claimed it to be *not* UTF-8
> 
> For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay.
> 
> For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten...

Marked as reviewed by naoto (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757716905


More information about the hotspot-dev mailing list