From naoto at openjdk.org Tue Apr 1 16:26:22 2025 From: naoto at openjdk.org (Naoto Sato) Date: Tue, 1 Apr 2025 16:26:22 GMT Subject: Integrated: 8353118: Deprecate the use of `java.locale.useOldISOCodes` system property In-Reply-To: References: Message-ID: On Fri, 28 Mar 2025 20:17:30 GMT, Naoto Sato wrote: > Proposing to remove the `java.locale.useOldISOCodes` system property. This property is for backward compatibility introduced back in JDK17 and I believe it is now fine to remove it. In this PR targeting JDK25, it emits a deprecate-for-removal warning on startup if the system property is set to true (no behavioral change except the warning). The plan is eventually to remove it after JDK25. A corresponding CSR has been drafted. This pull request has now been integrated. Changeset: 564066d5 Author: Naoto Sato URL: https://git.openjdk.org/jdk/commit/564066d549cf4ec7608f57ea4910b5813f7353c3 Stats: 23 lines in 3 files changed: 11 ins; 1 del; 11 mod 8353118: Deprecate the use of `java.locale.useOldISOCodes` system property Reviewed-by: iris, jlu ------------- PR: https://git.openjdk.org/jdk/pull/24302 From naoto at openjdk.org Tue Apr 1 16:26:22 2025 From: naoto at openjdk.org (Naoto Sato) Date: Tue, 1 Apr 2025 16:26:22 GMT Subject: RFR: 8353118: Deprecate the use of `java.locale.useOldISOCodes` system property In-Reply-To: References: Message-ID: On Fri, 28 Mar 2025 20:17:30 GMT, Naoto Sato wrote: > Proposing to remove the `java.locale.useOldISOCodes` system property. This property is for backward compatibility introduced back in JDK17 and I believe it is now fine to remove it. In this PR targeting JDK25, it emits a deprecate-for-removal warning on startup if the system property is set to true (no behavioral change except the warning). The plan is eventually to remove it after JDK25. A corresponding CSR has been drafted. Thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24302#issuecomment-2769911047 From jlu at openjdk.org Tue Apr 1 16:52:19 2025 From: jlu at openjdk.org (Justin Lu) Date: Tue, 1 Apr 2025 16:52:19 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate Message-ID: Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. ------------- Commit messages: - call 2-arg parse in example - init Changes: https://git.openjdk.org/jdk/pull/24361/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8353322 Stats: 20 lines in 1 file changed: 17 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/24361.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24361/head:pull/24361 PR: https://git.openjdk.org/jdk/pull/24361 From naoto at openjdk.org Tue Apr 1 18:22:15 2025 From: naoto at openjdk.org (Naoto Sato) Date: Tue, 1 Apr 2025 18:22:15 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate In-Reply-To: References: Message-ID: On Tue, 1 Apr 2025 16:45:26 GMT, Justin Lu wrote: > Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. 
The criteria for a match, as well as no match should be made clear. src/java.base/share/classes/java/text/ChoiceFormat.java line 571: > 569: * {@snippet lang=java : > 570: * var fmt = new ChoiceFormat("0#foo|1#bar|2#baz"); > 571: * fmt.parse("baz", new ParsePosition(0)); // returns 2 This returns `2.0`? src/java.base/share/classes/java/text/ChoiceFormat.java line 576: > 574: * > 575: * @implNote The {@code Number} subtype returned by the JDK reference > 576: * implementation of this method is always {@code Double}. Do we need to use `@implNote` here? Since choices are `double`s (as in the class description), I think we can safely say this returns a `Double` as in normative text. If some implementation returns an `Integer`, I think it is a bug. Returning a `Double.NaN` for no-match may be considered implNote though (one might throw an exception). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2023404649 PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2023462230 From jlu at openjdk.org Tue Apr 1 19:04:25 2025 From: jlu at openjdk.org (Justin Lu) Date: Tue, 1 Apr 2025 19:04:25 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v2] In-Reply-To: References: Message-ID: > Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. Justin Lu has updated the pull request incrementally with one additional commit since the last revision: reflect Naoto's review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24361/files - new: https://git.openjdk.org/jdk/pull/24361/files/faaa9b9c..24d57bb2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=00-01 Stats: 10 lines in 1 file changed: 0 ins; 3 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/24361.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24361/head:pull/24361 PR: https://git.openjdk.org/jdk/pull/24361 From jlu at openjdk.org Tue Apr 1 19:04:25 2025 From: jlu at openjdk.org (Justin Lu) Date: Tue, 1 Apr 2025 19:04:25 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v2] In-Reply-To: References: Message-ID: On Tue, 1 Apr 2025 18:17:12 GMT, Naoto Sato wrote: >> Justin Lu has updated the pull request incrementally with one additional commit since the last revision: >> >> reflect Naoto's review > > src/java.base/share/classes/java/text/ChoiceFormat.java line 576: > >> 574: * >> 575: * @implNote The {@code Number} subtype returned by the JDK reference >> 576: * implementation of this method is always {@code Double}. > > Do we need to use `@implNote` here? Since choices are `double`s (as in the class description), I think we can safely say this returns a `Double` as in normative text. If some implementation returns an `Integer`, I think it is a bug. Returning a `Double.NaN` for no-match may be considered implNote though (one might throw an exception). I was either way on the `implNote`, since I thought an implementation could decide to normalize a double limit to an integral type. However that's probably unlikely and I agree the wording can be fine as normative since ChoiceFormat is composed of doubles. 
I think it's best to make returning Double.NaN normative (i.e. not allow flexibility for throwing an exception). The `NumberFormat.parse(String, ParsePosition)` methods return a failure value instead of throwing like `parse(String)` does. (E.g. DecimalFormat returns null on failed parse for 2 arg parse.) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2023523992 From naoto at openjdk.org Tue Apr 1 19:42:20 2025 From: naoto at openjdk.org (Naoto Sato) Date: Tue, 1 Apr 2025 19:42:20 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v2] In-Reply-To: References: Message-ID: <_wstnNEYpUPVZk5cU_nvJjetseFzPNBAJLohGEAawGA=.965b06ff-d215-440c-b3bd-489244947550@github.com> On Tue, 1 Apr 2025 19:04:25 GMT, Justin Lu wrote: >> Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > reflect Naoto's review I am OK with returning `Double.NaN` as normative. I believe the risk is quite low, and it would be only a conformance issue (no practical problem will arise) src/java.base/share/classes/java/text/ChoiceFormat.java line 564: > 562: * {@code Double}. The value returned is the {@code limit} corresponding > 563: * to the {@code format} that is the longest substring of the input text. > 564: * Matching is done in ascending order, when multiple {@code formats} match Nit: {@code format}s src/java.base/share/classes/java/text/ChoiceFormat.java line 584: > 582: * first index of the character that caused the parse to fail. > 583: * @return A Number which represents the {@code limit} corresponding to the {@code > 584: * format} parsed. We could clarify the no match case with `Double.NaN` here too ------------- PR Review: https://git.openjdk.org/jdk/pull/24361#pullrequestreview-2733870693 PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2023570395 PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2023571566 From jlu at openjdk.org Tue Apr 1 20:32:45 2025 From: jlu at openjdk.org (Justin Lu) Date: Tue, 1 Apr 2025 20:32:45 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v3] In-Reply-To: References: Message-ID: > Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. 
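A small, self-contained sketch of the parse behaviour discussed in this thread. The pattern, class name and inputs are made up for illustration; the results follow the wording above (a matching format returns its limit as a Double, no match returns Double.NaN and sets the error index).

```java
import java.text.ChoiceFormat;
import java.text.ParsePosition;

public class ChoiceFormatParseSketch {
    public static void main(String[] args) {
        var fmt = new ChoiceFormat("0#foo|1#bar|2#baz");

        // A format that matches: the corresponding limit comes back as a Double
        Number match = fmt.parse("baz", new ParsePosition(0));
        System.out.println(match);               // 2.0 (a Double, not an int)

        // No format matches: Double.NaN is returned and the error index is set
        ParsePosition pos = new ParsePosition(0);
        Number noMatch = fmt.parse("qux", pos);
        System.out.println(noMatch);              // NaN
        System.out.println(pos.getErrorIndex());  // index where the parse failed
    }
}
```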
Justin Lu has updated the pull request incrementally with one additional commit since the last revision: Address further comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24361/files - new: https://git.openjdk.org/jdk/pull/24361/files/24d57bb2..d3864418 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/24361.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24361/head:pull/24361 PR: https://git.openjdk.org/jdk/pull/24361 From jlu at openjdk.org Tue Apr 1 20:37:07 2025 From: jlu at openjdk.org (Justin Lu) Date: Tue, 1 Apr 2025 20:37:07 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v2] In-Reply-To: <_wstnNEYpUPVZk5cU_nvJjetseFzPNBAJLohGEAawGA=.965b06ff-d215-440c-b3bd-489244947550@github.com> References: <_wstnNEYpUPVZk5cU_nvJjetseFzPNBAJLohGEAawGA=.965b06ff-d215-440c-b3bd-489244947550@github.com> Message-ID: On Tue, 1 Apr 2025 19:36:04 GMT, Naoto Sato wrote: >> Justin Lu has updated the pull request incrementally with one additional commit since the last revision: >> >> reflect Naoto's review > > src/java.base/share/classes/java/text/ChoiceFormat.java line 564: > >> 562: * {@code Double}. The value returned is the {@code limit} corresponding >> 563: * to the {@code format} that is the longest substring of the input text. >> 564: * Matching is done in ascending order, when multiple {@code formats} match > > Nit: {@code format}s Sounds good. Addressed the conformance issue possibility in the CSR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2023639927 From naoto at openjdk.org Tue Apr 1 22:56:12 2025 From: naoto at openjdk.org (Naoto Sato) Date: Tue, 1 Apr 2025 22:56:12 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v3] In-Reply-To: References: Message-ID: On Tue, 1 Apr 2025 20:32:45 GMT, Justin Lu wrote: >> Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Address further comments LGTM ------------- Marked as reviewed by naoto (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24361#pullrequestreview-2734200809 From alanb at openjdk.org Wed Apr 2 10:07:19 2025 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 2 Apr 2025 10:07:19 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v3] In-Reply-To: References: Message-ID: On Tue, 1 Apr 2025 20:32:45 GMT, Justin Lu wrote: >> Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. 
> > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Address further comments src/java.base/share/classes/java/text/ChoiceFormat.java line 562: > 560: /** > 561: * Parses a {@code Number} from the input text, the subtype of which is always > 562: * {@code Double}. The value returned is the {@code limit} corresponding I wonder if we could improve the first sentence, e.g. "Parses the input text from the parse position as a Double" ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2024515574 From jlu at openjdk.org Wed Apr 2 19:46:12 2025 From: jlu at openjdk.org (Justin Lu) Date: Wed, 2 Apr 2025 19:46:12 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v4] In-Reply-To: References: Message-ID: > Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. Justin Lu has updated the pull request incrementally with one additional commit since the last revision: Alan's review - Improve first sentence ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24361/files - new: https://git.openjdk.org/jdk/pull/24361/files/d3864418..0ffdef97 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24361&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24361.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24361/head:pull/24361 PR: https://git.openjdk.org/jdk/pull/24361 From jlu at openjdk.org Wed Apr 2 19:46:13 2025 From: jlu at openjdk.org (Justin Lu) Date: Wed, 2 Apr 2025 19:46:13 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v3] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 10:05:04 GMT, Alan Bateman wrote: >> Justin Lu has updated the pull request incrementally with one additional commit since the last revision: >> >> Address further comments > > src/java.base/share/classes/java/text/ChoiceFormat.java line 562: > >> 560: /** >> 561: * Parses a {@code Number} from the input text, the subtype of which is always >> 562: * {@code Double}. The value returned is the {@code limit} corresponding > > I wonder if we could improve the first sentence, e.g. "Parses the input text from the parse position as a Double" ? Right, I think we can make the sub-type wording simplification and should mention `ParsePosition`'s role in the method. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2025480425 From naoto at openjdk.org Wed Apr 2 19:50:53 2025 From: naoto at openjdk.org (Naoto Sato) Date: Wed, 2 Apr 2025 19:50:53 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v4] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 19:46:12 GMT, Justin Lu wrote: >> Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. 
> > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Alan's review - Improve first sentence Marked as reviewed by naoto (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24361#pullrequestreview-2737401960 From alanb at openjdk.org Thu Apr 3 06:25:50 2025 From: alanb at openjdk.org (Alan Bateman) Date: Thu, 3 Apr 2025 06:25:50 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v3] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 19:43:29 GMT, Justin Lu wrote: >> src/java.base/share/classes/java/text/ChoiceFormat.java line 562: >> >>> 560: /** >>> 561: * Parses a {@code Number} from the input text, the subtype of which is always >>> 562: * {@code Double}. The value returned is the {@code limit} corresponding >> >> I wonder if we could improve the first sentence, e.g. "Parses the input text from the parse position as a Double" ? > > Right, I think we can make the sub-type wording simplification and should mention `ParsePosition`'s role in the method. Thanks for the update, it reads much better now, no other comments from me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24361#discussion_r2026287478 From jlu at openjdk.org Fri Apr 4 21:29:18 2025 From: jlu at openjdk.org (Justin Lu) Date: Fri, 4 Apr 2025 21:29:18 GMT Subject: RFR: 8353713: Improve Currency.getInstance exception handling Message-ID: Please review this PR which improves some Currency `IllegalArgumentException`s by including the input in the message. This could be a currency code, country code, or locale. This change also includes tests to check the messages for an invalid country via the region override as well as an invalid country code within a 3 length currency code. ------------- Commit messages: - init Changes: https://git.openjdk.org/jdk/pull/24459/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24459&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8353713 Stats: 38 lines in 2 files changed: 13 ins; 0 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/24459.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24459/head:pull/24459 PR: https://git.openjdk.org/jdk/pull/24459 From naoto at openjdk.org Fri Apr 4 22:54:53 2025 From: naoto at openjdk.org (Naoto Sato) Date: Fri, 4 Apr 2025 22:54:53 GMT Subject: RFR: 8353713: Improve Currency.getInstance exception handling In-Reply-To: References: Message-ID: On Fri, 4 Apr 2025 21:25:00 GMT, Justin Lu wrote: > Please review this PR which improves some Currency `IllegalArgumentException`s by including the input in the message. This could be a currency code, country code, or locale. This change also includes tests to check the messages for an invalid country via the region override as well as an invalid country code within a 3 length currency code. Looks good. test/jdk/java/util/Currency/CurrencyTest.java line 102: > 100: IllegalArgumentException ex = assertThrows(IllegalArgumentException.class, () -> > 101: Currency.getInstance(badCode), "getInstance() did not throw IAE"); > 102: assertEquals("The country code: \"%s\" is not a valid ISO 3166 code" Since the test is not parameterized, we can simply use ".." inside the expected string literal. 
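A rough JUnit-style sketch of the kind of assertion being discussed. The test class, method name and the "QQQ" input are made up, and the exact message text is whatever the PR finally uses, so the check below only asserts that the rejected input (or its country part) shows up in the message.

```java
import java.util.Currency;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.assertTrue;

class CurrencyMessageSketch {

    @Test
    void invalidCodeIsEchoedInMessage() {
        String badCode = "QQQ"; // made-up code that Currency should reject
        IllegalArgumentException ex = assertThrows(IllegalArgumentException.class,
                () -> Currency.getInstance(badCode));
        // The improvement under review: the offending input appears in the message
        assertTrue(ex.getMessage().contains(badCode)
                || ex.getMessage().contains(badCode.substring(0, 2)));
    }
}
```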
------------- PR Review: https://git.openjdk.org/jdk/pull/24459#pullrequestreview-2744244252 PR Review Comment: https://git.openjdk.org/jdk/pull/24459#discussion_r2029528590 From jlu at openjdk.org Fri Apr 4 23:03:23 2025 From: jlu at openjdk.org (Justin Lu) Date: Fri, 4 Apr 2025 23:03:23 GMT Subject: RFR: 8353713: Improve Currency.getInstance exception handling [v2] In-Reply-To: References: Message-ID: > Please review this PR which improves some Currency `IllegalArgumentException`s by including the input in the message. This could be a currency code, country code, or locale. This change also includes tests to check the messages for an invalid country via the region override as well as an invalid country code within a 3 length currency code. Justin Lu has updated the pull request incrementally with one additional commit since the last revision: Naoto's review -> use str literal since not param test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24459/files - new: https://git.openjdk.org/jdk/pull/24459/files/e79241d0..dab5091b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24459&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24459&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/24459.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24459/head:pull/24459 PR: https://git.openjdk.org/jdk/pull/24459 From naoto at openjdk.org Mon Apr 7 16:30:49 2025 From: naoto at openjdk.org (Naoto Sato) Date: Mon, 7 Apr 2025 16:30:49 GMT Subject: RFR: 8353713: Improve Currency.getInstance exception handling [v2] In-Reply-To: References: Message-ID: On Fri, 4 Apr 2025 23:03:23 GMT, Justin Lu wrote: >> Please review this PR which improves some Currency `IllegalArgumentException`s by including the input in the message. This could be a currency code, country code, or locale. This change also includes tests to check the messages for an invalid country via the region override as well as an invalid country code within a 3 length currency code. > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Naoto's review -> use str literal since not param test Marked as reviewed by naoto (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24459#pullrequestreview-2747430868 From jlu at openjdk.org Mon Apr 7 20:48:17 2025 From: jlu at openjdk.org (Justin Lu) Date: Mon, 7 Apr 2025 20:48:17 GMT Subject: RFR: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate [v4] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 19:46:12 GMT, Justin Lu wrote: >> Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Alan's review - Improve first sentence Thanks for the reviews. 
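Picking the Currency.getInstance(Locale) case back up from the JDK-8353713 thread above, a minimal sketch. The "XX" country is a placeholder for an unsupported ISO 3166 code, and the region-override (-u-rg-) case mentioned in that PR description is not shown here.

```java
import java.util.Currency;
import java.util.Locale;

public class CurrencyLocaleSketch {
    public static void main(String[] args) {
        // "XX" stands in for a country code that Currency does not support
        Locale bogus = Locale.of("en", "XX");
        try {
            Currency.getInstance(bogus);
        } catch (IllegalArgumentException e) {
            // With the change under review, the message identifies the rejected input
            System.out.println(e.getMessage());
        }
    }
}
```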
------------- PR Comment: https://git.openjdk.org/jdk/pull/24361#issuecomment-2784586618 From jlu at openjdk.org Mon Apr 7 20:48:17 2025 From: jlu at openjdk.org (Justin Lu) Date: Mon, 7 Apr 2025 20:48:17 GMT Subject: Integrated: 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate In-Reply-To: References: Message-ID: On Tue, 1 Apr 2025 16:45:26 GMT, Justin Lu wrote: > Please review this PR which specifies the `ChoiceFormat#parse(String, ParsePosition)` method. A corresponding CSR is filed. The current specification is simply "Parses a Number from the input text" which does not indicate how the value is returned. The criteria for a match, as well as no match should be made clear. This pull request has now been integrated. Changeset: a8dfcf55 Author: Justin Lu URL: https://git.openjdk.org/jdk/commit/a8dfcf55849775a7ac4822a8b7661f20f1b33bb0 Stats: 17 lines in 1 file changed: 14 ins; 0 del; 3 mod 8353322: Specification of ChoiceFormat#parse(String, ParsePosition) is inadequate Reviewed-by: naoto ------------- PR: https://git.openjdk.org/jdk/pull/24361 From jlu at openjdk.org Tue Apr 8 17:40:24 2025 From: jlu at openjdk.org (Justin Lu) Date: Tue, 8 Apr 2025 17:40:24 GMT Subject: Integrated: 8353713: Improve Currency.getInstance exception handling In-Reply-To: References: Message-ID: On Fri, 4 Apr 2025 21:25:00 GMT, Justin Lu wrote: > Please review this PR which improves some Currency `IllegalArgumentException`s by including the input in the message. This could be a currency code, country code, or locale. This change also includes tests to check the messages for an invalid country via the region override as well as an invalid country code within a 3 length currency code. This pull request has now been integrated. Changeset: 5cac5796 Author: Justin Lu URL: https://git.openjdk.org/jdk/commit/5cac579619164b9a664327a4f71c4de7e7575276 Stats: 37 lines in 2 files changed: 12 ins; 0 del; 25 mod 8353713: Improve Currency.getInstance exception handling Reviewed-by: naoto ------------- PR: https://git.openjdk.org/jdk/pull/24459 From ihse at openjdk.org Wed Apr 9 15:09:58 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 9 Apr 2025 15:09:58 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> Message-ID: <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> On Thu, 11 May 2023 20:21:57 GMT, Justin Lu wrote: >> This PR converts Unicode sequences to UTF-8 native in .properties file. (Excluding the Unicode space and tab sequence). The conversion was done using native2ascii. >> >> In addition, the build logic is adjusted to support reading in the .properties files as UTF-8 during the conversion from .properties file to .java ListResourceBundle file. > > Justin Lu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - Convert the merged master changes to UTF-8 > - Merge master and fix conflicts > - Close streams when finished loading into props > - Adjust CF test to read in with UTF-8 to fix failing test > - Reconvert CS.properties to UTF-8 > - Revert all changes to CurrencySymbols.properties > - Bug6204853 should not be converted > - Copyright year for CompileProperties > - Redo translation for CS.properties > - Spot convert CurrencySymbols.properties > - ... 
and 6 more: https://git.openjdk.org/jdk/compare/4386d42d...f15b373a src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/Encodings.properties line 22: > 20: # Peter Smolik > 21: Cp1250 WINDOWS-1250 0x00FF > 22: # Patch attributed to havardw at underdusken.no (H?vard Wigtil) This does not seem to have been a correct conversion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2035582242 From jlu at openjdk.org Wed Apr 9 21:28:41 2025 From: jlu at openjdk.org (Justin Lu) Date: Wed, 9 Apr 2025 21:28:41 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6] In-Reply-To: <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> Message-ID: On Wed, 9 Apr 2025 15:06:32 GMT, Magnus Ihse Bursie wrote: >> Justin Lu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: >> >> - Convert the merged master changes to UTF-8 >> - Merge master and fix conflicts >> - Close streams when finished loading into props >> - Adjust CF test to read in with UTF-8 to fix failing test >> - Reconvert CS.properties to UTF-8 >> - Revert all changes to CurrencySymbols.properties >> - Bug6204853 should not be converted >> - Copyright year for CompileProperties >> - Redo translation for CS.properties >> - Spot convert CurrencySymbols.properties >> - ... and 6 more: https://git.openjdk.org/jdk/compare/4386d42d...f15b373a > > src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/Encodings.properties line 22: > >> 20: # Peter Smolik >> 21: Cp1250 WINDOWS-1250 0x00FF >> 22: # Patch attributed to havardw at underdusken.no (H?vard Wigtil) > > This does not seem to have been a correct conversion. Right, that `?` looks to have been incorrectly converted during the ISO-8859-1 to UTF-8 conversion. (I can't find the script used for conversion as this change is from some time ago.) Since the change occurs in a comment (thankfully), it should be harmless and the next upstream update of this file would overwrite this incorrect change. However, this file does not seem to be updated that often, so I can also file an issue to correct this if you would prefer that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2036165417 From ihse at openjdk.org Thu Apr 10 07:34:37 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 07:34:37 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> Message-ID: On Wed, 9 Apr 2025 21:26:15 GMT, Justin Lu wrote: >> src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/Encodings.properties line 22: >> >>> 20: # Peter Smolik >>> 21: Cp1250 WINDOWS-1250 0x00FF >>> 22: # Patch attributed to havardw at underdusken.no (H?vard Wigtil) >> >> This does not seem to have been a correct conversion. > > Right, that `?` looks to have been incorrectly converted during the ISO-8859-1 to UTF-8 conversion. (I can't find the script used for conversion as this change is from some time ago.) 
> > Since the change occurs in a comment (thankfully), it should be harmless and the next upstream update of this file would overwrite this incorrect change. However, this file does not seem to be updated that often, so I can also file an issue to correct this if you would prefer that. You don't have to do that, I'm working on an omnibus UTF-8 fixing PR right now, where I will include a fix for this as well. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2036695622 From ihse at openjdk.org Thu Apr 10 07:34:37 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 07:34:37 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> Message-ID: On Thu, 10 Apr 2025 07:31:37 GMT, Magnus Ihse Bursie wrote: >> Right, that `?` looks to have been incorrectly converted during the ISO-8859-1 to UTF-8 conversion. (I can't find the script used for conversion as this change is from some time ago.) >> >> Since the change occurs in a comment (thankfully), it should be harmless and the next upstream update of this file would overwrite this incorrect change. However, this file does not seem to be updated that often, so I can also file an issue to correct this if you would prefer that. > > You don't have to do that, I'm working on an omnibus UTF-8 fixing PR right now, where I will include a fix for this as well. If anything, I might be a bit worried that there are more incorrect conversions stemming from this PR, that my automated tools and manual scanning has not revealed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2036696723 From eirbjo at openjdk.org Thu Apr 10 08:10:42 2025 From: eirbjo at openjdk.org (Eirik =?UTF-8?B?QmrDuHJzbsO4cw==?=) Date: Thu, 10 Apr 2025 08:10:42 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6] In-Reply-To: References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> Message-ID: <6c6DqyCqyPonBZgUU8BpYJR3JQvMXjWm9ulq4SN25Do=.77775825-716d-4908-ae24-c4cf1ead78a5@github.com> On Thu, 10 Apr 2025 07:32:18 GMT, Magnus Ihse Bursie wrote: >> You don't have to do that, I'm working on an omnibus UTF-8 fixing PR right now, where I will include a fix for this as well. > > If anything, I might be a bit worried that there are more incorrect conversions stemming from this PR, that my automated tools and manual scanning has not revealed. Some observations: 1: This PR seems to have been abondoned, so perhaps this discussion belongs in #15694 ? 2: The `?` (Unicode 'Latin small letter a with ring above' U+00E5) was correctly encoded as 0xEF in ISO-8859-1 previous to this change. 
3: The conversion changed this `0xEF` to the three-byte sequence `ef bf bd` 4: This is as-if the file was incorrctly decoded using UTF-8, then encoded using UTF-8: byte[] origBytes = "?".getBytes(StandardCharsets.ISO_8859_1); String decoded = new String(origBytes, StandardCharsets.UTF_8); byte[] encoded = decoded.getBytes(StandardCharsets.UTF_8); String hex = HexFormat.of().formatHex(encoded); assertEquals("efbfbd", hex); ``` Like @magicus I'm worried that similar incorrect decoding could have been introduced by the same script in other files. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2036767319 From ihse at openjdk.org Thu Apr 10 08:38:38 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 08:38:38 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6] In-Reply-To: <6c6DqyCqyPonBZgUU8BpYJR3JQvMXjWm9ulq4SN25Do=.77775825-716d-4908-ae24-c4cf1ead78a5@github.com> References: <0MB7FLFNfaGEWssr9X54UJ_iZNFWBJkxQ1yusP7fsuY=.3f9f3de5-fe84-48e6-9449-626cac42da0b@github.com> <_YOUyzMbSEXFduCKVgyis37kwTlGSjBbP8VlFu3xQpU=.9b668e2a-8f91-476d-8914-13dc33a0b9e5@github.com> <6c6DqyCqyPonBZgUU8BpYJR3JQvMXjWm9ulq4SN25Do=.77775825-716d-4908-ae24-c4cf1ead78a5@github.com> Message-ID: On Thu, 10 Apr 2025 08:08:02 GMT, Eirik Bj?rsn?s wrote: >> If anything, I might be a bit worried that there are more incorrect conversions stemming from this PR, that my automated tools and manual scanning has not revealed. > > Some observations: > > 1: This PR seems to have been abondoned, so perhaps this discussion belongs in #15694 ? > > 2: The `?` (Unicode 'Latin small letter a with ring above' U+00E5) was correctly encoded as 0xEF in ISO-8859-1 previous to this change. > > 3: The conversion changed this `0xEF` to the three-byte sequence `ef bf bd` > > 4: This is as-if the file was incorrctly decoded using UTF-8, then encoded using UTF-8: > > > byte[] origBytes = "?".getBytes(StandardCharsets.ISO_8859_1); > String decoded = new String(origBytes, StandardCharsets.UTF_8); > byte[] encoded = decoded.getBytes(StandardCharsets.UTF_8); > String hex = HexFormat.of().formatHex(encoded); > assertEquals("efbfbd", hex); > ``` > > Like @magicus I'm worried that similar incorrect decoding could have been introduced by the same script in other files. > This PR seems to have been abondoned, so perhaps this discussion belongs in https://github.com/openjdk/jdk/pull/15694 ? Oh, I didn't notice this was supplanted by another PR. It might be better to continue there, yes. Even if closed PRs seldom are the best places to conduct discussions, I think it might be a good idea to scrutinize all files modified by this script. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2036820765 From ihse at openjdk.org Thu Apr 10 08:41:45 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 08:41:45 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: References: Message-ID: On Wed, 13 Sep 2023 17:38:28 GMT, Justin Lu wrote: >> JDK .properties files still use ISO-8859-1 encoding with escape sequences. It would improve readability to see the native characters instead of escape sequences (especially for the L10n process). The majority of files changed are localized resource files. >> >> This change converts the Unicode escape sequences in the JDK .properties files (both in src and test) to UTF-8 native characters. 
Additionally, the build logic is adjusted to read the .properties files in UTF-8 while generating the ListResourceBundle files. >> >> The only escape sequence not converted was `\u0020` as this is used to denote intentional trailing white space. (E.g. `key=This is the value:\u0020`) >> >> The conversion was done using native2ascii with options `-reverse -encoding UTF-8`. >> >> If this PR is integrated, the IDE default encoding for .properties files need to be updated to UTF-8. (IntelliJ IDEA locks .properties files as ISO-8859-1 unless manually changed). > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Replace InputStreamReader with BufferedReader Continuing the discussion that was started at a predecessor to this PR, https://github.com/openjdk/jdk/pull/12726#discussion_r2035582242. At least one incorrect conversion has been found in this PR. It might be worthwhile to double- and triple-check all the other conversions as well. As part of https://bugs.openjdk.org/browse/JDK-8301971 I am trying various ways of detecting files without UTF-8 encoding, but it is still a bit of hit and miss, since there are no surefire way of telling which encoding a file has, only heuristics. So finding and following up potential sources of error is important. ------------- PR Comment: https://git.openjdk.org/jdk/pull/15694#issuecomment-2791991649 PR Comment: https://git.openjdk.org/jdk/pull/15694#issuecomment-2791997157 From eirbjo at openjdk.org Thu Apr 10 08:48:37 2025 From: eirbjo at openjdk.org (Eirik =?UTF-8?B?QmrDuHJzbsO4cw==?=) Date: Thu, 10 Apr 2025 08:48:37 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: References: Message-ID: <0q0gTsqIsYtmzAfNYbBXksUXKdZh2uzQ9yvSETKAP88=.137372e6-d63e-4539-b196-4bd9ef1ddd16@github.com> On Wed, 13 Sep 2023 17:38:28 GMT, Justin Lu wrote: >> JDK .properties files still use ISO-8859-1 encoding with escape sequences. It would improve readability to see the native characters instead of escape sequences (especially for the L10n process). The majority of files changed are localized resource files. >> >> This change converts the Unicode escape sequences in the JDK .properties files (both in src and test) to UTF-8 native characters. Additionally, the build logic is adjusted to read the .properties files in UTF-8 while generating the ListResourceBundle files. >> >> The only escape sequence not converted was `\u0020` as this is used to denote intentional trailing white space. (E.g. `key=This is the value:\u0020`) >> >> The conversion was done using native2ascii with options `-reverse -encoding UTF-8`. >> >> If this PR is integrated, the IDE default encoding for .properties files need to be updated to UTF-8. (IntelliJ IDEA locks .properties files as ISO-8859-1 unless manually changed). 
> > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Replace InputStreamReader with BufferedReader FWIW, I checked out the revision of the commit previous to this change and found the following: % git checkout b55e418a077791b39992042411cde97f68dc39fe^ % find src -name "*.properties" | xargs file | grep -v ASCII src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/Encodings.properties: ISO-8859 text src/java.xml.crypto/share/classes/com/sun/org/apache/xml/internal/security/resource/xmlsecurity_de.properties: Unicode text, UTF-8 text, with very long lines (322) Which indicates that that this is the only non-ASCII, non-UTF-8 property file. So we may be lucky. ------------- PR Comment: https://git.openjdk.org/jdk/pull/15694#issuecomment-2792014164 From ihse at openjdk.org Thu Apr 10 09:45:56 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 09:45:56 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: References: Message-ID: On Wed, 13 Sep 2023 17:38:28 GMT, Justin Lu wrote: >> JDK .properties files still use ISO-8859-1 encoding with escape sequences. It would improve readability to see the native characters instead of escape sequences (especially for the L10n process). The majority of files changed are localized resource files. >> >> This change converts the Unicode escape sequences in the JDK .properties files (both in src and test) to UTF-8 native characters. Additionally, the build logic is adjusted to read the .properties files in UTF-8 while generating the ListResourceBundle files. >> >> The only escape sequence not converted was `\u0020` as this is used to denote intentional trailing white space. (E.g. `key=This is the value:\u0020`) >> >> The conversion was done using native2ascii with options `-reverse -encoding UTF-8`. >> >> If this PR is integrated, the IDE default encoding for .properties files need to be updated to UTF-8. (IntelliJ IDEA locks .properties files as ISO-8859-1 unless manually changed). > > Justin Lu has updated the pull request incrementally with one additional commit since the last revision: > > Replace InputStreamReader with BufferedReader Thanks for checking! ------------- PR Comment: https://git.openjdk.org/jdk/pull/15694#issuecomment-2792170460 From ihse at openjdk.org Thu Apr 10 10:18:13 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 10:18:13 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. 
Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497: > 495: /* > 496: The algorithm below is based on Intel publication: > 497: "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by Jim Guilford, Kirk Yap and Vinodh Gopal. Note: There is of course a unicode `?` symbol, which is what it was originally before it was botched here, but I found no reason to keep this, and in the spirit of JDK-8354213, I thought it better to use pure ASCII here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037012318 From ihse at openjdk.org Thu Apr 10 10:18:13 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 10:18:13 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding Message-ID: I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. Methodology used: I have run four different tools for using different heuristics for determining the encoding of a file: * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) * uchardet (a modern version by freedesktop, used by e.g. Firefox) * enca (targeted towards obscure code pages) * libmagic / `file --mime-encoding` They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. 
The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: * All files where at least one tool claimed it to be UTF-8 * All files where at least one tool claimed it to be *not* UTF-8 For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling extensions (most of these are in tests). The BOM files were only pointed out by chardetect; I did run an additional search for UTF-8 BOM markers over the code base to make sure I did not miss any others (since chardetect apart from this did a not-so-perfect job). The files included in this PR are what I actually found that had encoding errors or issues. ------------- Commit messages: - Remove UTF-8 BOM (byte-order mark) which is discouraged by the Unicode Consortium - Fix incorrect encoding Changes: https://git.openjdk.org/jdk/pull/24566/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24566&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8354266 Stats: 32 lines in 13 files changed: 0 ins; 2 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/24566.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24566/head:pull/24566 PR: https://git.openjdk.org/jdk/pull/24566 From ihse at openjdk.org Thu Apr 10 10:23:56 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 10:23:56 GMT Subject: RFR: 8354273: Restore even more pointless unicode characters to ASCII Message-ID: As a follow-up to [JDK-8354213](https://bugs.openjdk.org/browse/JDK-8354213), I found some additional places where unicode characters are unnecessarily used instead of pure ASCII. 
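As a rough Java counterpart to the grep shown in the methodology above, this sketch flags lines that contain bytes outside plain ASCII. It reads the file as ISO-8859-1 so that malformed UTF-8 cannot make the read itself fail; the class name and command-line handling are only illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class NonAsciiScanSketch {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);
        // ISO-8859-1 maps every byte to one char, so c > 0x7F means a non-ASCII byte
        List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.chars().anyMatch(c -> c > 0x7F)) {
                System.out.println(file + ":" + (i + 1) + ": " + line);
            }
        }
    }
}
```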
------------- Commit messages: - 8354273: Restore even more pointless unicode characters to ASCII Changes: https://git.openjdk.org/jdk/pull/24567/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24567&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8354273 Stats: 9 lines in 6 files changed: 0 ins; 1 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/24567.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24567/head:pull/24567 PR: https://git.openjdk.org/jdk/pull/24567 From ihse at openjdk.org Thu Apr 10 10:36:31 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 10:36:31 GMT Subject: RFR: 8354273: Restore even more pointless unicode characters to ASCII [v2] In-Reply-To: References: Message-ID: > As a follow-up to [JDK-8354213](https://bugs.openjdk.org/browse/JDK-8354213), I found some additional places where unicode characters are unnecessarily used instead of pure ASCII. Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: Remove incorrectly copied "?anchor" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24567/files - new: https://git.openjdk.org/jdk/pull/24567/files/d9527eb9..876708c2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24567&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24567&range=00-01 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24567.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24567/head:pull/24567 PR: https://git.openjdk.org/jdk/pull/24567 From ihse at openjdk.org Thu Apr 10 10:39:32 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 10:39:32 GMT Subject: RFR: 8354273: Restore even more pointless unicode characters to ASCII [v2] In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:36:31 GMT, Magnus Ihse Bursie wrote: >> As a follow-up to [JDK-8354213](https://bugs.openjdk.org/browse/JDK-8354213), I found some additional places where unicode characters are unnecessarily used instead of pure ASCII. > > Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: > > Remove incorrectly copied "?anchor" src/java.xml/share/legal/xmlxsd.md line 29: > 27: https://www.w3.org/copyright/software-license-2023/" > 28: > 29: Disclaimers ?anchor This is an incorrectly copied piece of html; compare how the very same license is handled in e.g. `src/java.xml/share/legal/schema10part1.md`. The ? is the non-ascii character that triggered my detection of this, but the entire "anchor" string is incorrect here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24567#discussion_r2037047696 From rgiulietti at openjdk.org Thu Apr 10 11:49:30 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 10 Apr 2025 11:49:30 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:14:40 GMT, Magnus Ihse Bursie wrote: >> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. >> >> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. 
>> >> Methodology used: >> >> I have run four different tools for using different heuristics for determining the encoding of a file: >> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) >> * uchardet (a modern version by freedesktop, used by e.g. Firefox) >> * enca (targeted towards obscure code pages) >> * libmagic / `file --mime-encoding` >> >> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: >> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >> >> From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: >> * All files where at least one tool claimed it to be UTF-8 >> * All files where at least one tool claimed it to be *not* UTF-8 >> >> For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. >> >> For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure... > > src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 497: > >> 495: /* >> 496: The algorithm below is based on Intel publication: >> 497: "Fast SHA-256 Implementations on Intel(R) Architecture Processors" by Jim Guilford, Kirk Yap and Vinodh Gopal. > > Note: There is of course a unicode `?` symbol, which is what it was originally before it was botched here, but I found no reason to keep this, and in the spirit of JDK-8354213, I thought it better to use pure ASCII here. I guess the difference at L.1 in the various files is just the BOM? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037161789 From ihse at openjdk.org Thu Apr 10 13:17:24 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 13:17:24 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 11:46:45 GMT, Raffaello Giulietti wrote: > I guess the difference at L.1 in the various files is just the BOM? Yes. 
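For reference, the "difference at line 1" agreed on above is the three-byte UTF-8 BOM (EF BB BF, the UTF-8 encoding of U+FEFF). The sketch below, with a made-up class name and no backup handling, just detects that prefix and rewrites the file without it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StripUtf8BomSketch {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);
        byte[] bytes = Files.readAllBytes(file);
        // EF BB BF at the start of the file is the UTF-8 BOM being removed
        if (bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF) {
            Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
            System.out.println("Stripped UTF-8 BOM from " + file);
        }
    }
}
```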
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037357899 From rgiulietti at openjdk.org Thu Apr 10 13:56:42 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 10 Apr 2025 13:56:42 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... I only checked these 13 files to be UTF-8 encoded and without BOM. ------------- Marked as reviewed by rgiulietti (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2756936848 From naoto at openjdk.org Thu Apr 10 17:12:26 2025 From: naoto at openjdk.org (Naoto Sato) Date: Thu, 10 Apr 2025 17:12:26 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. 
> > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... src/java.desktop/share/legal/lcms.md line 72: > 70: Mateusz Jurczyk (Google) > 71: Paul Miller > 72: S?bastien L?on I cannot comment on capitalization here, but if we wanted to lowercase them, should they be e-grave instead of e-acute? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037895884 From rgiulietti at openjdk.org Thu Apr 10 17:26:30 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 10 Apr 2025 17:26:30 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 17:09:27 GMT, Naoto Sato wrote: >> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. >> >> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. 
>> >> Methodology used: >> >> I have run four different tools for using different heuristics for determining the encoding of a file: >> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) >> * uchardet (a modern version by freedesktop, used by e.g. Firefox) >> * enca (targeted towards obscure code pages) >> * libmagic / `file --mime-encoding` >> >> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: >> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >> >> From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: >> * All files where at least one tool claimed it to be UTF-8 >> * All files where at least one tool claimed it to be *not* UTF-8 >> >> For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. >> >> For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure... > > src/java.desktop/share/legal/lcms.md line 72: > >> 70: Mateusz Jurczyk (Google) >> 71: Paul Miller >> 72: S?bastien L?on > > I cannot comment on capitalization here, but if we wanted to lowercase them, should they be e-grave instead of e-acute? If this is a French name, it's e acute: ?. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2037917708 From erikj at openjdk.org Thu Apr 10 17:37:26 2025 From: erikj at openjdk.org (Erik Joelsson) Date: Thu, 10 Apr 2025 17:37:26 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: <4fRjwM-P0XuOWk9QjYl9zji51zLn7wwsFKlo7tJt3JM=.976560e0-39c6-4633-bc8d-279deb1ebea3@github.com> On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. 
> > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... Marked as reviewed by erikj (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757703868 From naoto at openjdk.org Thu Apr 10 17:41:25 2025 From: naoto at openjdk.org (Naoto Sato) Date: Thu, 10 Apr 2025 17:41:25 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: <7hmmP0I0kH0UiF8cV-CkNnpdQFkddrt3TYEkFltoj8U=.3bf6bcbf-3771-4628-82e0-f678f7366d8a@github.com> On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. 
Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... Marked as reviewed by naoto (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2757716905 From eirbjo at openjdk.org Thu Apr 10 18:33:26 2025 From: eirbjo at openjdk.org (Eirik =?UTF-8?B?QmrDuHJzbsO4cw==?=) Date: Thu, 10 Apr 2025 18:33:26 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 17:23:37 GMT, Raffaello Giulietti wrote: > If this is a French name, it's e acute: ?. Supported by this Wikipedia page listing S.L as an LCMS developer: https://en.wikipedia.org/wiki/Little_CMS ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038022994 From eirbjo at openjdk.org Thu Apr 10 18:45:28 2025 From: eirbjo at openjdk.org (Eirik =?UTF-8?B?QmrDuHJzbsO4cw==?=) Date: Thu, 10 Apr 2025 18:45:28 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. 
Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... src/java.desktop/share/legal/lcms.md line 103: > 101: Tim Zaman > 102: Amir Montazery and Open Source Technology Improvement Fund (ostif.org), Google, for fuzzer fundings. > 103: ``` This introduces an empty trailing line. I see you have removed trailing whitespace elsewhere. Was this intentional, to avoid the file ending with the three ticks? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038071768 From jlu at openjdk.org Thu Apr 10 18:47:53 2025 From: jlu at openjdk.org (Justin Lu) Date: Thu, 10 Apr 2025 18:47:53 GMT Subject: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v2] In-Reply-To: <0q0gTsqIsYtmzAfNYbBXksUXKdZh2uzQ9yvSETKAP88=.137372e6-d63e-4539-b196-4bd9ef1ddd16@github.com> References: <0q0gTsqIsYtmzAfNYbBXksUXKdZh2uzQ9yvSETKAP88=.137372e6-d63e-4539-b196-4bd9ef1ddd16@github.com> Message-ID: <9aQcWun5KNgHgELVwkc3478_RtqfhRL1Cxvyn2Yl0Nw=.07ee596f-e738-4796-8d27-14621ed8860c@github.com> On Thu, 10 Apr 2025 08:44:28 GMT, Eirik Bj?rsn?s wrote: >> Justin Lu has updated the pull request incrementally with one additional commit since the last revision: >> >> Replace InputStreamReader with BufferedReader > > FWIW, I checked out the revision of the commit previous to this change and found the following: > > > % git checkout b55e418a077791b39992042411cde97f68dc39fe^ > % find src -name "*.properties" | xargs file | grep -v ASCII > src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/Encodings.properties: > ISO-8859 text > src/java.xml.crypto/share/classes/com/sun/org/apache/xml/internal/security/resource/xmlsecurity_de.properties: > Unicode text, UTF-8 text, with very long lines (322) > > > Which indicates that that this is the only non-ASCII, non-UTF-8 property file. 
So we may be lucky. This conversion was performed under the assumption of ASCII set and Unicode escape sequences, which is the format we expect for the translation process for .properties files. That file should have been omitted from this change. Thank you @eirbjo and @magicus for the analysis and checking! ------------- PR Comment: https://git.openjdk.org/jdk/pull/15694#issuecomment-2794828598 From eirbjo at openjdk.org Thu Apr 10 19:09:35 2025 From: eirbjo at openjdk.org (Eirik =?UTF-8?B?QmrDuHJzbsO4cw==?=) Date: Thu, 10 Apr 2025 19:09:35 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... LGTM. There are some whitespace related changes in this PR which seem okay, but have no mention in either the JBS or PR description. Perhaps a short mention of this intention in either place would be good for future historians.
(BTW, I enjoyed seeing separate commits for the encoding and BOM changes, makes it easier to verify each!) ------------- Marked as reviewed by eirbjo (Committer). PR Review: https://git.openjdk.org/jdk/pull/24566#pullrequestreview-2758055634 From ihse at openjdk.org Thu Apr 10 21:28:31 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 21:28:31 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... The whitespace changes are my editor removing whitespace at the end of lines. This is a thing we enforce for many file types, but the check does not yet formally include .txt files. I have been working, from time to time, on extending the set of files covered by this check, so in general I have not tried to circumvent my editor when it strips trailing whitespace, even for files where jcheck does not yet require it.
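For context, the kind of check discussed above can be sketched in a few lines of Java; this is only an illustration of a trailing-whitespace scan (the file name is a placeholder), not jcheck's actual implementation:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class TrailingWhitespaceCheck {
        public static void main(String[] args) throws IOException {
            Path file = Path.of("somefile.txt"); // placeholder path
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            for (int i = 0; i < lines.size(); i++) {
                // Flag lines that end in a space or a tab character.
                if (lines.get(i).endsWith(" ") || lines.get(i).endsWith("\t")) {
                    System.out.println(file + ":" + (i + 1) + ": trailing whitespace");
                }
            }
        }
    }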
------------- PR Comment: https://git.openjdk.org/jdk/pull/24566#issuecomment-2795201480 From ihse at openjdk.org Thu Apr 10 21:28:32 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 21:28:32 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: <1IvhgoM9LMGg7s2kq_N0V7F1GCh-xFBnauZ9Ajk2Txo=.672329ea-e4c9-437c-a8b7-0502a9fdd414@github.com> On Thu, 10 Apr 2025 19:06:35 GMT, Eirik Bjørsnås wrote: > (BTW, I enjoyed seeing separate commits for the encoding and BOM changes, makes it easier to verify each!) Thanks! I very much like reviewing PRs that have separate logical commits, so I try to produce such commits myself. I'm glad to hear it was appreciated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24566#issuecomment-2795203125 From ihse at openjdk.org Thu Apr 10 21:28:32 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Thu, 10 Apr 2025 21:28:32 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 18:30:22 GMT, Eirik Bjørsnås wrote: >> If this is a French name, it's e acute: é. > >> If this is a French name, it's e acute: é. > > Supported by this Wikipedia page listing S.L as an LCMS developer: > > https://en.wikipedia.org/wiki/Little_CMS It's not a mistake in capitalization, it's a mistake for two different characters in two different encodings. (Probably iso-8859-1 mistaken as ansi iirc.) I verified the developer's name against the original file in the LCMS repo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038362034 From rriggs at openjdk.org Thu Apr 10 22:10:43 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Thu, 10 Apr 2025 22:10:43 GMT Subject: RFR: 8354335: No longer deprecate wrapper class constructors for removal Message-ID: Remove forRemoval = true from @Deprecated annotation of Boolean, Byte, Character, Double, Float, Integer, Long, Short. And add `SuppressWarnings("deprecation") `where needed; and remove `SuppressWarnings("removal")` ------------- Commit messages: - 8354335: No longer deprecate wrapper class constructors for removal Changes: https://git.openjdk.org/jdk/pull/24586/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24586&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8354335 Stats: 23 lines in 9 files changed: 0 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/24586.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24586/head:pull/24586 PR: https://git.openjdk.org/jdk/pull/24586 From liach at openjdk.org Thu Apr 10 23:43:23 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 10 Apr 2025 23:43:23 GMT Subject: RFR: 8354335: No longer deprecate wrapper class constructors for removal In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 22:05:04 GMT, Roger Riggs wrote: > Remove forRemoval = true from @Deprecated annotation of Boolean, Byte, Character, Double, Float, Integer, Long, Short. > And add `SuppressWarnings("deprecation") `where needed; and remove `SuppressWarnings("removal")` The wrapper classes and MemberName changes look good. ------------- Marked as reviewed by liach (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/24586#pullrequestreview-2758769422 From serb at openjdk.org Fri Apr 11 03:37:29 2025 From: serb at openjdk.org (Sergey Bylokhov) Date: Fri, 11 Apr 2025 03:37:29 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11: > 9: ???????? ???????????? ?????????????? "??????????????" ???????? ?????????? ?????? ???????? ???? ???????? ???????????? ?????????????????? ???????? ?????? ?????????? ???? ?????? ?????????????? ???? ?????????????? ??????????????????. ?????? ?????? ???????? ???????????? "??????????????" ???????? ???????? ???????? ???????????????? ???????????? ???????????????? ?????? ?????????????? ?????? ?????????? ????.????.????. (IBM)?? ???????? (APPLE)?? ???????????????????? ?????????????? (Hewlett-Packard) ?? ???????????????????? (Microsoft)?? ???????????????? (Oracle) ?? ???? 
(Sun) ????????????. ?????? ???? ?????????????????? ?????????????????? ?????????????? (?????? ?????? ?????????????? "????????" "JAVA" ???????? "?????? ???? ????" "XML" ???????? ???????????? ???????????? ??????????????????) ?????????? ?????????????? "??????????????". ?????????? ?????? ?????? ?? ?????? "??????????????" ???? ???????????????????? ???????????????? ???????????? ???????????????? ???????????????????? ???????? ?????? ???? (ISO 10646) . > 10: > 11: ???? ???????? ???????????? "??????????????" ?????????????? ?????????????? ???????? ?????????????? ?????????????? ?????????? ???? ?????? ???????????????????? ?????????????? ???? ?????????? ?????????????????? ?????????? ???????????? ???? ????????????. ?????? ?????????????? "??????????????" ???? ???????? ?????????????????? ?????????? ?????? ?????????? ???????? ???????????? ???? ?????????????? ?????????????????? ?????????????????? ?????????????? ??????????????. ?????? ???? ?????????????? "??????????????" ???????????????? ?????????????? ???? ?????????? ???????????????? ?????? ???????????? ?????????????????? ?????? ???? ?????? ???? ?????????????? ???? ???????????????? ???????? ?????? ???? ???????? ???? ???????????? ?????????? ?????????? ?????? ???????????? ???????????? ?????????????? ???? ?????????? ???? ??????????. ?????????????? ?????? ?????????????? "??????????????" ?????????? ???????????????? ???? ???????????????? ?????? ?????????????? ???????????????? ???????????????? ?????? ??? ? ?????????? ?????????????????? ???????? ?????????? ?????????????? ?????????????? ?????????????? ???????????????? ???????????? ???????? ?????? ???? ???????????? ?????? ????????????????. Looks like most of the changes in java2d/* are related to spaces at the end of the line? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2038746193 From ihse at openjdk.org Fri Apr 11 10:27:40 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Fri, 11 Apr 2025 10:27:40 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Fri, 11 Apr 2025 03:35:11 GMT, Sergey Bylokhov wrote: >> I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. >> >> BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. >> >> Methodology used: >> >> I have run four different tools for using different heuristics for determining the encoding of a file: >> * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) >> * uchardet (a modern version by freedesktop, used by e.g. Firefox) >> * enca (targeted towards obscure code pages) >> * libmagic / `file --mime-encoding` >> >> They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. 
To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: >> * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` >> >> From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: >> * All files where at least one tool claimed it to be UTF-8 >> * All files where at least one tool claimed it to be *not* UTF-8 >> >> For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. >> >> For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure... > > src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11: > >> 9: ???????? ???????????? ?????????????? "??????????????" ???????? ?????????? ?????? ???????? ???? ???????? ???????????? ?????????????????? ???????? ?????? ?????????? ???? ?????? ?????????????? ???? ?????????????? ??????????????????. ?????? ?????? ???????? ???????????? "??????????????" ???????? ???????? ???????? ???????????????? ???????????? ???????????????? ?????? ?????????????? ?????? ?????????? ????.????.????. (IBM)?? ???????? (APPLE)?? ???????????????????? ?????????????? (Hewlett-Packard) ?? ???????????????????? (Microsoft)?? ???????????????? (Oracle) ?? ???? (Sun) ????????????. ?????? ???? ?????????????????? ?????????????????? ?????????????? (?????? ?????? ?????????????? "????????" "JAVA" ???????? "?????? ???? ????" "XML" ???????? ???????????? ???????????? ??????????????????) ?????????? ?????????????? "??????????????". ?????????? ?????? ?????? ?? ?????? "??????????????" ???? ???????????????????? ???????????????? ???????????? ???????????????? ???????????????????? ???????? ????? ????? (ISO 10646) . >> 10: >> 11: ???? ???????? ???????????? "??????????????" ?????????????? ?????????????? ???????? ?????????????? ?????????????? ?????????? ???? ?????? ???????????????????? ?????????????? ???? ?????????? ?????????????????? ?????????? ???????????? ???? ????????????. ?????? ?????????????? "??????????????" ???? ???????? ?????????????????? ?????????? ?????? ?????????? ???????? ???????????? ???? ?????????????? ?????????????????? ?????????????????? ?????????????? ??????????????. ?????? ???? ?????????????? "??????????????" ???????????????? ?????????????? ???? ?????????? ???????????????? ?????? ???????????? ?????????????????? ?????? ???? ?????? ???? ?????????????? ???? ???????????????? ???????? ?????? ???? ???????? ???? ???????????? ?????????? ?????????? ?????? ???????????? ???????????? ?????????????? ???? ?????????? ???? ??????????. ?????????????? ?????? ?????????????? "??????????????" ?????????? ???????????????? ???? ???????????????? ?????? ?????????????? ???????????????? ???????????????? ?????? ?? ?? ?????????? ?????????????????? ???????? ?????????? ?????????????? ?????????????? 
?????????????? ???????????????? ???????????? ???????? ?????? ???? ???????????? ?????? ????????????????. > > Looks like most of the changes in java2d/* are related to spaces at the end of the line? No, that are just incidental changes (see https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The actual change for the java2d files is the removal of the initial UTF-8 BOM. Github has a hard time showing this though, since the BOM is not visible. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039258980 From eirbjo at openjdk.org Fri Apr 11 10:27:40 2025 From: eirbjo at openjdk.org (Eirik =?UTF-8?B?QmrDuHJzbsO4cw==?=) Date: Fri, 11 Apr 2025 10:27:40 GMT Subject: RFR: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Fri, 11 Apr 2025 10:21:32 GMT, Magnus Ihse Bursie wrote: >> src/demo/share/java2d/J2DBench/resources/textdata/arabic.ut8.txt line 11: >> >>> 9: ???????? ???????????? ?????????????? "??????????????" ???????? ?????????? ?????? ???????? ???? ???????? ???????????? ?????????????????? ???????? ?????? ?????????? ???? ?????? ?????????????? ???? ?????????????? ??????????????????. ?????? ?????? ???????? ???????????? "??????????????" ???????? ???????? ???????? ???????????????? ???????????? ???????????????? ?????? ?????????????? ?????? ?????????? ????.????.????. (IBM)?? ???????? (APPLE)?? ???????????????????? ?????????????? (Hewlett-Packard) ?? ???????????????????? (Microsoft)?? ???????????????? (Oracle) ?? ???? (Sun) ????????????. ?????? ???? ?????????????????? ?????????????????? ?????????????? (?????? ?????? ?????????????? "????????" "JAVA" ???????? "?????? ???? ????" "XML" ???????? ???????????? ???????????? ??????????????????) ?????????? ?????????????? "??????????????". ?????????? ?????? ?????? ?? ?????? "??????????????" ???? ???????????????????? ???????????????? ???????????? ???????????????? ???????????????????? ???????? ???? ?????? (ISO 10646) . >>> 10: >>> 11: ???? ???????? ???????????? "??????????????" ?????????????? ?????????????? ???????? ?????????????? ?????????????? ?????????? ???? ?????? ???????????????????? ?????????????? ???? ?????????? ?????????????????? ?????????? ???????????? ???? ????????????. ?????? ?????????????? "??????????????" ???? ???????? ?????????????????? ?????????? ?????? ?????????? ???????? ???????????? ???? ?????????????? ?????????????????? ?????????????????? ?????????????? ??????????????. ?????? ???? ?????????????? "??????????????" ???????????????? ?????????????? ???? ?????????? ???????????????? ?????? ???????????? ?????????????????? ?????? ???? ?????? ???? ?????????????? ???? ???????????????? ???????? ?????? ???? ???????? ???? ???????????? ?????????? ?????????? ?????? ???????????? ???????????? ?????????????? ???? ?????????? ???? ??????????. ?????????????? ?????? ?????????????? "??????????????" ?????????? ???????????????? ???? ???????????????? ?????? ?????????????? ???????????????? ???????????????? ?????? ? ??? ?????????? ?????????????????? ???????? ?????????? ?????????????? ?????????????? ?????????????? ???????????????? ???????????? ???????? ?????? ???? ???????????? ?????? ????????????????. >> >> Looks like most of the changes in java2d/* are related to spaces at the end of the line? > > No, that are just incidental changes (see https://github.com/openjdk/jdk/pull/24566#issuecomment-2795201480). The actual change for the java2d files is the removal of the initial UTF-8 BOM. Github has a hard time showing this though, since the BOM is not visible. 
I found the side-by-side diff in IntelliJ useful here, as it said "UTF-8 BOM" vs. "UTF-8". ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24566#discussion_r2039263227 From ihse at openjdk.org Fri Apr 11 10:27:40 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Fri, 11 Apr 2025 10:27:40 GMT Subject: Integrated: 8354266: Fix non-UTF-8 text encoding In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:10:49 GMT, Magnus Ihse Bursie wrote: > I have checked the entire code base for incorrect encodings, but luckily enough these were the only remaining problems I found. > > BOM (byte-order mark) is a method used for distinguishing big and little endian UTF-16 encodings. There is a special UTF-8 BOM, but it is discouraged. In the words of the Unicode Consortium: "Use of a BOM is neither required nor recommended for UTF-8". We have UTF-8 BOMs in a handful of files. These should be removed. > > Methodology used: > > I have run four different tools for using different heuristics for determining the encoding of a file: > * chardetect (the original, slow-as-molasses Perl program, which also had the worst performing heuristics of all; I'll rate it 1/5) > * uchardet (a modern version by freedesktop, used by e.g. Firefox) > * enca (targeted towards obscure code pages) > * libmagic / `file --mime-encoding` > > They all agreed on pure ASCII files (which is easy to check), and these I just ignored/accepted as good. The handling of pure binary files differed between the tools; most detected them as binary but some suggested arcane encodings for specific (often small) binary files. To keep my sanity, I decided that files ending in any of these extensions were binary, and I did not check them further: > * `gif|png|ico|jpg|icns|tiff|wav|woff|woff2|jar|ttf|bmp|class|crt|jks|keystore|ks|db` > > From the remaining list of non-ascii, non-known-binary files I selected two overlapping and exhaustive subsets: > * All files where at least one tool claimed it to be UTF-8 > * All files where at least one tool claimed it to be *not* UTF-8 > > For the first subset, I checked every non-ASCII character (using `C_ALL=C ggrep -H --color='auto' -P -n "[^\x00-\x7F]" $(cat names-of-files-to-check.txt)`, and visually examining the results). At this stage, I found several files where unicode were unnecessarily used instead of pure ASCII, and I treated those files separately. Other from that, my inspection revealed no obvious encoding errors. This list comprised of about 2000 files, so I did not spend too much time on each file. The assumption, after all, was that these files are okay. > > For the second subset, I checked every non-ASCII character (using the same method). This list was about 300+ files. Most of them were okay far as I can tell; I can confirm encodings for European languages 100%, but JCK encodings could theoretically be wrong; they looked sane but I cannot read and confirm fully. Several were in fact pure binary files, but without any telling exten... This pull request has now been integrated. 
Changeset: d4e194bc Author: Magnus Ihse Bursie URL: https://git.openjdk.org/jdk/commit/d4e194bc463ff3ad09e55cbb96bea00283679ce6 Stats: 32 lines in 13 files changed: 0 ins; 2 del; 30 mod 8354266: Fix non-UTF-8 text encoding Reviewed-by: rgiulietti, erikj, naoto, eirbjo ------------- PR: https://git.openjdk.org/jdk/pull/24566 From naoto at openjdk.org Fri Apr 11 17:08:26 2025 From: naoto at openjdk.org (Naoto Sato) Date: Fri, 11 Apr 2025 17:08:26 GMT Subject: RFR: 8343157: Examine large files for character encoding/decoding Message-ID: Removing old charset test cases that verify new charset implementations (as of JDK7). Removed tests/files are actual charset implementations used in pre-JDK7, which have been used for comparing the results. Since those "new" implementations have been used since then, I believe it is OK to retire those old test cases. ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/24597/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24597&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8343157 Stats: 164679 lines in 55 files changed: 0 ins; 164677 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24597.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24597/head:pull/24597 PR: https://git.openjdk.org/jdk/pull/24597 From bchristi at openjdk.org Fri Apr 11 20:17:26 2025 From: bchristi at openjdk.org (Brent Christian) Date: Fri, 11 Apr 2025 20:17:26 GMT Subject: RFR: 8354335: No longer deprecate wrapper class constructors for removal In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 22:05:04 GMT, Roger Riggs wrote: > Remove forRemoval = true from @Deprecated annotation of Boolean, Byte, Character, Double, Float, Integer, Long, Short. > And add `SuppressWarnings("deprecation") `where needed; and remove `SuppressWarnings("removal")` LGTM ------------- Marked as reviewed by bchristi (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24586#pullrequestreview-2761490305 From iris at openjdk.org Fri Apr 11 20:24:25 2025 From: iris at openjdk.org (Iris Clark) Date: Fri, 11 Apr 2025 20:24:25 GMT Subject: RFR: 8354335: No longer deprecate wrapper class constructors for removal In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 22:05:04 GMT, Roger Riggs wrote: > Remove forRemoval = true from @Deprecated annotation of Boolean, Byte, Character, Double, Float, Integer, Long, Short. > And add `SuppressWarnings("deprecation") `where needed; and remove `SuppressWarnings("removal")` Marked as reviewed by iris (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/24586#pullrequestreview-2761501284 From alanb at openjdk.org Sat Apr 12 05:51:33 2025 From: alanb at openjdk.org (Alan Bateman) Date: Sat, 12 Apr 2025 05:51:33 GMT Subject: RFR: 8343157: Examine large files for character encoding/decoding In-Reply-To: References: Message-ID: On Fri, 11 Apr 2025 17:02:13 GMT, Naoto Sato wrote: > Removing old charset test cases that verify new charset implementations (as of JDK7). Removed tests/files are actual charset implementations used in pre-JDK7, which have been used for comparing the results. Since those "new" implementations have been used since then, I believe it is OK to retire those old test cases. Okay to delete, no real value keeping these. ------------- Marked as reviewed by alanb (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/24597#pullrequestreview-2762069633 From ihse at openjdk.org Sun Apr 13 22:50:37 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Sun, 13 Apr 2025 22:50:37 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 Message-ID: This is a WIP to move the JDK source code base to fully UTF-8, and to ensure tools know about this. ------------- Commit messages: - Fix flags for Windows - Mark java and native source code as utf-8 - Don't convert properties files to iso-8859-1. - Tell tools we use utf-8 - Replace iso-8859-1 encodings with utf-8 in source code - Explain reason for non-UTF-8 character in JDK_RCFLAGS Changes: https://git.openjdk.org/jdk/pull/24574/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24574&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8301971 Stats: 130 lines in 8 files changed: 17 ins; 103 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/24574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24574/head:pull/24574 PR: https://git.openjdk.org/jdk/pull/24574 From ihse at openjdk.org Sun Apr 13 22:58:26 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Sun, 13 Apr 2025 22:58:26 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 In-Reply-To: References: Message-ID: <0io2A_4xFMiR8rwbXPPyYyXar_fwE1jG4K81pY_heUU=.18d9f809-dafc-4900-82fa-6478eb50b8de@github.com> On Thu, 10 Apr 2025 14:28:02 GMT, Magnus Ihse Bursie wrote: > Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. > > The fix is basically simple, and includes the following steps: > * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already > * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). I would like to run proper tests to verify the changes in libjava, but I don't know what tests that would be. If anyone can enlighten me, please do. (I suspect that the code did not really work properly before, and that the specially encoded characters were not thoroughly tested, but I could be wrong.) ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2800165519 From ihse at openjdk.org Sun Apr 13 23:14:41 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Sun, 13 Apr 2025 23:14:41 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v2] In-Reply-To: References: Message-ID: > Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. > > The fix is basically simple, and includes the following steps: > * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already > * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc).
Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: Also tell javadoc that we have utf-8 now ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24574/files - new: https://git.openjdk.org/jdk/pull/24574/files/4fb897ef..38004164 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24574&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24574&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24574/head:pull/24574 PR: https://git.openjdk.org/jdk/pull/24574 From ihse at openjdk.org Mon Apr 14 12:53:35 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 14 Apr 2025 12:53:35 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: > Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. > > The fix is basically simple, and includes the following steps: > * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already > * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: - Also document UTF-8 requirements (solves JDK-8338973) - Let configure only accept utf-8 locales - Address review comments from Kim ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24574/files - new: https://git.openjdk.org/jdk/pull/24574/files/38004164..452f42dc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24574&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24574&range=01-02 Stats: 47 lines in 7 files changed: 27 ins; 2 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/24574.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24574/head:pull/24574 PR: https://git.openjdk.org/jdk/pull/24574 From kbarrett at openjdk.org Mon Apr 14 12:53:53 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 14 Apr 2025 12:53:53 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v2] In-Reply-To: References: Message-ID: On Sun, 13 Apr 2025 23:14:41 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: > > Also tell javadoc that we have utf-8 now A couple of drive-by comments. Don't count me as a Reviewer for this. 
make/autoconf/flags-cflags.m4 line 577: > 575: elif test "x$TOOLCHAIN_TYPE" = xmicrosoft; then > 576: # The -utf-8 option sets source and execution character sets to UTF-8 to enable correct > 577: # compilation of all source files regardless of the active code page on Windows. Seems like this comment should be updated and moved near the new code block for setting up `CHARSET_CFLAGS`. make/common/JavaCompilation.gmk line 83: > 81: # The sed expression does this: > 82: # 1. Add a backslash before any :, = or ! that do not have a backslash already. > 83: # 3. Delete all lines starting with #. There is no item 2 anymore, so following bullets are misnumbered. ------------- Changes requested by kbarrett (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24574#pullrequestreview-2762999364 PR Review Comment: https://git.openjdk.org/jdk/pull/24574#discussion_r2041326051 PR Review Comment: https://git.openjdk.org/jdk/pull/24574#discussion_r2041328098 From ihse at openjdk.org Mon Apr 14 12:53:56 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 14 Apr 2025 12:53:56 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v2] In-Reply-To: References: Message-ID: On Sun, 13 Apr 2025 23:14:41 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: > > Also tell javadoc that we have utf-8 now Inspired by [Phil's comment in JDK-8353948](https://bugs.openjdk.org/browse/JDK-8353948?focusedId=14769043&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14769043), I also modified configure to only allow utf-8 environments, but to also allow `en_US.UTF-8` as a valid locale. This also resolves [JDK-8333247](https://bugs.openjdk.org/browse/JDK-8333247) in a better way. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2800741990 From naoto at openjdk.org Mon Apr 14 16:12:47 2025 From: naoto at openjdk.org (Naoto Sato) Date: Mon, 14 Apr 2025 16:12:47 GMT Subject: RFR: 8343157: Examine large files for character encoding/decoding In-Reply-To: References: Message-ID: On Fri, 11 Apr 2025 17:02:13 GMT, Naoto Sato wrote: > Removing old charset test cases that verify new charset implementations (as of JDK7). Removed tests/files are actual charset implementations used in pre-JDK7, which have been used for comparing the results. Since those "new" implementations have been used since then, I believe it is OK to retire those old test cases. Thanks for the review! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/24597#issuecomment-2802205991 From naoto at openjdk.org Mon Apr 14 16:12:47 2025 From: naoto at openjdk.org (Naoto Sato) Date: Mon, 14 Apr 2025 16:12:47 GMT Subject: Integrated: 8343157: Examine large files for character encoding/decoding In-Reply-To: References: Message-ID: On Fri, 11 Apr 2025 17:02:13 GMT, Naoto Sato wrote: > Removing old charset test cases that verify new charset implementations (as of JDK7). Removed tests/files are actual charset implementations used in pre-JDK7, which have been used for comparing the results. Since those "new" implementations have been used since then, I believe it is OK to retire those old test cases. This pull request has now been integrated. Changeset: d748bb5c Author: Naoto Sato URL: https://git.openjdk.org/jdk/commit/d748bb5cbb983fb07ae28e3a1c194058b73ef652 Stats: 164679 lines in 55 files changed: 0 ins; 164677 del; 2 mod 8343157: Examine large files for character encoding/decoding Reviewed-by: alanb ------------- PR: https://git.openjdk.org/jdk/pull/24597 From kbarrett at openjdk.org Mon Apr 14 17:36:47 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 14 Apr 2025 17:36:47 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: On Mon, 14 Apr 2025 12:53:35 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: > > - Also document UTF-8 requirements (solves JDK-8338973) > - Let configure only accept utf-8 locales > - Address review comments from Kim My comments have been addressed. Let's see if this is sufficient to clear my "request changes" state. ------------- PR Review: https://git.openjdk.org/jdk/pull/24574#pullrequestreview-2765099003 From serb at openjdk.org Tue Apr 15 23:23:46 2025 From: serb at openjdk.org (Sergey Bylokhov) Date: Tue, 15 Apr 2025 23:23:46 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: <7k8Vqbwnc5gQLdLWy6DMG3ReD0O68knX8T1OH4bdRZ8=.058d8240-f58f-4459-bd1e-e92981d6ae9b@github.com> On Mon, 14 Apr 2025 12:53:35 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). 
> > Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: > > - Also document UTF-8 requirements (solves JDK-8338973) > - Let configure only accept utf-8 locales > - Address review comments from Kim can we also force this rule by the jcheck? ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2807748235 From prr at openjdk.org Wed Apr 16 04:43:42 2025 From: prr at openjdk.org (Phil Race) Date: Wed, 16 Apr 2025 04:43:42 GMT Subject: RFR: 8354273: Restore even more pointless unicode characters to ASCII [v2] In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 10:36:31 GMT, Magnus Ihse Bursie wrote: >> As a follow-up to [JDK-8354213](https://bugs.openjdk.org/browse/JDK-8354213), I found some additional places where unicode characters are unnecessarily used instead of pure ASCII. > > Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: > > Remove incorrectly copied "?anchor" src/java.xml/share/legal/xhtml11.md line 50: > 48: or derived from [title and URI of the W3C document]." > 49: > 50: Disclaimers ?anchor Did that come from an upstream file ? test/jdk/java/awt/geom/Path2D/GetBounds2DPrecisionTest.java line 193: > 191: if (str.length() >= DIGIT_COUNT) { > 192: str = str.substring(0,DIGIT_COUNT-1)+"..."; > 193: } How did you test this ? Please say more than tiers 1-3 .. because this test isn't run until tier4. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24567#discussion_r2046043831 PR Review Comment: https://git.openjdk.org/jdk/pull/24567#discussion_r2046047435 From mdoerr at openjdk.org Wed Apr 16 07:56:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 16 Apr 2025 07:56:47 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: <_OtXyj0LCymmSCQhXmO-Ak_z5ZEYd5-tvqPp16TmXos=.8da4aecf-3538-4303-9b5a-2a59811642e0@github.com> On Mon, 14 Apr 2025 12:53:35 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: > > - Also document UTF-8 requirements (solves JDK-8338973) > - Let configure only accept utf-8 locales > - Address review comments from Kim We get the following problem on AIX: checking for locale to use... no UTF-8 locale found configure: error: No UTF-8 locale found. This is required for building successfully. configure exiting with result code 1 @varada1110, @JoKern65: Can you take a look, please? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2808717775 From ihse at openjdk.org Wed Apr 16 09:50:49 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 16 Apr 2025 09:50:49 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: <7k8Vqbwnc5gQLdLWy6DMG3ReD0O68knX8T1OH4bdRZ8=.058d8240-f58f-4459-bd1e-e92981d6ae9b@github.com> References: <7k8Vqbwnc5gQLdLWy6DMG3ReD0O68knX8T1OH4bdRZ8=.058d8240-f58f-4459-bd1e-e92981d6ae9b@github.com> Message-ID: On Tue, 15 Apr 2025 23:20:45 GMT, Sergey Bylokhov wrote: > can we also force this rule by the jcheck? Well, yes and no. First, we can verify that we do not have invalid UTF-8. That might be a signal that the encoding is wrong. But then this check needs to be able to distinguish between pure binary files that happen to look like improperly encoded UTF-8 files, and actually incorrectly encoded text files. In the end, this is likely to be more of a heuristic for a warning, rather than something we can block integration on. Secondly, files can have incorrect encodings but still pass as valid UTF-8. Only a human can tell that the content would be incorrect if we were to assume the encoding is UTF-8 instead of e.g. latin-1. This cannot be checked by jcheck, but must be caught by reviewers. I have been thinking, though, of adding a warning to jcheck for adding non-ASCII characters to known text file types. As a general rule, this is acceptable but should only be done judiciously, so it would be good to have jcheck point it out. That would also give you an extra chance to verify the encoding. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2809028487 From ihse at openjdk.org Wed Apr 16 09:55:40 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 16 Apr 2025 09:55:40 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: <_OtXyj0LCymmSCQhXmO-Ak_z5ZEYd5-tvqPp16TmXos=.8da4aecf-3538-4303-9b5a-2a59811642e0@github.com> References: <_OtXyj0LCymmSCQhXmO-Ak_z5ZEYd5-tvqPp16TmXos=.8da4aecf-3538-4303-9b5a-2a59811642e0@github.com> Message-ID: <6Kyy5kYllWxxLc6k2u-dF9dqmPcEQS74vEJO8rWG-D0=.0adee9b2-334c-473c-b0cc-1cbeb2774df6@github.com> On Wed, 16 Apr 2025 07:54:13 GMT, Martin Doerr wrote: > We get the following problem on AIX: > > ``` > checking for locale to use... no UTF-8 locale found > configure: error: No UTF-8 locale found. This is required for building successfully. > configure exiting with result code 1 > ``` This is (hopefully) more of a configuration issue than an issue with AIX per se. You can run `locale -a` to see all available locales, and see if there are any utf-8 locales at all. It might be that the naming scheme does not match `*.UTF-8`. Otherwise, you'd have to install the `C.UTF-8` or `en_US.UTF-8` locale. If no UTF-8 locales are available at all on AIX, then we might have to add some kind of exception. But beware that you will be building on an unsupported configuration in that case.
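To make the first half of the jcheck answer above concrete: checking whether a file's bytes are well-formed UTF-8 is easy to do mechanically, and the caveats Magnus mentions (binary files, and latin-1 content that happens to decode cleanly) are exactly what such a check cannot catch. A minimal Java sketch, assuming made-up names and not actual jcheck code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    // Returns true if the file's bytes decode as well-formed UTF-8.
    // This cannot distinguish a binary file from a mis-encoded text file,
    // and a latin-1 file may still happen to be well-formed UTF-8.
    static boolean isWellFormedUtf8(Path file) throws IOException {
        var decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(Files.readAllBytes(file)));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```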
------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2809035830 From mdoerr at openjdk.org Wed Apr 16 10:11:46 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 16 Apr 2025 10:11:46 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: <6Kyy5kYllWxxLc6k2u-dF9dqmPcEQS74vEJO8rWG-D0=.0adee9b2-334c-473c-b0cc-1cbeb2774df6@github.com> References: <_OtXyj0LCymmSCQhXmO-Ak_z5ZEYd5-tvqPp16TmXos=.8da4aecf-3538-4303-9b5a-2a59811642e0@github.com> <6Kyy5kYllWxxLc6k2u-dF9dqmPcEQS74vEJO8rWG-D0=.0adee9b2-334c-473c-b0cc-1cbeb2774df6@github.com> Message-ID: On Wed, 16 Apr 2025 09:51:49 GMT, Magnus Ihse Bursie wrote: > `locale -a` C POSIX en_US.8859-15 en_US.IBM-858 en_US.ISO8859-1 en_US I don't know if UTF-8 can be installed. If so, we should also document that as requirement for AIX build machines. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2809046398 From ihse at openjdk.org Wed Apr 16 10:11:52 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 16 Apr 2025 10:11:52 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: On Mon, 14 Apr 2025 12:53:35 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: > > - Also document UTF-8 requirements (solves JDK-8338973) > - Let configure only accept utf-8 locales > - Address review comments from Kim It's kind of a wonder that you have been able to build at all so far..! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2809055178 From ihse at openjdk.org Wed Apr 16 10:11:57 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 16 Apr 2025 10:11:57 GMT Subject: RFR: 8354273: Restore even more pointless unicode characters to ASCII [v2] In-Reply-To: References: Message-ID: On Wed, 16 Apr 2025 04:39:22 GMT, Phil Race wrote: >> Magnus Ihse Bursie has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove incorrectly copied "?anchor" > > src/java.xml/share/legal/xhtml11.md line 50: > >> 48: or derived from [title and URI of the W3C document]." >> 49: >> 50: Disclaimers ?anchor > > Did that come from an upstream file ? No, it is copy/pasted from a textual rendering of the html file specified in the URL above. This is what you get if you naïvely select the text in Firefox and press Ctrl-C. The `?anchor` part is not rendered on screen. > test/jdk/java/awt/geom/Path2D/GetBounds2DPrecisionTest.java line 193: > >> 191: if (str.length() >= DIGIT_COUNT) { >> 192: str = str.substring(0,DIGIT_COUNT-1)+"..."; >> 193: } > > How did you test this ? Please say more than tiers 1-3 .. because this test isn't run until tier4. I did not test tier4. Will do so now. Thanks!
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24567#discussion_r2046572753 PR Review Comment: https://git.openjdk.org/jdk/pull/24567#discussion_r2046573122 From mbaesken at openjdk.org Wed Apr 16 10:37:42 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Wed, 16 Apr 2025 10:37:42 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: On Mon, 14 Apr 2025 12:53:35 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: > > - Also document UTF-8 requirements (solves JDK-8338973) > - Let configure only accept utf-8 locales > - Address review comments from Kim make/autoconf/basic.m4 line 155: > 153: else > 154: AC_MSG_RESULT([no UTF-8 locale found]) > 155: AC_MSG_ERROR([No UTF-8 locale found. This is required for building successfully.]) Seems we run into this 'else' part on AIX checking for locale to use... no UTF-8 locale found configure: error: No UTF-8 locale found. This is required for building successfully. configure exiting with result code 1 maybe it would be nice to display the desired ones C.UTF-8 or en_US.UTF-8 in this message too for more clarity? (have to check if there are other names on AIX) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24574#discussion_r2046642699 From ihse at openjdk.org Wed Apr 16 13:44:52 2025 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 16 Apr 2025 13:44:52 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: On Wed, 16 Apr 2025 10:35:02 GMT, Matthias Baesken wrote: >> Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: >> >> - Also document UTF-8 requirements (solves JDK-8338973) >> - Let configure only accept utf-8 locales >> - Address review comments from Kim > make/autoconf/basic.m4 line 155: > >> 153: else >> 154: AC_MSG_RESULT([no UTF-8 locale found]) >> 155: AC_MSG_ERROR([No UTF-8 locale found. This is required for building successfully.]) > > Seems we run into this 'else' part on AIX > > > checking for locale to use... no UTF-8 locale found > configure: error: No UTF-8 locale found. This is required for building successfully. > configure exiting with result code 1 > > maybe it would be nice to display the desired ones C.UTF-8 or en_US.UTF-8 in this message too for more clarity? (have to check if there are other names on AIX) If you have a locale named `*.UTF-8` as your active locale, that will also be accepted, so it is not limited to C and en_US. But it might be an idea to include it in the error message, yes.
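For illustration, the acceptance rule discussed here (any active locale whose name ends in .UTF-8 is fine, with C.UTF-8 or en_US.UTF-8 as fallbacks) could be sketched in Java roughly as follows. This is not the actual basic.m4 logic, and locale-name spellings vary by platform (for example en_US.utf8 on some Linux distributions):

```java
import java.util.List;
import java.util.Locale;

public class LocaleCheck {
    // Rough sketch of the configure behavior described above (not the m4 code).
    static String pickUtf8Locale(String activeLocale, List<String> availableLocales) {
        if (activeLocale != null
                && activeLocale.toUpperCase(Locale.ROOT).endsWith(".UTF-8")) {
            return activeLocale; // any *.UTF-8 locale is accepted, not just C or en_US
        }
        for (String fallback : List.of("C.UTF-8", "en_US.UTF-8")) {
            if (availableLocales.contains(fallback)) {
                return fallback;
            }
        }
        return null; // no UTF-8 locale found: configure reports an error and stops
    }
}
```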
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24574#discussion_r2046971091 From naoto at openjdk.org Wed Apr 16 16:17:49 2025 From: naoto at openjdk.org (Naoto Sato) Date: Wed, 16 Apr 2025 16:17:49 GMT Subject: RFR: 8301971: Make JDK source code UTF-8 [v3] In-Reply-To: References: Message-ID: On Mon, 14 Apr 2025 12:53:35 GMT, Magnus Ihse Bursie wrote: >> Most of the JDK code base has been transitioned to UTF-8, but not all. This has recently become an acute problem, since our mixing of iso-8859-1 and utf-8 in properties files confused the version of `sed` that is shipped with the new macOS 15.4. >> >> The fix is basically simple, and includes the following steps: >> * Look through the code base for text files containing non-ASCII characters, and convert them to UTF-8, if they are not already >> * Update tooling used in building to recognize the fact that files are now in UTF-8 and treat them accordingly (basically, updating compiler flags, git attributes, etc). > > Magnus Ihse Bursie has updated the pull request incrementally with three additional commits since the last revision: > > - Also document UTF-8 requirements (solves JDK-8338973) > - Let configure only accept utf-8 locales > - Address review comments from Kim We will probably need to make sure things are ok on Windows as well (they are the other confusing environment) ------------- PR Comment: https://git.openjdk.org/jdk/pull/24574#issuecomment-2810074157
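On the Windows point, a small probe along these lines shows which encodings are actually in effect on a given machine, which is where the legacy code page can still surprise. This is illustrative only; the native.encoding property exists on JDK 17 and later:

```java
import java.nio.charset.Charset;

public class EncodingProbe {
    public static void main(String[] args) {
        // On Windows, native.encoding has historically been a legacy code page
        // (e.g. Cp1252), while file.encoding defaults to UTF-8 since JDK 18.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("native.encoding = " + System.getProperty("native.encoding"));
        System.out.println("defaultCharset  = " + Charset.defaultCharset());
    }
}
```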