<i18n dev> Date parsing issues with SimpleDateFormat and DateTimeFormatter

Tue Oct 10 17:24:14 UTC 2023

Hi Lothar,

The issue you are seeing is one we have been aware of for a long 
time: https://bugs.openjdk.org/browse/JDK-8194289

The difference between JDK8 and JDK11 comes from the difference between 
JDK's legacy locale data and the CLDR locale data provided by the 
Unicode Consortium. The CLDR locale data became the default locale data 
in JDK9 (https://openjdk.org/jeps/252).
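
To see which abbreviations a given JDK actually supplies, a small probe 
along these lines can help. It is only a diagnostic sketch: the class 
name `ShortMonthProbe` is made up for illustration, and the output 
depends on the JDK version and on the `java.locale.providers` setting, 
so no particular result is implied.

```java
import java.text.DateFormatSymbols;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the JDK.
final class ShortMonthProbe {

    // Abbreviated German month names (formatting form) from the active
    // locale provider. The array may end with an empty entry for a 13th month.
    static String[] germanShortMonths() {
        return DateFormatSymbols.getInstance(Locale.GERMAN).getShortMonths();
    }

    // One-line summary: the provider setting followed by the abbreviations.
    static String describe() {
        String providers = System.getProperty("java.locale.providers", "(not set)");
        return "java.locale.providers=" + providers
                + " -> " + String.join(", ", germanShortMonths());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with the default settings and once with the legacy 
locale data selected shows whether the abbreviations differ on that JDK.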

The locale data changes from time to time, ranging from small changes 
such as revised translations to fairly significant ones, so the best 
course of action is not to rely on any specific locale data. However, I 
see that you need to parse dates that originate outside the JDK, so I 
understand that you have to adapt to whatever date format the app 
receives.

As an immediate, temporary solution, you could choose the legacy locale 
data over CLDR by setting the system property 
`java.locale.providers=COMPAT`, as mentioned in those JBS issues. For 
the longer term, you might want to implement 
`java.text.spi.DateFormatProvider` for your specific needs (in this 
case, parsing abbreviated German months without the trailing dot).

Please note that the system property value `COMPAT` was a temporary 
measure for migration, so it is deprecated as of JDK21 and will 
eventually be removed in a future JDK release.
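
If the set of month abbreviations is known in advance, one way to avoid 
depending on locale data at all is to hand the parser an explicit table. 
The sketch below uses `DateTimeFormatterBuilder.appendText` with a map, 
which is a different technique from the `DateFormatProvider` SPI 
mentioned above. The class name `GermanMonthParser` and the abbreviation 
table are assumptions for illustration; adjust the table to the data you 
actually receive.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the JDK.
final class GermanMonthParser {

    // Assumed abbreviations without trailing dot ("M\u00e4r" is "Mär").
    private static final String[] ABBREVIATIONS = {
        "Jan", "Feb", "M\u00e4r", "Apr", "Mai", "Jun",
        "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"
    };

    static final DateTimeFormatter FORMATTER = build();

    private static DateTimeFormatter build() {
        // Map month numbers 1..12 to the fixed abbreviations above.
        Map<Long, String> months = new HashMap<>();
        for (int i = 0; i < ABBREVIATIONS.length; i++) {
            months.put((long) (i + 1), ABBREVIATIONS[i]);
        }
        return new DateTimeFormatterBuilder()
                .appendPattern("d. ")                              // day of month, then ". "
                .appendText(ChronoField.MONTH_OF_YEAR, months)     // month from the fixed table
                .optionalStart().appendLiteral('.').optionalEnd()  // tolerate a trailing dot
                .appendPattern(" uuuu")                            // space, then year
                .toFormatter(Locale.ROOT);
    }

    static LocalDate parse(String text) {
        return LocalDate.parse(text, FORMATTER);
    }
}
```

With this formatter, `GermanMonthParser.parse` accepts the month with or 
without the trailing dot, independent of the locale data in use.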

Also, we are in the process of revising the old release notes with 
respect to the CLDR compatibility issues so that users will know about 
them beforehand.

HTH,
Naoto

On 10/10/23 3:47 AM, Lothar Kimmeringer wrote:
> Hi,
> 
> my first mail in this list, so please be gentle ;-)
> 
> I've encountered issues when trying to keep date parsing functionality
> when migrating from Java 8 to Java 11. This happened a while ago and I
> implemented local workarounds but with installations using more recent
> versions of Java 11 things broke again so I'm not sure if I'm simply
> doing things wrong or if there are actual bugs.
> 
> I've attached my JUnit class that contains the different issues (not as
> single tests, but I will highlight them here in this mail). "Here"
> SimpleDateFormat is used, but I've added a test that uses
> DateTimeFormatter to make sure that it's not the use of old classes and
> that the problem persists in the new API as well.
> 
> Most issues come up when trying to parse abbreviated months with Locales
> different from "en". Our use case is that data with the same date layout
> but different Locales is parsed (e.g. Ebay revenue summary CSV files or
> FTP servers on German Windows installations). The dates used there are
> of the form
> 
>   - Ebay: 18. Mär 2023
>   - FTP server: Mär 14  2022
> 
> This worked well with MMM in the template until Java 8. Then LLL got
> introduced, and MMM now leads to a four-character abbreviation that
> includes a dot. Btw: I think the Javadoc that explains the template
> parts (e.g. in SimpleDateFormat) should have an additional column
> containing an example for a non-EN Locale, because
> 
> M    Month in year (context sensitive)  Month  July; Jul; 07
> L    Month in year (standalone form)    Month  July; Jul; 07
> 
> isn't helping at all to see the effect of these two template-parts, so e.g.
> 
> M    Month ... (context...)       Month   January;...   Januar; Jan.; 01
> L    Month ... (standalone...)    Month   January;...   Januar; Jan; 01
> 
> might be better for understanding it.
> 
> With the use of LLL all tests with dates without a dot can now be parsed
> again using the same mask. But it's not possible to parse a date where
> the month is always abbreviated with a dot in a consistent way, e.g.
> 
> 23. Dez. 2016 11:12:13.456
> using the template
> dd. LLL. yyyy HH:mm:ss.SSS
> 
> It works with Locale en (with "Dec" as month of course) but not with "de".
> 
> The reason is that SimpleDateFormat uses all month display names when
> parsing "month standalone". That also includes the abbreviated months
> with dots. Because these names are in general longer than their
> standalone counterparts (except three-letter months like "Mai" in
> German), matchString considers them the best match, "consuming" the dot
> in the text to be parsed, which is then missing when the parsing
> continues.
> 
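
The longest-match effect described above can be illustrated with a 
simplified model. This is not the JDK's actual `matchString` code; the 
class name `LongestMatchDemo` and the candidate lists are assumptions 
chosen to mirror the "Dez" versus "Dez." case.

```java
import java.util.List;

// Simplified model for illustration; not the JDK implementation.
final class LongestMatchDemo {

    // Returns the longest candidate that the text starts with at the
    // given offset, or null if no candidate matches.
    static String longestMatch(String text, int offset, List<String> candidates) {
        String best = null;
        for (String candidate : candidates) {
            if (text.startsWith(candidate, offset)
                    && (best == null || candidate.length() > best.length())) {
                best = candidate;
            }
        }
        return best;
    }

    // Models the template fragment "LLL. ": a month name followed by the
    // literal ". ". Fails if the month match already swallowed the dot.
    static boolean parsesMonthThenDot(String text, List<String> monthNames) {
        String month = longestMatch(text, 0, monthNames);
        if (month == null) {
            return false;
        }
        return text.startsWith(". ", month.length());
    }
}
```

With only "Dez" as a candidate, the literal dot is still available after 
the month. Once "Dez." is among the candidates as well, it wins as the 
longer match and the following literal dot can no longer be found.
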
> DateTimeFormatter seems to work differently because it's not failing at
> that point (haven't debugged it), but it is failing when trying to parse
> Russian dates without abbreviating dots. I assume that is because the
> ru Locale doesn't seem to have values for the standalone month. I could
> live with that given our user base, but the parser in java.time runs
> into problems when parsing a time with milliseconds: you need to provide
> as many "S" as there are digits in the value:
> 
>   - "23. Dec. 2016 11:12:13.456" needs "dd. LLL. yyyy HH:mm:ss.SSS",
>     it doesn't work with "dd. LLL. yyyy HH:mm:ss.S"
>   - "23. Dec. 2016 11:12:13.4"   needs "dd. LLL. yyyy HH:mm:ss.S",
>     it doesn't work with "dd. LLL. yyyy HH:mm:ss.SSS"
> 
> When handling data from different sources where one source cuts off
> trailing zeros and the other doesn't, you essentially need to inspect
> the text first in order to choose the correct template for parsing it.
> 
> SimpleDateFormat parses the date correctly independent from the
> number of "S" in the template and the actual number of digits
> in the text to be parsed.
> 
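
One possible way to accept a varying number of fraction digits in 
java.time is to build the formatter with 
`DateTimeFormatterBuilder.appendFraction` instead of a fixed run of "S" 
letters. The sketch below covers only the time-of-day part so that it 
does not depend on any locale data; the class name 
`FlexibleFractionParser` is an assumption for illustration.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the JDK.
final class FlexibleFractionParser {

    // "HH:mm:ss" followed by an optional decimal point and 0 to 9
    // fraction digits, instead of a fixed-width "SSS".
    static final DateTimeFormatter TIME = new DateTimeFormatterBuilder()
            .appendPattern("HH:mm:ss")
            .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
            .toFormatter(Locale.ROOT);

    static LocalTime parse(String text) {
        return LocalTime.parse(text, TIME);
    }
}
```

The same `appendFraction` call can follow any other pattern prefix, so a 
date portion can be placed in front of it in the same builder.
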
> While my lengthy explanation of the problems with LLL might result in
> the answer "not a bug, go away" ;-) I definitely see the milliseconds
> issue in java.time.* as one.
> 
> 
> Thanks for reading this far and best regards,
> 
> Lothar Kimmeringer