[loc-en-dev] A ISO639 3-letter code which has a 2-letter code

Yoshito Umaoka y.umaoka at gmail.com
Sun Mar 8 19:55:47 PDT 2009


The current JavaDoc for the Locale constructors defines the language
parameter as a lowercase two-letter ISO-639 code.  The actual
implementation converts the language code string to lowercase, but it
does not check the length or whether it is a valid ISO639.1 language
code.  So you can use any arbitrary string as a language.
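
A minimal sketch of that lenient behavior, for illustration only.  The
class and method names (LenientLanguageDemo, storedLanguage) are made up
for this example; only the Locale constructor and getLanguage() are real
JDK API.

```java
import java.util.Locale;

class LenientLanguageDemo {
    // Returns the language exactly as Locale stores it after construction.
    static String storedLanguage(String input) {
        return new Locale(input).getLanguage();
    }

    public static void main(String[] args) {
        // "Klingon" is neither two letters long nor an ISO639.1 code,
        // yet the constructor accepts it and only lowercases it.
        System.out.println(storedLanguage("Klingon"));
        // A valid code in the wrong case is lowercased the same way.
        System.out.println(storedLanguage("EN"));
    }
}
```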

To support BCP47 language tags, we should update the description and
explain that a valid BCP47 language subtag can also be used with the
existing constructors.  (This change itself does not introduce any
implementation changes, just documentation.)

The Locale constructor maps the input language code "he" to "iw" for
stability reasons.  No matter whether you use "he" or "iw" as the
language code, Locale#getLanguage() returns "iw" for the Hebrew
language.  This implementation is tricky, but it also prevents two
canonically equivalent Locales from being created via the APIs.
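
The effect can be observed with a small check like the one below.  This
is only a sketch: HebrewCodeDemo and sameLocale are hypothetical names,
and it deliberately tests that both spellings end up equal rather than
which code is stored, since the stored code may depend on the JDK
version and configuration.

```java
import java.util.Locale;

class HebrewCodeDemo {
    // True when both spellings of Hebrew produce equal Locale objects.
    static boolean sameLocale() {
        return new Locale("he").equals(new Locale("iw"));
    }

    public static void main(String[] args) {
        Locale he = new Locale("he");
        Locale iw = new Locale("iw");
        // Both lines print the same language code.
        System.out.println(he.getLanguage());
        System.out.println(iw.getLanguage());
        System.out.println(sameLocale());
    }
}
```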

Now, we need to figure out what to do with ISO639 3-letter language
codes which have ISO639.1 2-letter codes.  For example, English has the
ISO639.1 code "en" as well as the ISO639.2/639.3 code "eng".  BCP47
itself prohibits using a 3-letter code for a language if it has a
2-letter version.

I think there are several possible options for this problem.

In Locale constructors -

1. Do nothing.  Locale constructors do not check if an input 3-letter 
language code has a 2-letter version.
2. Map.  Locale constructors map a 3-letter language code to its
2-letter version if one is available.

In Builder#setLanguage

1. Do nothing.  Builder only checks if the given language code is 
well-formed (2*8ALPHA)
2. Map.  Builder maps a 3-letter language code to its 2-letter
version if one is available.
3. Invalidate.  Builder checks if the input 3-letter language code has a
2-letter version and throws an exception if one exists.

In Locale#toLanguageTag

1. Do nothing.  toLanguageTag() only checks if the given language code 
is well-formed (2*8ALPHA)
2. Map.  toLanguageTag() maps a 3-letter language code to its
2-letter version if one is available.
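
To make the "Map" and "Invalidate" options above concrete, here is a
rough sketch.  It is not a proposed implementation: ThreeToTwoSketch,
mapLanguage and rejectIfShorterExists are hypothetical names, and the
table is simply derived from Locale.getISOLanguages() and
getISO3Language(), so it only knows the 3-letter codes the running JDK
reports.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

class ThreeToTwoSketch {
    // Reverse table: 3-letter code -> 2-letter code, built from JDK data.
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();

    static {
        for (String two : Locale.getISOLanguages()) {
            try {
                String three = new Locale(two).getISO3Language();
                if (!three.isEmpty()) {
                    // Keep the first 2-letter code seen for each 3-letter code.
                    THREE_TO_TWO.putIfAbsent(three, two);
                }
            } catch (MissingResourceException e) {
                // No 3-letter code known for this entry; skip it.
            }
        }
    }

    // "Map" option: replace a 3-letter code by its 2-letter version, if any.
    static String mapLanguage(String language) {
        String lower = language.toLowerCase(Locale.ROOT);
        return THREE_TO_TWO.getOrDefault(lower, lower);
    }

    // "Invalidate" option: reject a 3-letter code that has a 2-letter version.
    static String rejectIfShorterExists(String language) {
        String lower = language.toLowerCase(Locale.ROOT);
        String two = THREE_TO_TWO.get(lower);
        if (two != null) {
            throw new IllegalArgumentException(
                "use \"" + two + "\" instead of \"" + lower + "\"");
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(mapLanguage("eng"));           // has a 2-letter version
        System.out.println(mapLanguage("haw"));           // Hawaiian: 3-letter only
        System.out.println(rejectIfShorterExists("haw")); // accepted unchanged
    }
}
```

With "Do nothing", neither helper would be called and the input would be
stored as given.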

I think the 3-to-2 mapping in ISO639 is practically frozen.  If this is
true, we do not have any concerns about the mapping.  I prefer to prevent
such canonically equivalent Locales from being created (that is, do the
mapping when a Locale is created by constructors and builders).  Builder
is a new API, so we can do whatever we want.  But without making this
change in the constructors, it does not make sense.  If we can ignore the
behavior change, I prefer to do the mapping in the Locale constructors;
more specifically, new Locale("eng").getLanguage() changes from "eng" to
"en".  (How much do we need to care about backward compatibility?  The
use of a 3-letter code in a Locale constructor was illegal.  Even if
there are applications setting a 3-letter language code in a Locale, I
think they have no reason to use 3-letter codes when 2-letter
equivalents exist...)

If this behavior change in the Locale constructors is not acceptable, I
prefer to do nothing everywhere.  In this case, the JDK just treats
"en_US" and "eng_US" as different Locales, and toLanguageTag produces
illegal BCP47 tags.
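
A small sketch of what "do nothing" means in practice.  DoNothingDemo
and treatedAsSame are hypothetical names.  The toLanguageTag() call
assumes a JDK that ships that method (Java 7 and later); its output is
printed without comment here because it depends on which of the options
above the JDK ended up implementing.

```java
import java.util.Locale;

class DoNothingDemo {
    // With no 3-to-2 mapping in the constructors, these two are not equal.
    static boolean treatedAsSame() {
        return new Locale("en", "US").equals(new Locale("eng", "US"));
    }

    public static void main(String[] args) {
        Locale two = new Locale("en", "US");
        Locale three = new Locale("eng", "US");
        System.out.println(two + " vs " + three);
        System.out.println(treatedAsSame());
        // Language tags produced for both Locales.
        System.out.println(two.toLanguageTag());
        System.out.println(three.toLanguageTag());
    }
}
```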

Any suggestions?

-Yoshito



