[loc-en-dev] A ISO639 3-letter code which has a 2-letter code
Yoshito Umaoka
y.umaoka at gmail.com
Sun Mar 8 19:55:47 PDT 2009
The current JavaDoc for the Locale constructors defines the language
parameter as a lowercase two-letter ISO-639 code. The actual
implementation converts the language code string to lowercase, but it
does not check the length, nor whether it is a valid ISO639.1 language
code. So you can use any arbitrary string as a language.
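To make the point above concrete, here is a minimal sketch of what the
constructor does with its language argument. The class and method names
(LanguageCaseDemo, storedLanguage) are made up for illustration; only
java.util.Locale is a real API.

```java
import java.util.Locale;

// Illustration only: the constructor lowercases the language argument
// but does not check its length or whether it is a real ISO 639 code.
final class LanguageCaseDemo {

    // Hypothetical helper: returns the language string that Locale stores.
    static String storedLanguage(String input) {
        return new Locale(input).getLanguage();
    }

    public static void main(String[] args) {
        System.out.println(storedLanguage("EN"));      // en
        System.out.println(storedLanguage("ENG"));     // eng
        System.out.println(storedLanguage("Klingon")); // klingon
    }
}
```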
To support BCP47 language tags, we should update the description and
explain that valid BCP47 language subtags can also be used in the
existing constructors. (This change itself does not introduce any
implementation changes, just documentation.)
The Locale constructor maps the input language code "he" to "iw" for
stability reasons. Whether you use "he" or "iw" as the language code,
Locale#getLanguage() returns "iw" for the Hebrew language. This
implementation is tricky, but it also prevents two canonically
equivalent Locales from being created via the APIs.
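A small sketch of the "he"/"iw" behavior described above. It only checks
that both spellings end up as the same Locale; which of the two strings
getLanguage() reports is assumed to depend on the JDK version, so the
sketch prints it rather than hard-coding it. HebrewCodeDemo and
sameLocale are hypothetical names.

```java
import java.util.Locale;

// Illustration only: both spellings of the Hebrew language code are
// expected to produce equal Locale objects.
final class HebrewCodeDemo {

    // Hypothetical helper: true if the two language codes yield equal Locales.
    static boolean sameLocale(String a, String b) {
        return new Locale(a).equals(new Locale(b));
    }

    public static void main(String[] args) {
        Locale he = new Locale("he");
        Locale iw = new Locale("iw");
        // Both calls report the same code; which one is JDK-dependent.
        System.out.println(he.getLanguage() + " / " + iw.getLanguage());
        System.out.println(sameLocale("he", "iw")); // true
    }
}
```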
Now, we need to figure out what to do with ISO639 3-letter language
codes which have ISO639.1 2-letter codes. For example, English has the
ISO639.1 code "en" as well as the ISO639.2/639.3 code "eng". BCP47
itself prohibits using a 3-letter code for a language if it has a
2-letter version.
I think there are several possible options for this problem.
In Locale constructors -
1. Do nothing. Locale constructors do not check if an input 3-letter
language code has a 2-letter version.
2. Map. Locale constructors map a 3-letter language code to its
2-letter version if one is available.
In Builder#setLanguage
1. Do nothing. Builder only checks if the given language code is
well-formed (2*8ALPHA)
2. Map. Builder maps a 3-letter language code to its 2-letter
version if one is available.
3. Invalidate. Builder checks if the input 3-letter language code has a
2-letter version and throws an exception if one exists.
In Locale#toLanguageTag
1. Do nothing. toLanguageTag() only checks if the given language code
is well-formed (2*8ALPHA)
2. Map. toLanguageTag() maps a 3-letter language code to its
2-letter version if one is available.
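The "Do nothing", "Map" and "Invalidate" options above could be sketched
roughly as follows. This is not a proposed implementation: the class
LanguageCodeOptions and its methods are hypothetical, and the 3-to-2
table is simply derived from Locale.getISOLanguages() and
Locale#getISO3Language() as an assumption about where the mapping data
would come from.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Illustration only: three ways to handle a 3-letter language code
// that also has a 2-letter version.
final class LanguageCodeOptions {

    // Reverse table (3-letter -> 2-letter), built from the JDK's own data.
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();
    static {
        for (String two : Locale.getISOLanguages()) {
            try {
                THREE_TO_TWO.putIfAbsent(new Locale(two).getISO3Language(), two);
            } catch (MissingResourceException e) {
                // No 3-letter code known for this entry; skip it.
            }
        }
    }

    // "Do nothing": only check well-formedness, i.e. 2 to 8 ASCII letters.
    static boolean isWellFormed(String lang) {
        return lang.matches("[A-Za-z]{2,8}");
    }

    // "Map": replace a 3-letter code with its 2-letter version if one exists.
    static String map(String lang) {
        String lower = lang.toLowerCase(Locale.ROOT);
        return THREE_TO_TWO.getOrDefault(lower, lower);
    }

    // "Invalidate": reject a 3-letter code that has a 2-letter version.
    static String invalidate(String lang) {
        String lower = lang.toLowerCase(Locale.ROOT);
        if (THREE_TO_TWO.containsKey(lower)) {
            throw new IllegalArgumentException(
                "Use \"" + THREE_TO_TWO.get(lower) + "\" instead of \"" + lower + "\"");
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("eng")); // true
        System.out.println(map("eng"));          // en
        System.out.println(map("en"));           // en
    }
}
```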
I think the 3-to-2 mapping in ISO639 is practically frozen. If this is
true, we do not have any concerns about the mapping. I prefer to prevent
such canonically equivalent Locales from being created (that is, do the
mapping when a Locale is created by constructors and builders). Builder
is a new API, so we can do whatever we want. But without making this
change in the constructors, it does not make sense. If we can ignore the
behavior change, I prefer to do the mapping in the Locale constructors;
more specifically, new Locale("eng").getLanguage() changes from "eng" to
"en". (How much do we need to care about backward compatibility? The
use of a 3-letter code in the Locale constructor was illegal. Even if
there are applications setting a 3-letter language code in Locale, I
think they have no reason to use 3-letter codes when 2-letter
equivalents exist...)
If this behavior change in the Locale constructors is not acceptable, I
prefer to do nothing everywhere. In this case, the JDK just treats
"en_US" and "eng_US" as different Locales, and toLanguageTag produces
illegal BCP47 tags.
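For reference, a minimal sketch of the "do nothing" outcome, assuming
constructors that perform no 3-to-2 mapping. ThreeLetterDemo and
treatedAsEqual are hypothetical names used only for this illustration.

```java
import java.util.Locale;

// Illustration only: without any mapping, the 2-letter and 3-letter
// spellings of English produce two distinct Locale objects.
final class ThreeLetterDemo {

    // Hypothetical helper: compares two language codes under one country.
    static boolean treatedAsEqual(String lang1, String lang2, String country) {
        return new Locale(lang1, country).equals(new Locale(lang2, country));
    }

    public static void main(String[] args) {
        System.out.println(new Locale("en", "US"));             // en_US
        System.out.println(new Locale("eng", "US"));            // eng_US
        System.out.println(treatedAsEqual("en", "eng", "US"));  // false
    }
}
```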
Any suggestions?
-Yoshito