[loc-en-dev] Comments on the locale enhancement proposal

Mon Feb 2 12:02:07 PST 2009

Masayoshi Okutsu wrote:
 > On 1/21/2009 9:13 AM, Doug Felt wrote:
 >>
 >>
 >> On Tue, Jan 20, 2009 at 4:04 PM, Masayoshi Okutsu
 >> <Masayoshi.Okutsu at sun.com> wrote:
 >>
 >>     I think it's obvious that we can't support old data with new
 >>     identifiers perfectly, like zh_Hans_CN and zh_Hant_CN. When we
 >>     can't support both, I prefer to define a simple algorithm to
 >>     produce look-up sequences with minimum exceptions. [...]
 >>
 >>
 >> Can you define one so we can understand what cases you intend to
 >> handle and how?
 >
 > My preference is:
 >
 > (1) Treat language+script as a writingsystem which produces sequence
 > language_script -> language.
 >

If we forget about the legacy RB organization, this makes sense.
However, see my comments on the next item.

 > (2) Apply the traditional sequence production rule to
 > writingsystem_country_variant
 >
 > writingsystem_country_variant
 > writingsystem_country
 > writingsystem
 >
 > each of which produces language_script -> language. Therefore, the
 > entire sequence is:
 >
 > language_script_country_variant
 > language_country_variant
 > language_script_country
 > language_country
 > language_script
 > language
 >
 > For example, the sequence for zh_Hans_CN is:
 >
 > zh_Hans_CN
 > zh_CN
 > zh_Hans
 > zh
 >
 > while the proposed one is:
 >
 > zh_Hans_CN
 > zh_Hans
 > zh_CN
 > zh
 >

zh_Hans_CN -> zh_CN -> zh_Hans -> zh may work OK for this specific case.
However, when a country has two commonly used scripts, this order may
not work as we expect.  For example, take sr_Latn_RS.  With your
suggestion, the lookup order will be:

sr_Latn_RS
sr_RS
sr_Latn
sr

In general, the writing system is more important than the country
variant.  For Serbian as used in Serbia, Cyrillic is likely the default
script, so an existing sr_RS resource likely has Cyrillic content.
Someone may want to add a Latin variant alongside sr_RS, tag it
sr_Latn_RS, and add its parent sr_Latn.  With this lookup order,
sr_Latn would be hidden by sr_RS.

I think the question is which one is the better match for sr_Latn_RS:
sr_RS or sr_Latn.  In this case, I think sr_Latn is the answer.
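
To make the hiding effect concrete, here is a minimal Python sketch of
the two orderings discussed above.  The function names, the bundle
contents, and the keys are all made up for illustration; this is not
the lookup code of any existing implementation, only a model of the
candidate order.

```python
def country_first(lang, script, country):
    """Order from the quoted suggestion: country before script."""
    return [f"{lang}_{script}_{country}", f"{lang}_{country}",
            f"{lang}_{script}", lang]


def script_first(lang, script, country):
    """Order that prefers the writing system: script before country."""
    return [f"{lang}_{script}_{country}", f"{lang}_{script}",
            f"{lang}_{country}", lang]


def lookup(key, candidates, bundles):
    """Return the value from the first candidate bundle that defines key."""
    for name in candidates:
        bundle = bundles.get(name, {})
        if key in bundle:
            return bundle[key]
    return None


# Hypothetical sample data: a legacy sr_RS bundle with Cyrillic content,
# plus newly added sr_Latn_RS and its parent sr_Latn.
BUNDLES = {
    "sr_Latn_RS": {"currency": "RSD"},
    "sr_RS": {"greeting": "Здраво"},
    "sr_Latn": {"greeting": "Zdravo"},
    "sr": {"greeting": "Здраво"},
}

# "greeting" is not defined in sr_Latn_RS, so the parent order decides.
print(lookup("greeting", country_first("sr", "Latn", "RS"), BUNDLES))
print(lookup("greeting", script_first("sr", "Latn", "RS"), BUNDLES))
```

With the country-first order the Cyrillic value from sr_RS is returned
for a Latin-script request; with the script-first order the value comes
from sr_Latn.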

 > (3) If no script is given, the sequence is the same as the
 > traditional one.
 >
 > language_country_variant
 > language_country
 > language
 >

This does not work well unless we supply a default script for languages
that have two or more script variants.  For the request zh_HK, this
suggestion produces the following candidates:

zh_HK
zh

But people who want to distinguish scripts with the new framework may
have a zh_Hant_HK resource but not zh_HK.  When a language has multiple
commonly used scripts and one of them is dominant in a country, it is
desirable to expand the writing system without a script to the writing
system with a script (here, zh_HK -> zh_Hant_HK).
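
A minimal Python sketch of such an expansion follows.  The
DEFAULT_SCRIPT table, its entries, and the function name are
assumptions for illustration only, and the position of the expanded
candidates in the list is just one possible choice.

```python
# Hypothetical table: the dominant script for a (language, country) pair.
DEFAULT_SCRIPT = {
    ("zh", "HK"): "Hant",
    ("zh", "TW"): "Hant",
    ("zh", "CN"): "Hans",
}


def candidates(lang, country):
    """Build a candidate list for a request that carries no script.

    If a default script is known for the pair, the script-qualified
    names are added after the exact request (an assumed ordering).
    """
    result = [f"{lang}_{country}"]
    script = DEFAULT_SCRIPT.get((lang, country))
    if script is not None:
        result += [f"{lang}_{script}_{country}", f"{lang}_{script}"]
    result.append(lang)
    return result


print(candidates("zh", "HK"))
print(candidates("fr", "FR"))
```

For zh_HK this adds zh_Hant_HK and zh_Hant to the candidates, while a
request such as fr_FR, which has no entry in the table, keeps the
traditional sequence.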

 > (4) Exceptions are Norwegian and Hebrew.
 >
 > no_NO -> nb_NO -> no -> nb
 > no_NO_NY -> nn_NO -> no_NO -> nn -> no
 > nn_NO -> no_NO_NY -> nn -> no
 > nb_NO -> no_NO -> nb -> no
 >
 > he_IL -> iw_IL -> he -> iw
 > iw_IL -> he_IL -> iw -> he

I think we should distinguish the Norwegian case from the Hebrew case.
For Hebrew, he is exactly equivalent to iw.  For Norwegian, strictly
speaking, no could be nb or nn.  I'm fine with the Hebrew order above,
but I think the Norwegian case should be handled differently.
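
Because he and iw are exact equivalents, the Hebrew sequences quoted
above can be produced by following every candidate with its alias.  A
minimal Python sketch, where the ALIAS table and the function name are
hypothetical; it deliberately has no entry for Norwegian, since no is
not a one-to-one alias of nb or nn.

```python
# Hypothetical table of exact one-to-one language code aliases.
ALIAS = {"he": "iw", "iw": "he"}


def with_alias(lang, country):
    """Candidates for lang_country, each followed by its alias form."""
    other = ALIAS.get(lang)
    if other is None:
        # No exact alias: fall back to the traditional sequence.
        return [f"{lang}_{country}", lang]
    return [f"{lang}_{country}", f"{other}_{country}", lang, other]


print(with_alias("he", "IL"))
print(with_alias("iw", "IL"))
```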

-Yoshito