From y.umaoka at gmail.com Mon Feb 2 12:00:11 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 02 Feb 2009 15:00:11 -0500
Subject: [loc-en-dev] locale extensions
Message-ID: <498750CB.10704@gmail.com>

In the Locale Enhancement proposal, we're proposing several new locale
elements - script and keywords. In the BCP47 language tag specification,
the available elements are language, script, country, variant and
extensions. In the LDML specification, keywords are mapped to BCP47
extensions. For example, the Unicode Locale Identifier

  es_ES@collation=traditional

is mapped to the BCP47 language tag

  es-ES-k-collatio-traditio

(Note: each segment in a BCP47 language tag must be up to 8*ALPHA, so
collation is truncated to "collatio" and traditional is truncated to
"traditio".)

This assumes "k" is reserved for LDML keywords. In future, other
applications may register a letter for their own use.

One of the goals for this project is to support compatibility with
BCP47. In BCP47, a keyword is one type of extension. From the design
point of view, I propose to store all BCP47 extensions in a single
object - LocaleExtensions. LocaleExtensions may contain keywords as well
as other extensions. LocaleExtensions is a field in the Locale class. An
instance of LocaleExtensions is created only via LocaleBuilder.

Here are my assumptions about extensions -

For keywords
 - Multiple keywords are allowed
 - No two keywords have the same name
   (calendar=islamic;calendar=gregorian is invalid)
 - toString converts keywords to @name1=value1;name2=value2, for
   example, @calendar=buddhist;number=thai.

For other extensions
 - Multiple extensions are allowed
 - No two extensions have the exact same values
 - toString converts these extensions to @letter=value. For example,
   BCP47 en-US-a-yoshito will be converted to en_US@a=yoshito

The major difference between keywords and other extensions is -

 - A keyword in BCP47 is always represented by -k-<name>-<value>.
   In Locale#toString(), it is converted to <name>=<value>
 - Other extensions are represented by -<letter>-<value>[-<value>...].
   In Locale#toString(), it is converted to <letter>=<value>[-<value>...].

Do you see any problems with the above assumptions/behavior?

-Yoshito

From y.umaoka at gmail.com Mon Feb 2 12:02:07 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 02 Feb 2009 15:02:07 -0500
Subject: [loc-en-dev] Comments on the locale enhancement proposal
Message-ID: <4987513F.1020303@gmail.com>

Masayoshi Okutsu wrote:
> On 1/21/2009 9:13 AM, Doug Felt wrote:
>>
>>
>> On Tue, Jan 20, 2009 at 4:04 PM, Masayoshi Okutsu wrote:
>>
>>     I think it's obvious that we can't support old data with new
>>     identifiers perfectly, like zh_Hans_CN and zh_Hant_CN. When we
>>     can't support both, I prefer to define a simple algorithm to
>>     produce a look-up sequences with minimum exceptions. [...]
>>
>>
>> Can you define one so we can understand what cases you intend to handle and how?
>
> My preference is:
>
> (1) Treat language+script as a writingsystem which produces sequence language_script -> language.
>

If we forget about legacy RB organization, it makes sense. However..
see my comments for the next item.

> (2) Apply the traditional sequence production rule to writingsystem_country_variant
>
>     writingsystem_country_variant
>     writingsystem_country
>     writingsystem
>
> each of which produces language_script -> language. Therefore, the entire sequence is:
>
>     language_script_country_variant
>     language_country_variant
>     language_script_country
>     language_country
>     language_script
>     language
>
> For example, the sequence for zh_Hans_CN is:
>
>     zh_Hans_CN
>     zh_CN
>     zh_Hans
>     zh
>
> while the proposed one is:
>
>     zh_Hans_CN
>     zh_Hans
>     zh_CN
>     zh
>

zh_Hans_CN -> zh_CN -> zh_Hans -> zh may work OK for this specific case.
However, when a country has two commonly used scripts, this order may
not work as we expect. For example, let's see sr_Latn_RS.
With your suggestion, the order of lookup will be -

  sr_Latn_RS
  sr_RS
  sr_Latn
  sr

In general, writing system is more important than country variant. For
Serbian used in Serbia, Cyrillic script is likely used as the default
script. Therefore, the existing resource sr_RS likely has Cyrillic
contents. Some may want to add a Latn variant along with sr_RS, tag it
sr_Latn_RS, and add its parent sr_Latn. With this lookup order, sr_Latn
may be hidden by sr_RS.

I think we're talking about which one matches better for sr_Latn_RS -
sr_RS or sr_Latn. And, in this case, I think sr_Latn is the answer.

> (3) If no script is given, the sequence is the same as the traditional one.
>
>     language_country_variant
>     language_country
>     language
>

This does not work well unless we supply a default script for languages
which have two or more script variants. For a request zh_HK, this
suggestion produces the following candidates -

  zh_HK
  zh

But people who want to distinguish scripts with the new framework may
have resources for zh_Hant_HK, but not zh_HK. When a language has
multiple commonly used script variants and one of them is dominant in a
country, the expansion from writing system (without script) to writing
system with script is desired.

> (4) Exceptions are Norwegian and Hebrew.
>
>     no_NO -> nb_NO -> no -> nb
>     no_NO_NY -> nn_NO -> no_NO -> nn -> no
>     nn_NO -> no_NO_NY -> nn -> no
>     nb_NO -> no_NO -> nb -> no
>
>     he_IL -> iw_IL -> he -> iw
>     iw_IL -> he_IL -> iw -> he

I think we should distinguish the Norwegian case from the Hebrew case.
For Hebrew, he is exactly equal to iw. For Norwegian, strictly speaking,
no could be nb or nn. I'm fine with the order for Hebrew above. But I
think the Norwegian case should be handled differently.
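
The two candidate orders compared in this message can be sketched in a
few lines of Java. This is only an illustration under stated
assumptions: the class and method names (LookupOrderSketch,
countryBeforeScript, scriptBeforeCountry) are hypothetical, variants are
left out to keep it short, and it is not the lookup code of any JDK
release.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LookupOrderSketch {

    /** Joins the non-empty parts with '_' (e.g. "sr", "", "RS" gives "sr_RS"). */
    static String join(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            if (p == null || p.isEmpty()) continue;
            if (sb.length() > 0) sb.append('_');
            sb.append(p);
        }
        return sb.toString();
    }

    /** Order from items (1)-(2): at each level, try with script, then without. */
    static List<String> countryBeforeScript(String lang, String script, String country) {
        Set<String> out = new LinkedHashSet<>(); // keeps order, drops duplicates
        out.add(join(lang, script, country));
        out.add(join(lang, country));
        out.add(join(lang, script));
        out.add(lang);
        return new ArrayList<>(out);
    }

    /** Order preferred in this reply: keep the script as long as possible. */
    static List<String> scriptBeforeCountry(String lang, String script, String country) {
        Set<String> out = new LinkedHashSet<>();
        out.add(join(lang, script, country));
        out.add(join(lang, script));
        out.add(join(lang, country));
        out.add(lang);
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        // sr_RS (likely Cyrillic) is consulted before sr_Latn here ...
        System.out.println(countryBeforeScript("sr", "Latn", "RS"));
        // ... while here sr_Latn is consulted before sr_RS.
        System.out.println(scriptBeforeCountry("sr", "Latn", "RS"));
    }
}
```

With an empty script both methods fall back to the traditional
language_country -> language sequence, which is the case item (3)
describes.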
-Yoshito

From Naoto.Sato at Sun.COM Mon Feb 2 13:10:35 2009
From: Naoto.Sato at Sun.COM (Naoto Sato)
Date: Mon, 02 Feb 2009 13:10:35 -0800
Subject: [loc-en-dev] Comments to the draft spec
Message-ID: <4987614B.9020700@Sun.COM>

Hello,

I am very sorry for the delay, but here are some of my comments/thoughts
regarding the draft spec. Since we are shooting for JDK7, I would like
to do something minimal, as Masayoshi said.

- I would like to keep the current constructors of the Locale class
  intact. New Locale instances that represent BCP47 or LDML should only
  be built by the Locale Builder. I believe this is clear enough to
  developers.

- IDType "JAVA" may not be needed. Current Java locales can already be
  instantiated through the existing Locale constructors. Moreover, I
  don't think we need to support locales such as "ja_JP_JP"/"no_NO_NY"
  in the new builder, which would make the implementation more
  complicated.

- Question: should we support both BCP47 and LDML ids in the API spec?
  Can we just have LDML ids as an implementation? E.g., if the id has
  -k-calendar-japanese, we use the Japanese calendar. If we want LDML in
  the API, we could add it after JDK7.

- Make sure that the new locale builder requires the "language" element,
  which is not true for the existing constructors.

Thanks,
--
Naoto Sato

From y.umaoka at gmail.com Mon Feb 2 13:29:42 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 02 Feb 2009 16:29:42 -0500
Subject: [loc-en-dev] Comments to the draft spec
In-Reply-To: <4987614B.9020700@Sun.COM>
References: <4987614B.9020700@Sun.COM>
Message-ID: <498765C6.5090804@gmail.com>

Naoto Sato wrote:
> Hello,
>
> I am very sorry for the delay, but here are some of my comments/thoughts
> regarding the draft spec. Since we are shooting for the JDK7, I would
> like to do something minimum, as Masayoshi said.
>
> - I would like to keep the current constructors of the Locale class
> intact.
> The new Locale instances that represent BCP47 or LDML should
> only be built by the Locale Builder. I believe this is clear enough to
> developers.

Agree. No changes in the existing Locale constructors and their
behavior.

>
> - IDType "JAVA" may not be needed. Current Java locales can already be
> instantiated through the existing Locale constructors. Moreover I don't
> think we need to support such as "ja_JP_JP"/"no_NO_NY" in the new
> builder, which will make the implementation more complicated.
>

Yes, IDType "JAVA" might be redundant. Actually, what we probably need
is just BCP47.

I'm not sure that not supporting "ja_JP_JP"/"no_NO_NY" in the builder
makes sense. The builder should be a "super" interface for creating any
possible Locale instance.

> - Question: should we support both BCP47 and LDML ids in the API spec?
> Can we just have LDML ids as an implementation? E.g., if the id has
> -k-calendar-japanese, we use Japanese calendar. If we want LDML in the
> API, we could add it after JDK7.
>

I think most people are interested in the compatibility between Java
Locales and BCP47 language tags. The immediate requirement is to get a
Locale from a language tag without loss of data.

An LDML keyword is just one implementation of BCP47 extensions. At
least, I think it's important to make Locale store "extensions" without
information loss. Making i18n service classes use the information
(calendar=japanese) is yet another topic, and we could defer a part of
this effort until after JDK7.

> - Make sure that the new locale builder requires to have the "language"
> element, which is not true for the existing constructors.
>

I'm not sure it makes sense. We could do this, but are there any
problems with a "language"-less Locale? In the CLDR project, we realized
we need such a concept for handling locale-specific behavior that
depends purely on region. We moved some data out of the legacy
language_region inheritance tree into another structure. For example,
currency and calendar types are in this category.
We could use "und" (undefined language) for such Locales. Could you
describe any "problematic" cases with an empty language?

-Yoshito

From Naoto.Sato at Sun.COM Mon Feb 2 14:06:10 2009
From: Naoto.Sato at Sun.COM (Naoto Sato)
Date: Mon, 02 Feb 2009 14:06:10 -0800
Subject: [loc-en-dev] Comments to the draft spec
In-Reply-To: <498765C6.5090804@gmail.com>
References: <4987614B.9020700@Sun.COM> <498765C6.5090804@gmail.com>
Message-ID: <49876E52.6020008@Sun.COM>

Yoshito-san, comments inline.

Yoshito Umaoka wrote:
> Naoto Sato wrote:
>> Hello,
>>
>> I am very sorry for the delay, but here are some of my
>> comments/thoughts regarding the draft spec. Since we are shooting for
>> the JDK7, I would like to do something minimum, as Masayoshi said.
>>
>> - I would like to keep the current constructors of the Locale class
>> intact. The new Locale instances that represent BCP47 or LDML should
>> only be built by the Locale Builder. I believe this is clear enough
>> to developers.
>
> Agree. No changes in the existing Locale constructors and its behavior.
>
>>
>> - IDType "JAVA" may not be needed. Current Java locales can already
>> be instantiated through the existing Locale constructors. Moreover I
>> don't think we need to support such as "ja_JP_JP"/"no_NO_NY" in the
>> new builder, which will make the implementation more complicated.
>>
>
> Yes, IDType "JAVA" might be redundant. Actually, what we probably need
> is just BCP47.
>
> I'm not sure not supporting "ja_JP_JP"/"no_NO_NY" in the builder makes
> sense. The builder should be "super" interface for creating any
> possible Locale instances.

If it's BCP47, "ja_JP_JP" or "no_NO_NY" is illegal because the variant
subtag cannot be a two-letter code. I think limiting the builder
strictly to BCP47 is OK because developers can always use the old
constructors.

>
>> - Question: should we support both BCP47 and LDML ids in the API spec?
>> Can we just have LDML ids as an implementation?
>> E.g., if the id has
>> -k-calendar-japanese, we use Japanese calendar. If we want LDML in
>> the API, we could add it after JDK7.
>>
>
> I think most of people are interested in the compatibility between Java
> Locales and BCP47 language tags. Immediate requirement is to get a
> Locale from a language tag without loss of data.
>
> LDML keyword is just an implementation of BCP47 extensions. At least, I
> think it's important to make Locale to store "extensions" without
> information loss. Making i18n service classes to use the information
> (calendar=japanese) is yet another topic and we could defer a part of
> this effort after JDK7.

Does LDML define the mappings between, say, "-k-collatio-traditio" and
"@collation=traditional"? If it's clearly defined in LDML, we can just
say that the Locale class implements the -k extension as LDML's
keywords.

>
>> - Make sure that the new locale builder requires to have the
>> "language" element, which is not true for the existing constructors.
>>
>
> I'm not sure it makes sense. We could do this, but any problems with
> "language" less Locale? In the CLDR project, we realized we need such
> concept for handling locale specific behavior purely depending on
> region. We moved out some data from the legacy language_region
> inheritance tree to another structure. For example, currency, calendar
> types are in this category.
>
> We could use "und" (undefined language) for such Locales. Could you
> describe any "problematic" cases with empty language?

The reason is, again, BCP47 conformance. If I understand it correctly,
the language subtag is mandatory in BCP47.
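
The "-k-collatio-traditio" mapping asked about earlier in this message
can be sketched as follows. This is a minimal illustration of the
truncation scheme as described in this thread, not the LDML
specification text: the class name KeywordTruncationSketch, its methods,
and the sample keyword "traditionalist" are all hypothetical. The
forward direction is mechanical, while the inverse direction needs a
table of known full names, and such a table cannot be built when two
names share their first 8 characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeywordTruncationSketch {

    // Maximum subtag length assumed in this thread.
    static final int MAX_SUBTAG_LENGTH = 8;

    /** Cuts a keyword name or value down to at most 8 characters. */
    static String truncate(String s) {
        return s.length() <= MAX_SUBTAG_LENGTH ? s : s.substring(0, MAX_SUBTAG_LENGTH);
    }

    /** Forward mapping: keyword name/value to a "k-name-value" extension. */
    static String toExtension(String name, String value) {
        return "k-" + truncate(name) + "-" + truncate(value);
    }

    /**
     * Inverse mapping support: builds a truncated-to-full lookup table from
     * a list of known full names. Fails if two different names collide
     * after truncation, because the round trip would then be ambiguous.
     */
    static Map<String, String> inverseTable(String... fullNames) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String full : fullNames) {
            String previous = table.put(truncate(full), full);
            if (previous != null && !previous.equals(full)) {
                throw new IllegalStateException(
                        "'" + previous + "' and '" + full + "' share the same first 8 characters");
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // prints "es-ES-k-collatio-traditio"
        System.out.println("es-ES-" + toExtension("collation", "traditional"));

        // prints "collation"
        Map<String, String> names = inverseTable("collation", "calendar");
        System.out.println(names.get("collatio"));
    }
}
```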
Thanks,
--
Naoto Sato

From y.umaoka at gmail.com Mon Feb 2 14:35:58 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 02 Feb 2009 17:35:58 -0500
Subject: [loc-en-dev] Comments to the draft spec
In-Reply-To: <49876E52.6020008@Sun.COM>
References: <4987614B.9020700@Sun.COM> <498765C6.5090804@gmail.com> <49876E52.6020008@Sun.COM>
Message-ID: <4987754E.4070601@gmail.com>

Sato-san, I added my comments below -

> If it's BCP47, "ja_JP_JP" or "no_NO_NY" is illegal because the variant
> subtag cannot be two-letter code. I think limiting builder strictly to
> BCP47 is OK because developers can always use the old constructors.

I got your point. Yes, ja-JP-JP is illegal unless variant JP is
registered in the IANA registry. It could be mapped to ja-JP-x-JP if my
understanding is correct.

I do not like to invalidate existing Java Locales just for this reason.
There are two possible solutions here -

1. Register these Java proprietary enhancements in the IANA registry.
2. Do not apply strict validation when a Locale is created, but handle
   it when converting to a BCP47 language tag.

>
> Does LDML define the mappings between, say "-k-collatio-traditio" and
> "@collation=traditional"? If it's clearly defined in the LDML, we can
> just say, Locale class implements -k extension as LDML's keywords.
>

It's defined in the latest LDML spec. The CLDR team actually should
register the letter "k" for this purpose in the language tag registry,
which is not yet done. We had already agreed to register "k" in the
language tag registry.

I actually raised an issue about the truncation and we'll discuss this
in the CLDR meeting tomorrow. The problem is that an LDML keyword can be
systematically mapped to a BCP47 extension, but you need the full
keyword name/value list for the inverse mapping.
For example, "-k-collatio-traditio" is mapped to "collation=traditional"
in LDML, but you need to know that "collatio" is the truncated form of
the LDML keyword "collation" and "traditio" is the truncated form of the
LDML keyword value "traditional".

> The reason is, again BCP47 conformance. If I understand it correctly,
> language subtag is mandatory in BCP47.

I think so too. But, as I mentioned, language "und" is valid in BCP47.
So an empty language code in a Locale can be interpreted as "und" in a
BCP47 language tag.

-Yoshito

From y.umaoka at gmail.com Mon Feb 2 14:42:57 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 02 Feb 2009 17:42:57 -0500
Subject: [loc-en-dev] Comments on the locale enhancement proposal
In-Reply-To: <4976667D.2010904@sun.com>
References: <49758A95.2080909@sun.com> <146f39a80901201226u740f3a0bn7dde09b6f0817bcd@mail.gmail.com> <4976667D.2010904@sun.com>
Message-ID: <498776F1.3070307@gmail.com>

Okutsu-san, I'd like to confirm your statement below.

Masayoshi Okutsu wrote:
> Comments inline.
>
> On 1/21/2009 5:26 AM, Doug Felt wrote:
>> Comments inline.
>>
>> On Tue, Jan 20, 2009 at 12:25 AM, Masayoshi Okutsu wrote:
>>
>>
>>
>>     My proposal is:
>>
>>       * the existing interfaces should be kept fully compatible in
>>         both binaries and source code.
>>
>> Can you define more precisely what you mean? Do you mean no API
>> additions to the Locale class?
>
> No semantic or behavior changes to the existing methods. I saw the
> toString() behavior change in the 6.1 Script table and the semantic
> changes to the Locale constructors. We should just describe the current
> behavior where it's necessary.
>
> It's OK to add new methods to Locale.
>

Let's assume an instance of Locale is created from the language tag
"zh-Hans-CN". The proposal suggests that Locale#toString() return
"zh_Hans_CN". Do you think this behavior is problematic?
Are you suggesting to add a new method, for example, Locale#getID() to
return "zh_Hans_CN", but not to put the script "Hans" and the extra
separator "_" in the result of #toString()?

Thanks,
Yoshito

From Naoto.Sato at Sun.COM Mon Feb 2 16:48:41 2009
From: Naoto.Sato at Sun.COM (Naoto Sato)
Date: Mon, 02 Feb 2009 16:48:41 -0800
Subject: [loc-en-dev] Comments to the draft spec
In-Reply-To: <4987754E.4070601@gmail.com>
References: <4987614B.9020700@Sun.COM> <498765C6.5090804@gmail.com> <49876E52.6020008@Sun.COM> <4987754E.4070601@gmail.com>
Message-ID: <49879469.202@Sun.COM>

Yoshito Umaoka wrote:
> Sato-san, I added my comments below -
>
>> If it's BCP47, "ja_JP_JP" or "no_NO_NY" is illegal because the variant
>> subtag cannot be two-letter code. I think limiting builder strictly
>> to BCP47 is OK because developers can always use the old constructors.
>
> I got your point. Yes, ja-JP-JP is illegal unless variant JP is
> registered in IANA registry. It could be mapped to ja-JP-x-JP if my
> understanding is correct.
>
> I do not like to invalidate existing Java Locales just for this reason.
> There are two possible solutions here -
>
> 1. Register these Java's proprietary enhancement to the IANA registry.
> 2. Do not apply strict validation when a Locale is created, but handle
> it when converting to BCP47 language tag.

I prefer 1. if possible. But I don't know whether it's feasible
time-wise.

>
>>
>> Does LDML define the mappings between, say "-k-collatio-traditio" and
>> "@collation=traditional"? If it's clearly defined in the LDML, we can
>> just say, Locale class implements -k extension as LDML's keywords.
>>
>
> It's defined in the latest LDML spec. CLDR team actually should
> register letter "k" for the purpose in the language tag registry, which
> is not yet done. We had already agreed to register "k" to the language
> tag registry.
>
> I actually raised an issue about the truncation and we'll discuss this
> in the CLDR meeting tomorrow.
> The problem is that an LDML keyword can
> be systematically mapped to BCP47 extension, but you need full keyword
> name /value list for the inverse mapping. For example,
> "-k-collatio-traditio" is mapped to "collation=traditional" in LDML, but
> you need to know "collatio" is the truncated form of LDML "collation",
> "traditio" is the truncated form of LDML keyword value "traditional".

Right. That's exactly what I meant. If "-k" is registered as a keyword
for LDML, the LDML spec can just define
"-k-collatio-[traditio | whatever]", so that the round trip is ensured.
BTW, what happens if two LDML keywords that have the same first 8
characters are mapped to the BCP47 -k extension?

>
>> The reason is, again BCP47 conformance. If I understand it correctly,
>> language subtag is mandatory in BCP47.
>
> I think so too. But, as I mentioned, language "und" is valid in BCP47.
> So empty language code in a Locale can be interpreted as "und" in BCP47
> language tag.

I am fine with it.

Thanks,
--
Naoto Sato

From Masayoshi.Okutsu at Sun.COM Mon Feb 2 19:11:54 2009
From: Masayoshi.Okutsu at Sun.COM (Masayoshi Okutsu)
Date: Tue, 03 Feb 2009 12:11:54 +0900
Subject: [loc-en-dev] Comments on the locale enhancement proposal
In-Reply-To: <498776F1.3070307@gmail.com>
References: <49758A95.2080909@sun.com> <146f39a80901201226u740f3a0bn7dde09b6f0817bcd@mail.gmail.com> <4976667D.2010904@sun.com> <498776F1.3070307@gmail.com>
Message-ID: <4987B5FA.1000404@sun.com>

On 2/3/2009 7:42 AM, Yoshito Umaoka wrote:
> Okutsu-san, I'd like to confirm your statement below.
>
> Masayoshi Okutsu wrote:
>> Comments inline.
>>
>> On 1/21/2009 5:26 AM, Doug Felt wrote:
>>> Comments inline.
>>>
>>> On Tue, Jan 20, 2009 at 12:25 AM, Masayoshi Okutsu wrote:
>>>
>>>
>>>
>>>     My proposal is:
>>>
>>>       * the existing interfaces should be kept fully compatible in
>>>         both binaries and source code.
>>>
>>> Can you define more precisely what you mean? Do you mean no API
>>> additions to the Locale class?
>>
>> No semantic or behavior changes to the existing methods. I saw the
>> toString() behavior change in the 6.1 Script table and the semantic
>> changes to the Locale constructors. We should just describe the
>> current behavior where it's necessary.
>>
>> It's OK to add new methods to Locale.
>>
>
> Let's assume an instance of Locale is created from language tag
> "zh-Hans-CN". The proposal suggests Locale#toString() to return
> "zh_Hans_CN". Do you think this behavior is problematic? Are you
> suggesting to add a new method, for example, Locale#getID() to return
> "zh_Hans_CN", but not to put the script "Hans" and extra separator "_"
> in the result of #toString()?

I think returning "zh_Hans_CN" may cause a problem. Let's think about
the following scenario.

(1) Applications A and B communicate through RMI (i.e., serialization).
(2) A is script-aware, while B may or may not be.
(3) B uses a 3rd party class library L which isn't script-aware.

Suppose both A and B are running in JDK 7, and that A sends a Locale
from "zh-Hans-CN" to B. B passes the given Locale to L. In this case, L
might be confused by "zh_Hans_CN" from toString().

We could say, "Don't do that." But if someone complains it's an
incompatible change in JDK 7, we will need to give up the new behavior
of toString(). If the complaint comes after the JDK 7 release, it will
be a tragedy...

Thanks,
Masayoshi

From y.umaoka at gmail.com Mon Feb 2 19:33:53 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 02 Feb 2009 22:33:53 -0500
Subject: [loc-en-dev] Comments on the locale enhancement proposal
In-Reply-To: <4987B5FA.1000404@sun.com>
References: <49758A95.2080909@sun.com> <146f39a80901201226u740f3a0bn7dde09b6f0817bcd@mail.gmail.com> <4976667D.2010904@sun.com> <498776F1.3070307@gmail.com> <4987B5FA.1000404@sun.com>
Message-ID: <4987BB21.6050001@gmail.com>

Masayoshi Okutsu wrote:
>> Let's assume an instance of Locale is created from language tag
>> "zh-Hans-CN".
>> The proposal suggests Locale#toString() to return
>> "zh_Hans_CN". Do you think this behavior is problematic? Are you
>> suggesting to add a new method, for example, Locale#getID() to return
>> "zh_Hans_CN", but not to put the script "Hans" and extra separator "_"
>> in the result of #toString()?
>
> I think returning "zh_Hans_CN" may cause a problem. Let's think about
> the following scenario.
>
> (1) Application A and B communicate through RMI (i.e., serialization).
> (2) A is script-aware, while B may be or may not.
> (3) B uses 3rd party class library L which isn't script-aware.
>
> Suppose both A and B are running in JDK 7, and that A sends a Locale
> from "zh-Hans-CN" to B. B passes the given Locale to L. In this case, L
> might be confused with "zh_Hans_CN" from toString().
>
> We could say, "Don't do that." But if someone complains it's an
> incompatible change in JDK 7, we will need to give up the new behavior
> of toString(). If the complaint comes after the JDK 7 release, it will
> be a tragedy...

I do not understand what you wrote above. Locale has 3 member fields -
language, country and variant. When an instance of Locale is serialized,
these fields are preserved in the serialized form. Even if we internally
add extra fields or change the internal representation of these fields,
we have to write out these 3 separate fields to support serialization
compatibility. In the scenario above, I would expect Locale("zh", "CN")
at the other end (pre-JDK7). Of course, it loses the script information,
which is not ideal, but at least the problem which you mentioned above
should not happen.

It is true that there might be an existing application depending on its
String representation and making an assumption - a locale string
consists of up to 3 fields delimited by "_" - the 1st one is language,
the 2nd one is country and the rest is variant. If we need to avoid
this, we could -

1. For toString, following the Java convention, we still want to write
   out all field information, including script and extensions. If we
   append this information to the end of the variant, I would expect the
   impact to be minimal.

2. With the change above, we want another method to return the formal
   "programmatic name". Probably we need to add getID() to do so. If we
   decide to go this way, we should update the documentation to
   encourage people to use getID() instead of toString() to get a
   canonical string representation of a Locale.

Although we could do such things to support full backward compatibility,
I prefer not to do so.

Am I missing anything?

Thanks,
Yoshito

From Masayoshi.Okutsu at Sun.COM Mon Feb 2 20:01:00 2009
From: Masayoshi.Okutsu at Sun.COM (Masayoshi Okutsu)
Date: Tue, 03 Feb 2009 13:01:00 +0900
Subject: [loc-en-dev] Comments on the locale enhancement proposal
In-Reply-To: <4987BB21.6050001@gmail.com>
References: <49758A95.2080909@sun.com> <146f39a80901201226u740f3a0bn7dde09b6f0817bcd@mail.gmail.com> <4976667D.2010904@sun.com> <498776F1.3070307@gmail.com> <4987B5FA.1000404@sun.com> <4987BB21.6050001@gmail.com>
Message-ID: <4987C17C.8090509@sun.com>

On 2/3/2009 12:33 PM, Yoshito Umaoka wrote:
> Masayoshi Okutsu wrote:
>
>>> Let's assume an instance of Locale is created from language tag
>>> "zh-Hans-CN". The proposal suggests Locale#toString() to return
>>> "zh_Hans_CN". Do you think this behavior is problematic? Are you
>>> suggesting to add a new method, for example, Locale#getID() to
>>> return "zh_Hans_CN", but not to put the script "Hans" and extra
>>> separator "_" in the result of #toString()?
>>
>> I think returning "zh_Hans_CN" may cause a problem. Let's think about
>> the following scenario.
>>
>> (1) Application A and B communicate through RMI (i.e., serialization).
>> (2) A is script-aware, while B may be or may not.
>> (3) B uses 3rd party class library L which isn't script-aware.
>>
>> Suppose both A and B are running in JDK 7, and that A sends a Locale
>> from "zh-Hans-CN" to B. B passes the given Locale to L. In this case,
>> L might be confused with "zh_Hans_CN" from toString().
>>
>> We could say, "Don't do that." But if someone complains it's an
>> incompatible change in JDK 7, we will need to give up the new
>> behavior of toString(). If the complaint comes after the JDK 7
>> release, it will be a tragedy...
>
> I do not understand what you wrote above. Locale has 3 member fields
> - language, country and variant. When an instance of Locale is being
> serialized, these fields are preserved in the serialized form. Even
> we internally add extra fields or change the internal representation
> of these fields, we have to write out these 3 separated fields for
> supporting serialization compatibility. In the scenario above, I
> would expect Locale("zh", "CN") at the other end (pre-JDK7).

If B were running in pre-JDK7, it's true that the deserialized Locale
would be zh_CN. But in my scenario both A and B are running in JDK 7. (B
might want to use some new APIs of JDK 7 while B needs to continue to
use library L.)

Thanks,
Masayoshi

> Of course, it loses the script information, which is not ideal, but
> at least the problem which you mentioned above should not happen.
>
> It is true that there might be an existing application depending on
> its String representation and making an assumption - A locale string
> consist from up to 3 fields delimited by "_" - 1st one is language,
> 2nd one is country and the rest is variant. If we need to avoid this
> - we could -
>
> 1. toString by the Java convention, we still want to write out entire
> fields information, including script, extensions... If we append
> these information to the end of variant, I would expect the impact is
> minimum.
>
> 2. With the change above, we want another method to return formal
> "programmatic name". Probably we need to add getID() to do so.
> If we
> decided to go this way, we should update the document to encourage
> people to use getID() instead of toString() to get a canonical string
> representation of a Locale.
>
> Although we could do such things for supporting full backward
> compatibility, I prefer not to do so.
>
> Am I missing anything?
>
> Thanks,
> Yoshito
>
>
>

From Masayoshi.Okutsu at Sun.COM Tue Feb 3 01:01:48 2009
From: Masayoshi.Okutsu at Sun.COM (Masayoshi Okutsu)
Date: Tue, 03 Feb 2009 18:01:48 +0900
Subject: [loc-en-dev] Comments on the locale enhancement proposal
In-Reply-To: <4987513F.1020303@gmail.com>
References: <4987513F.1020303@gmail.com>
Message-ID: <498807FC.9060803@sun.com>

I believe neither is perfect. My point is that the system should provide
simple mechanisms. If it's too inconvenient to put everything in
sr_Latn_RS, we could treat sr_Latn as an exception, like
sr_Latn_RS -> sr_Latn -> (root). JDK already has some exceptions, like
zh_HK->zh_TW->zh, anyway.

Thanks,
Masayoshi

On 2/3/2009 5:02 AM, Yoshito Umaoka wrote:
> Masayoshi Okutsu wrote:
> > On 1/21/2009 9:13 AM, Doug Felt wrote:
> >>
> >>
> >> On Tue, Jan 20, 2009 at 4:04 PM, Masayoshi Okutsu wrote:
> >>
> >>     I think it's obvious that we can't support old data with new
> >>     identifiers perfectly, like zh_Hans_CN and zh_Hant_CN. When we
> >>     can't support both, I prefer to define a simple algorithm to
> >>     produce a look-up sequences with minimum exceptions. [...]
> >>
> >>
> >> Can you define one so we can understand what cases you intend to
> >> handle and how?
> >
> > My preference is:
> >
> > (1) Treat language+script as a writingsystem which produces sequence
> > language_script -> language.
> >
>
> If we forget about legacy RB organization, it makes sense. However..
> see my comments for the next item.
>
> > (2) Apply the traditional sequence production rule to
> > writingsystem_country_variant
> >
> >     writingsystem_country_variant
> >     writingsystem_country
> >     writingsystem
> >
> > each of which produces language_script -> language. Therefore, the
> > entire sequence is:
> >
> >     language_script_country_variant
> >     language_country_variant
> >     language_script_country
> >     language_country
> >     language_script
> >     language
> >
> > For example, the sequence for zh_Hans_CN is:
> >
> >     zh_Hans_CN
> >     zh_CN
> >     zh_Hans
> >     zh
> >
> > while the proposed one is:
> >
> >     zh_Hans_CN
> >     zh_Hans
> >     zh_CN
> >     zh
> >
>
> zh_Hans_CN -> zh_CN -> zh_Hans -> zh may work OK for this specific
> case. However, when a country has two commonly used script, this
> order may not work as we expect. For example, let's see sr_Latn_RS.
> With your suggestion, the order of look up will be -
>
> sr_Latn_RS
> sr_RS
> sr_Latn
> sr
>
> In general, writing system is more important than country variant.
> For Serbian used in Serbia, Cyrillic script is likely used as a
> default script. Therefore, existing resource sr_RS likely has
> Cyrillic contents. Some may want to add Latn variant along with sr_RS
> and tag it sr_Latn_RS and add its parent sr_Latn, sr_Latn may be
> hidden by sr_RS by this lookup order.
>
> I think we're talking about which one matches better for sr_Latn_RS -
> sr_RS or sr_Latn. And, in this case, I think sr_Latn is the answer.
>
>
> > (3) If no script is given, the sequence is the same as the
> > traditional one.
> >
> >     language_country_variant
> >     language_country
> >     language
> >
>
> This does not work well unless we supply a default script for
> languages which has two or more script variants. For a request -
> zh_HK, this suggestion produces following candidates -
>
> zh_HK
> zh
>
> But, for people who want to distinguish scripts with the new framework
> may have resources zh_Hant_HK, but not zh_HK.
When a language has > commonly used multiple variants and one of them is dominant in a > country, the expanson - wrinting system (without script) -> writing > system with script is desired. > > > > (4) Exceptions are Norwegian and Hebrew. > > > > no_NO -> nb_NO -> no -> nb > > no_NO_NY -> nn_NO -> no_NO -> nn -> no > > nn_NO -> no_NO_NY -> nn -> no > > nb_NO -> no_NO -> nb -> no > > > > he_IL -> iw_IL -> he -> iw > > iw_IL -> he_IL -> iw -> he > > I think we should distinguish Norwegian case from Hebrew case. For > Hebrea, he is exactly equal to iw. For Norwegian, strictly speaking, > no could be nb or nn. I'm fine with the order of Hebrew above. But I > think Norwegian case should be handled differently. > > -Yoshito From Naoto.Sato at Sun.COM Tue Feb 3 11:53:34 2009 From: Naoto.Sato at Sun.COM (Naoto Sato) Date: Tue, 03 Feb 2009 11:53:34 -0800 Subject: [loc-en-dev] Comments on the locale enhancement proposal In-Reply-To: <4987BB21.6050001@gmail.com> References: <49758A95.2080909@sun.com> <146f39a80901201226u740f3a0bn7dde09b6f0817bcd@mail.gmail.com> <4976667D.2010904@sun.com> <498776F1.3070307@gmail.com> <4987B5FA.1000404@sun.com> <4987BB21.6050001@gmail.com> Message-ID: <4988A0BE.5050306@Sun.COM> Yoshito Umaoka wrote: > Masayoshi Okutsu wrote: > >>> Let's assume an instance of Locale is created from language tag >>> "zh-Hans-CN". The proposal suggest Locale#toString() to return >>> "zh_Hans_CN". Do you think this behavior is problematic? Are you >>> suggesting to add a new method, for exmaple, Locale#getID() to return >>> "zh_Hans_CN", but not to put the script "Hans" and extra separator >>> "_" in the result of #toString()? >> >> I think returning "zh_Hans_CN" may cause a problem. Let's think about >> the following scenario. >> >> (1) Application A and B communicate through RMI (i.e., serialization). >> (2) A is script-aware, while B may be or may not. >> (3) B uses 3rd party class library L which isn't script-aware. 
>> >> Suppose both A and B are running in JDK 7, and that A sends a Locale >> from "zh-Hans-CN" to B. B passes the given Locale to L. In this case, >> L might be confused with "zh_Hans_CN" from toString(). >> >> We could say, "Don't do that." But if someone complains it's an >> incompatible change in JDK 7, we will need to give up the new behavior >> of toString(). If the complaint comes after the JDK 7 release, it will >> be a tragedy... > > I do not understand what you wrote above. Locale has 3 member fields - > language, country and variant. When an instance of Locale is being > serialized, these fields are preserved in the serialized form. Even we > internally add extra fields or change the internal representation of > these fields, we have to write out these 3 separated fields for > supporting serialization compatibility. In the scenario above, I would > expect Locale("zh", "CN") at the other end (pre-JDK7). Of course, it > loses the script information, which is not ideal, but at least the > problem which you mentioned above should not happen. > > It is true that there might be an existing application depending on its > String representation and making an assumption - A locale string consist > from up to 3 fields delimitted by "_" - 1st one is language, 2nd one is > country and the rest is variant. If we need to avoid this - we could - > > 1. toString by the Java convension, we still want to write out entire > fields information, including script, extensions... If we append these > information to the end of variant, I would expect the impact is minimum. > > 2. With the change above, we want another method to return formal > "programmatic name". Probably we need to add getID() to do so. If we > decided to go this way, we should update the document to encourage > people to use getID() instead of toString() to get a canonical string > representation of a Locale. I was thinking that toString(IDType) in the draft spec was supposed to do this, wasn't it? 
Probably we should give this method a more descriptive name, like toCanonicalName() (I removed the "IDType" argument as we may end up supporting BCP47 only).

Naoto

>
> Although we could do such things for supporting full backward
> compatibility, I prefer not to do so.
>
> Am I missing anything?
>
> Thanks,
> Yoshito
>
>

--
Naoto Sato

From y.umaoka at gmail.com Tue Feb 3 12:00:03 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Tue, 03 Feb 2009 15:00:03 -0500
Subject: [loc-en-dev] Comments on the locale enhancement proposal
In-Reply-To: <4988A0BE.5050306@Sun.COM>
References: <49758A95.2080909@sun.com> <146f39a80901201226u740f3a0bn7dde09b6f0817bcd@mail.gmail.com> <4976667D.2010904@sun.com> <498776F1.3070307@gmail.com> <4987B5FA.1000404@sun.com> <4987BB21.6050001@gmail.com> <4988A0BE.5050306@Sun.COM>
Message-ID: <4988A243.2090104@gmail.com>

Sato-san,

> I was thinking that toString(IDType) in the draft spec was supposed to do this, wasn't it?
>
> Probably we should add more descriptive name to this method like toCanonicalName() (I removed "IDType" argument as we may end up supporting BCP47 only).

I have put the ID string problem on the agenda as one of the main topics for today's project call.

Thanks,
Yoshito

From y.umaoka at gmail.com Mon Feb 9 13:30:38 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Mon, 09 Feb 2009 16:30:38 -0500
Subject: [loc-en-dev] grandfathered language tags
Message-ID: <4990A07E.3020909@gmail.com>

I scanned the latest language tag registry -
http://www.iana.org/assignments/language-subtag-registry

There is a category - grandfathered. These tags are valid BCP47 language tags. Some of them are deprecated and have preferred "well-formed" mappings. Below is the full list of grandfathered tags currently available (File-Date: 2009-01-13).

art-lojban(deprecated) -> jbo
cel-gaulish
en-GB-oed
i-ami
i-bnn
i-default
i-enochian
i-hak(deprecated) -> zh-hakka
i-klingon(deprecated) -> tlh
i-lux(deprecated) -> lb
i-mingo
i-navajo(deprecated) -> nv
i-pwn
i-tao
i-tay
i-tsu
no-bok(deprecated) -> nb
no-nyn(deprecated) -> nn
sgn-BE-fr
sgn-BE-nl
sgn-CH-de
zh-cmn
zh-cmn-Hans
zh-cmn-Hant
zh-gan
zh-guoyu(deprecated) -> zh-cmn
zh-hakka
zh-min
zh-min-nan
zh-wuu
zh-xiang
zh-yue

I'm wondering if we should include support for grandfathered tags, especially the ones that do not have well-formed mappings. If we want to support such cases, I would expect the whole string to be handled as a single unit (no attempt to parse it into separate fields). Anyway, I'd like to get your input.

Thanks,
Yoshito

From y.umaoka at gmail.com Tue Feb 10 06:53:54 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Tue, 10 Feb 2009 09:53:54 -0500
Subject: [loc-en-dev] [Fwd: grandfathered language tags]
Message-ID: <49919502.7010205@gmail.com>

I scanned the data in the latest draft -
http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-09.txt

With this update (expected soon), the grandfathered list looks like below -

art-lojban(deprecated) -> jbo
cel-gaulish
en-GB-oed
i-ami(deprecated) -> ami
i-bnn(deprecated) -> bnn
i-default
i-enochian
i-hak(deprecated) -> hak
i-klingon(deprecated) -> tlh
i-lux(deprecated) -> lb
i-mingo
i-navajo(deprecated) -> nv
i-pwn(deprecated) -> pwn
i-tao(deprecated) -> tao
i-tay(deprecated) -> tay
i-tsu(deprecated) -> tsu
no-bok(deprecated) -> nb
no-nyn(deprecated) -> nn
sgn-BE-FR(deprecated) -> sfb
sgn-BE-NL(deprecated) -> vgt
sgn-CH-DE(deprecated) -> sgg
zh-guoyu(deprecated) -> cmn
zh-hakka(deprecated) -> hak
zh-min(deprecated)
zh-min-nan(deprecated) -> nan
zh-xiang(deprecated) -> hsn

-Yoshito

From Naoto.Sato at Sun.COM Thu Feb 12 13:23:56 2009
From: Naoto.Sato at Sun.COM (Naoto Sato)
Date: Thu, 12 Feb 2009 13:23:56 -0800
Subject: [loc-en-dev] [Fwd: grandfathered language tags]
In-Reply-To: <49919502.7010205@gmail.com>
References: <49919502.7010205@gmail.com>
Message-ID: <4994936C.2050005@Sun.COM>

I vote against supporting this for JDK7 because of schedule and resource constraints. I believe it could be added later. For JDK7 I think we should focus only on "langtag" in the following BCP47 ABNF.

    Language-Tag  = langtag
                  / privateuse             ; private use tag
                  / grandfathered          ; grandfathered registrations

Thanks,
Naoto

Yoshito Umaoka wrote:
> I scanned the data in the latest draft -
> http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-09.txt
>
> With this update (expected coming soon), the grandfathered list look
> like below -
>
> art-lojban(deprecated) -> jbo
> cel-gaulish
> en-GB-oed
> i-ami(deprecated) -> ami
> i-bnn(deprecated) -> bnn
> i-default
> i-enochian
> i-hak(deprecated) -> hak
> i-klingon(deprecated) -> tlh
> i-lux(deprecated) -> lb
> i-mingo
> i-navajo(deprecated) -> nv
> i-pwn(deprecated) -> pwn
> i-tao(deprecated) -> tao
> i-tay(deprecated) -> tay
> i-tsu(deprecated) -> tsu
> no-bok(deprecated) -> nb
> no-nyn(deprecated) -> nn
> sgn-BE-FR(deprecated) -> sfb
> sgn-BE-NL(deprecated) -> vgt
> sgn-CH-DE(deprecated) -> sgg
> zh-guoyu(deprecated) -> cmn
> zh-hakka(deprecated) -> hak
> zh-min(deprecated)
> zh-min-nan(deprecated) -> nan
> zh-xiang(deprecated) -> hsn
>
>
> -Yoshito
>
>
> -------- Original Message --------
> Subject: grandfathered language tags
> Date: Mon, 09 Feb 2009 16:30:38 -0500
> From: Yoshito Umaoka
> To: locale-enhancement-dev at openjdk.java.net
>
> I scanned the latest language tag registry -
> http://www.iana.org/assignments/language-subtag-registry
>
> There is a category - grandfathered. Use of these tags are valid in
> BCP47 language tag. Some of them were deprecated and its preferred
> "well-formed" mappings. Below is the full list of grandfathered tags
> currently available (File-Date: 2009-01-13).
> > art-lojban(deprecated) -> jbo > cel-gaulish > en-GB-oed > i-ami > i-bnn > i-default > i-enochian > i-hak(deprecated) -> zh-hakka > i-klingon(deprecated) -> tlh > i-lux(deprecated) -> lb > i-mingo > i-navajo(deprecated) -> nv > i-pwn > i-tao > i-tay > i-tsu > no-bok(deprecated) -> nb > no-nyn(deprecated) -> nn > sgn-BE-fr > sgn-BE-nl > sgn-CH-de > zh-cmn > zh-cmn-Hans > zh-cmn-Hant > zh-gan > zh-guoyu(deprecated) -> zh-cmn > zh-hakka > zh-min > zh-min-nan > zh-wuu > zh-xiang > zh-yue > > I'm wondering if we should include the support for grandfathered tags, > especially ones which do not have well-formed mappings. If we want to > support such cases, I would expect the whole string is handled as a > single unit (do not try to parse them out into separated fields.). > Anyway, I'd like to get your inputs. > > Thanks, > Yoshito > -- Naoto Sato From dougfelt at google.com Thu Feb 12 13:59:31 2009 From: dougfelt at google.com (Doug Felt) Date: Thu, 12 Feb 2009 13:59:31 -0800 Subject: [loc-en-dev] [Fwd: grandfathered language tags] In-Reply-To: <4994936C.2050005@Sun.COM> References: <49919502.7010205@gmail.com> <4994936C.2050005@Sun.COM> Message-ID: <146f39a80902121359m579451e5l44c54a3dfd1c208e@mail.gmail.com> So I guess you're suggesting, Naoto, that we throw an exception for these now? It does seem that we could accept and parse private use tags without doing canonicalization. That leaves only the grandfathered tags that we'd reject outright, and of course if there is a standards-body-defined list we can handle those internally as special cases. I'm not sure how much additional work this would be. It's my feeling that all the work is in the specification and API design (and in writing thorough tests), and that the actual implementation is relatively straightforward by comparison. Doug On Thu, Feb 12, 2009 at 1:23 PM, Naoto Sato wrote: > I vote against supporting this for JDK7 due to the schedule/resource. I > believe that this could be added later. 
For JDK7 I think we should only > focus on "langtag" in the following BCP47 ABNF. > > Language-Tag = langtag > / privateuse ; private use tag > / grandfathered ; grandfathered registrations > > Thanks, > Naoto > > > Yoshito Umaoka wrote: > >> I scanned the data in the latest draft - >> http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-09.txt >> >> With this update (expected coming soon), the grandfathered list look like >> below - >> >> art-lojban(deprecated) -> jbo >> cel-gaulish >> en-GB-oed >> i-ami(deprecated) -> ami >> i-bnn(deprecated) -> bnn >> i-default >> i-enochian >> i-hak(deprecated) -> hak >> i-klingon(deprecated) -> tlh >> i-lux(deprecated) -> lb >> i-mingo >> i-navajo(deprecated) -> nv >> i-pwn(deprecated) -> pwn >> i-tao(deprecated) -> tao >> i-tay(deprecated) -> tay >> i-tsu(deprecated) -> tsu >> no-bok(deprecated) -> nb >> no-nyn(deprecated) -> nn >> sgn-BE-FR(deprecated) -> sfb >> sgn-BE-NL(deprecated) -> vgt >> sgn-CH-DE(deprecated) -> sgg >> zh-guoyu(deprecated) -> cmn >> zh-hakka(deprecated) -> hak >> zh-min(deprecated) >> zh-min-nan(deprecated) -> nan >> zh-xiang(deprecated) -> hsn >> >> >> -Yoshito >> >> >> -------- Original Message -------- >> Subject: grandfathered language tags >> Date: Mon, 09 Feb 2009 16:30:38 -0500 >> From: Yoshito Umaoka >> To: locale-enhancement-dev at openjdk.java.net >> >> I scanned the latest language tag registry - >> http://www.iana.org/assignments/language-subtag-registry >> >> There is a category - grandfathered. Use of these tags are valid in >> BCP47 language tag. Some of them were deprecated and its preferred >> "well-formed" mappings. Below is the full list of grandfathered tags >> currently available (File-Date: 2009-01-13). 
>>
>> art-lojban(deprecated) -> jbo
>> cel-gaulish
>> en-GB-oed
>> i-ami
>> i-bnn
>> i-default
>> i-enochian
>> i-hak(deprecated) -> zh-hakka
>> i-klingon(deprecated) -> tlh
>> i-lux(deprecated) -> lb
>> i-mingo
>> i-navajo(deprecated) -> nv
>> i-pwn
>> i-tao
>> i-tay
>> i-tsu
>> no-bok(deprecated) -> nb
>> no-nyn(deprecated) -> nn
>> sgn-BE-fr
>> sgn-BE-nl
>> sgn-CH-de
>> zh-cmn
>> zh-cmn-Hans
>> zh-cmn-Hant
>> zh-gan
>> zh-guoyu(deprecated) -> zh-cmn
>> zh-hakka
>> zh-min
>> zh-min-nan
>> zh-wuu
>> zh-xiang
>> zh-yue
>>
>> I'm wondering if we should include the support for grandfathered tags,
>> especially ones which do not have well-formed mappings. If we want to
>> support such cases, I would expect the whole string is handled as a
>> single unit (do not try to parse them out into separated fields.).
>> Anyway, I'd like to get your inputs.
>>
>> Thanks,
>> Yoshito
>>
>>
>
> --
> Naoto Sato
>

From Naoto.Sato at Sun.COM Thu Feb 12 15:37:20 2009
From: Naoto.Sato at Sun.COM (Naoto Sato)
Date: Thu, 12 Feb 2009 15:37:20 -0800
Subject: [loc-en-dev] [Fwd: grandfathered language tags]
In-Reply-To: <146f39a80902121359m579451e5l44c54a3dfd1c208e@mail.gmail.com>
References: <49919502.7010205@gmail.com> <4994936C.2050005@Sun.COM> <146f39a80902121359m579451e5l44c54a3dfd1c208e@mail.gmail.com>
Message-ID: <4994B2B0.5000906@Sun.COM>

Or default to a default locale.

Anyway, the point I would like to make here is not technical, but more like a project management reason. To make JDK7, we should go for the minimal function set and we have to nail down the spec/resource/schedule ASAP. Since this project is not tied to any release at the moment, we will need to persuade the JDK7 planning team with a concrete plan, in order for them to include this feature.

Naoto Doug Felt wrote: > So I guess you're suggesting, Naoto, that we throw an exception for > these now? > > It does seem that we could accept and parse private use tags without > doing canonicalization. That leaves only the grandfathered tags that > we'd reject outright, and of course if there is a standards-body-defined > list we can handle those internally as special cases. I'm not sure how > much additional work this would be. It's my feeling that all the work > is in the specification and API design (and in writing thorough tests), > and that the actual implementation is relatively straightforward by > comparison. > > Doug > > On Thu, Feb 12, 2009 at 1:23 PM, Naoto Sato > wrote: > > I vote against supporting this for JDK7 due to the > schedule/resource. I believe that this could be added later. For > JDK7 I think we should only focus on "langtag" in the following > BCP47 ABNF. > > Language-Tag = langtag > / privateuse ; private use tag > / grandfathered ; grandfathered registrations > > Thanks, > Naoto > > > Yoshito Umaoka wrote: > > I scanned the data in the latest draft - > http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-09.txt > > With this update (expected coming soon), the grandfathered list > look like below - > > art-lojban(deprecated) -> jbo > cel-gaulish > en-GB-oed > i-ami(deprecated) -> ami > i-bnn(deprecated) -> bnn > i-default > i-enochian > i-hak(deprecated) -> hak > i-klingon(deprecated) -> tlh > i-lux(deprecated) -> lb > i-mingo > i-navajo(deprecated) -> nv > i-pwn(deprecated) -> pwn > i-tao(deprecated) -> tao > i-tay(deprecated) -> tay > i-tsu(deprecated) -> tsu > no-bok(deprecated) -> nb > no-nyn(deprecated) -> nn > sgn-BE-FR(deprecated) -> sfb > sgn-BE-NL(deprecated) -> vgt > sgn-CH-DE(deprecated) -> sgg > zh-guoyu(deprecated) -> cmn > zh-hakka(deprecated) -> hak > zh-min(deprecated) > zh-min-nan(deprecated) -> nan > zh-xiang(deprecated) -> hsn > > > -Yoshito > > > -------- Original Message -------- > Subject: grandfathered 
language tags > Date: Mon, 09 Feb 2009 16:30:38 -0500 > From: Yoshito Umaoka > > To: locale-enhancement-dev at openjdk.java.net > > > I scanned the latest language tag registry - > http://www.iana.org/assignments/language-subtag-registry > > There is a category - grandfathered. Use of these tags are valid in > BCP47 language tag. Some of them were deprecated and its preferred > "well-formed" mappings. Below is the full list of grandfathered > tags > currently available (File-Date: 2009-01-13). > > art-lojban(deprecated) -> jbo > cel-gaulish > en-GB-oed > i-ami > i-bnn > i-default > i-enochian > i-hak(deprecated) -> zh-hakka > i-klingon(deprecated) -> tlh > i-lux(deprecated) -> lb > i-mingo > i-navajo(deprecated) -> nv > i-pwn > i-tao > i-tay > i-tsu > no-bok(deprecated) -> nb > no-nyn(deprecated) -> nn > sgn-BE-fr > sgn-BE-nl > sgn-CH-de > zh-cmn > zh-cmn-Hans > zh-cmn-Hant > zh-gan > zh-guoyu(deprecated) -> zh-cmn > zh-hakka > zh-min > zh-min-nan > zh-wuu > zh-xiang > zh-yue > > I'm wondering if we should include the support for grandfathered > tags, > especially ones which do not have well-formed mappings. If we > want to > support such cases, I would expect the whole string is handled as a > single unit (do not try to parse them out into separated fields.). > Anyway, I'd like to get your inputs. > > Thanks, > Yoshito > > > > -- > Naoto Sato > > -- Naoto Sato From Masayoshi.Okutsu at Sun.COM Thu Feb 12 16:56:48 2009 From: Masayoshi.Okutsu at Sun.COM (Masayoshi Okutsu) Date: Fri, 13 Feb 2009 09:56:48 +0900 Subject: [loc-en-dev] [Fwd: grandfathered language tags] In-Reply-To: <4994B2B0.5000906@Sun.COM> References: <49919502.7010205@gmail.com> <4994936C.2050005@Sun.COM> <146f39a80902121359m579451e5l44c54a3dfd1c208e@mail.gmail.com> <4994B2B0.5000906@Sun.COM> Message-ID: <4994C550.9010509@sun.com> That is my concern too as I stated in my comments message. Once we integrate the minimal function set to JDK 7, the rest should be easier to handle. 
But it's hard to integrate big changes in the late development cycle. Masayoshi On 2/13/2009 8:37 AM, Naoto Sato wrote: > Or default to a default locale. > > Anyway, the point I would like to make here is not technical, but more > like a project management reason. To make JDK7, we should go for the > minimal function set and we have to nail down the > spec/resource/schedule ASAP. Since this project is not tied to any > release at the moment, we will need to persuade the JDK7 planning team > with a concrete plan, in order for them to include this feature. > > Naoto > > Doug Felt wrote: >> So I guess you're suggesting, Naoto, that we throw an exception for >> these now? >> >> It does seem that we could accept and parse private use tags without >> doing canonicalization. That leaves only the grandfathered tags that >> we'd reject outright, and of course if there is a >> standards-body-defined list we can handle those internally as special >> cases. I'm not sure how much additional work this would be. It's my >> feeling that all the work is in the specification and API design (and >> in writing thorough tests), and that the actual implementation is >> relatively straightforward by comparison. >> >> Doug >> >> On Thu, Feb 12, 2009 at 1:23 PM, Naoto Sato > > wrote: >> >> I vote against supporting this for JDK7 due to the >> schedule/resource. I believe that this could be added later. For >> JDK7 I think we should only focus on "langtag" in the following >> BCP47 ABNF. 
>> >> Language-Tag = langtag >> / privateuse ; private use tag >> / grandfathered ; grandfathered registrations >> >> Thanks, >> Naoto >> >> >> Yoshito Umaoka wrote: >> >> I scanned the data in the latest draft - >> >> http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-09.txt >> >> With this update (expected coming soon), the grandfathered list >> look like below - >> >> art-lojban(deprecated) -> jbo >> cel-gaulish >> en-GB-oed >> i-ami(deprecated) -> ami >> i-bnn(deprecated) -> bnn >> i-default >> i-enochian >> i-hak(deprecated) -> hak >> i-klingon(deprecated) -> tlh >> i-lux(deprecated) -> lb >> i-mingo >> i-navajo(deprecated) -> nv >> i-pwn(deprecated) -> pwn >> i-tao(deprecated) -> tao >> i-tay(deprecated) -> tay >> i-tsu(deprecated) -> tsu >> no-bok(deprecated) -> nb >> no-nyn(deprecated) -> nn >> sgn-BE-FR(deprecated) -> sfb >> sgn-BE-NL(deprecated) -> vgt >> sgn-CH-DE(deprecated) -> sgg >> zh-guoyu(deprecated) -> cmn >> zh-hakka(deprecated) -> hak >> zh-min(deprecated) >> zh-min-nan(deprecated) -> nan >> zh-xiang(deprecated) -> hsn >> >> >> -Yoshito >> >> >> -------- Original Message -------- >> Subject: grandfathered language tags >> Date: Mon, 09 Feb 2009 16:30:38 -0500 >> From: Yoshito Umaoka > > >> To: locale-enhancement-dev at openjdk.java.net >> >> >> I scanned the latest language tag registry - >> http://www.iana.org/assignments/language-subtag-registry >> >> There is a category - grandfathered. Use of these tags are >> valid in >> BCP47 language tag. Some of them were deprecated and its >> preferred >> "well-formed" mappings. Below is the full list of grandfathered >> tags >> currently available (File-Date: 2009-01-13). 
>> >> art-lojban(deprecated) -> jbo >> cel-gaulish >> en-GB-oed >> i-ami >> i-bnn >> i-default >> i-enochian >> i-hak(deprecated) -> zh-hakka >> i-klingon(deprecated) -> tlh >> i-lux(deprecated) -> lb >> i-mingo >> i-navajo(deprecated) -> nv >> i-pwn >> i-tao >> i-tay >> i-tsu >> no-bok(deprecated) -> nb >> no-nyn(deprecated) -> nn >> sgn-BE-fr >> sgn-BE-nl >> sgn-CH-de >> zh-cmn >> zh-cmn-Hans >> zh-cmn-Hant >> zh-gan >> zh-guoyu(deprecated) -> zh-cmn >> zh-hakka >> zh-min >> zh-min-nan >> zh-wuu >> zh-xiang >> zh-yue >> >> I'm wondering if we should include the support for grandfathered >> tags, >> especially ones which do not have well-formed mappings. If we >> want to >> support such cases, I would expect the whole string is >> handled as a >> single unit (do not try to parse them out into separated >> fields.). >> Anyway, I'd like to get your inputs. >> >> Thanks, >> Yoshito >> >> >> >> -- Naoto Sato >> >> > > From y.umaoka at gmail.com Thu Feb 12 22:58:58 2009 From: y.umaoka at gmail.com (Yoshito Umaoka) Date: Fri, 13 Feb 2009 01:58:58 -0500 Subject: [loc-en-dev] [Fwd: Re: [Fwd: grandfathered language tags]] Message-ID: <49951A32.9080009@gmail.com> oops, I responded to Okutsu-san, not to the ML.. -------- Original Message -------- Subject: Re: [loc-en-dev] [Fwd: grandfathered language tags] Date: Thu, 12 Feb 2009 20:16:09 -0500 From: Yoshito Umaoka To: Masayoshi Okutsu References: <49919502.7010205 at gmail.com> <4994936C.2050005 at Sun.COM> <146f39a80902121359m579451e5l44c54a3dfd1c208e at mail.gmail.com> <4994B2B0.5000906 at Sun.COM> <4994C550.9010509 at sun.com> I see. I think the minimal function which we should support is to guarantee canonical round trip between Locale and language tag reasonably and allow Java user to access each piece. We should focus this piece now. Keywording is more likely "semantic" interpretation of the extension part. 
The most demanding function for Java users would be creating a Locale object from a language tag without losing information. For grandfathered tags, I do not think we need to introduce any new structure. The list is relatively small and I think the list won't grow in future. Most of them are already deprecated and preferred form (which fits the standard scheme) is available (for example, art-lojban -> jbo). -Yoshito On Thu, Feb 12, 2009 at 7:56 PM, Masayoshi Okutsu wrote: > > That is my concern too as I stated in my comments message. Once we integrate the minimal function set to JDK 7, the rest should be easier to handle. But it's hard to integrate big changes in the late development cycle. > > Masayoshi > > On 2/13/2009 8:37 AM, Naoto Sato wrote: >> >> Or default to a default locale. >> >> Anyway, the point I would like to make here is not technical, but more like a project management reason. To make JDK7, we should go for the minimal function set and we have to nail down the spec/resource/schedule ASAP. Since this project is not tied to any release at the moment, we will need to persuade the JDK7 planning team with a concrete plan, in order for them to include this feature. >> >> Naoto >> >> Doug Felt wrote: >>> >>> So I guess you're suggesting, Naoto, that we throw an exception for these now? >>> >>> It does seem that we could accept and parse private use tags without doing canonicalization. That leaves only the grandfathered tags that we'd reject outright, and of course if there is a standards-body-defined list we can handle those internally as special cases. I'm not sure how much additional work this would be. It's my feeling that all the work is in the specification and API design (and in writing thorough tests), and that the actual implementation is relatively straightforward by comparison. >>> >>> Doug >>> >>> On Thu, Feb 12, 2009 at 1:23 PM, Naoto Sato > wrote: >>> >>> I vote against supporting this for JDK7 due to the >>> schedule/resource. 
I believe that this could be added later. For >>> JDK7 I think we should only focus on "langtag" in the following >>> BCP47 ABNF. >>> >>> Language-Tag = langtag >>> / privateuse ; private use tag >>> / grandfathered ; grandfathered registrations >>> >>> Thanks, >>> Naoto >>> >>> >>> Yoshito Umaoka wrote: >>> >>> I scanned the data in the latest draft - >>> http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-09.txt >>> >>> With this update (expected coming soon), the grandfathered list >>> look like below - >>> >>> art-lojban(deprecated) -> jbo >>> cel-gaulish >>> en-GB-oed >>> i-ami(deprecated) -> ami >>> i-bnn(deprecated) -> bnn >>> i-default >>> i-enochian >>> i-hak(deprecated) -> hak >>> i-klingon(deprecated) -> tlh >>> i-lux(deprecated) -> lb >>> i-mingo >>> i-navajo(deprecated) -> nv >>> i-pwn(deprecated) -> pwn >>> i-tao(deprecated) -> tao >>> i-tay(deprecated) -> tay >>> i-tsu(deprecated) -> tsu >>> no-bok(deprecated) -> nb >>> no-nyn(deprecated) -> nn >>> sgn-BE-FR(deprecated) -> sfb >>> sgn-BE-NL(deprecated) -> vgt >>> sgn-CH-DE(deprecated) -> sgg >>> zh-guoyu(deprecated) -> cmn >>> zh-hakka(deprecated) -> hak >>> zh-min(deprecated) >>> zh-min-nan(deprecated) -> nan >>> zh-xiang(deprecated) -> hsn >>> >>> >>> -Yoshito >>> >>> >>> -------- Original Message -------- >>> Subject: grandfathered language tags >>> Date: Mon, 09 Feb 2009 16:30:38 -0500 >>> From: Yoshito Umaoka >> > >>> To: locale-enhancement-dev at openjdk.java.net >>> >>> >>> I scanned the latest language tag registry - >>> http://www.iana.org/assignments/language-subtag-registry >>> >>> There is a category - grandfathered. Use of these tags are valid in >>> BCP47 language tag. Some of them were deprecated and its preferred >>> "well-formed" mappings. Below is the full list of grandfathered >>> tags >>> currently available (File-Date: 2009-01-13). 
>>>
>>> art-lojban(deprecated) -> jbo
>>> cel-gaulish
>>> en-GB-oed
>>> i-ami
>>> i-bnn
>>> i-default
>>> i-enochian
>>> i-hak(deprecated) -> zh-hakka
>>> i-klingon(deprecated) -> tlh
>>> i-lux(deprecated) -> lb
>>> i-mingo
>>> i-navajo(deprecated) -> nv
>>> i-pwn
>>> i-tao
>>> i-tay
>>> i-tsu
>>> no-bok(deprecated) -> nb
>>> no-nyn(deprecated) -> nn
>>> sgn-BE-fr
>>> sgn-BE-nl
>>> sgn-CH-de
>>> zh-cmn
>>> zh-cmn-Hans
>>> zh-cmn-Hant
>>> zh-gan
>>> zh-guoyu(deprecated) -> zh-cmn
>>> zh-hakka
>>> zh-min
>>> zh-min-nan
>>> zh-wuu
>>> zh-xiang
>>> zh-yue
>>>
>>> I'm wondering if we should include the support for grandfathered
>>> tags,
>>> especially ones which do not have well-formed mappings. If we
>>> want to
>>> support such cases, I would expect the whole string is handled as a
>>> single unit (do not try to parse them out into separated fields.).
>>> Anyway, I'd like to get your inputs.
>>>
>>> Thanks,
>>> Yoshito
>>>
>>>
>>>
>>> -- Naoto Sato
>>>
>>>
>>
>>

From y.umaoka at gmail.com Wed Feb 18 08:25:46 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Wed, 18 Feb 2009 11:25:46 -0500
Subject: [loc-en-dev] JavaDoc for proposed APIs
Message-ID: <499C368A.8050105@gmail.com>

Hi folks,

I generated JavaDoc for the proposed APIs (updated from the original design proposal) and posted it here -> http://sites.google.com/site/openjdklocale/apis

Unfortunately, Google Sites does not allow me to post plain HTML, so the links in this JavaDoc do not work well. All APIs not available in JDK6 are marked as [New API]. APIs which have already been agreed on by members are marked as [New API - agreed].

I'm implementing these APIs, and Doug and I will merge the changes to the OpenJDK Locale Enhancement repository by March 3 for evaluation. Please feel free to post your feedback to this ML. The next bi-weekly call is scheduled for March 3. We need to finalize the APIs by March 17 in order to submit the CCC for JDK7.

-Yoshito From y.umaoka at gmail.com Wed Feb 18 09:02:46 2009 From: y.umaoka at gmail.com (Yoshito Umaoka) Date: Wed, 18 Feb 2009 12:02:46 -0500 Subject: [loc-en-dev] About Locale#toString() Message-ID: <499C3F36.3090906@gmail.com> In the bi-weekly project call, we agreed not to change the behavior of toString(). This implies that you won't get any new field information (such as script and extensions) returned by toString(). In the current proposed API set, we have toLanguageTag(), which returns syntactically valid BCP47 language tag string. However, subtags in a BCP47 language tag is delimited by hyphen('-') instead of underscore('_'). One of the goals in this project is to include script field value involved in the resource bundle lookup inheritance. Therefore, I would like to have a method creating a locale string delimited by underscore, which can be used for resource bundle suffix. (Technically, this can be achieved by composing the string by appending getLanguage(), getScript()...) This is a common operation and I think it is worth having such API. I'm considering following three APIs for the purpose. String toFullString() Locale getBaseLocale() Locale getParent() toFullString() is a variant of toString() to generate a string representation of Locale, but also include script and extensions if they are available. getBaseLocale() returns a Locale (proposed implementation is to return a singleton) without locale extensions. Locale extensions is not used for resource bundle lookup. getParent() returns a parent Locale (proposed implementation is to return a singleton). A parent locale represent a locale omitting the most right field of its child locale. For example, Locale("en") is a parent locale of Locale("en", "US"). If a locale has a variant field and the variant field contains one or more underscore characters, then its parent still have variant field, but excluding the substring after the last underscore. 
For example, Locale("en", "US", "NYC") is the parent locale of Locale("en", "US", "NYC_JFK").

With these three APIs, the resource bundle collects key-value pairs as in the pseudo code below -

    Locale target; // the resolved Locale
    Locale loc = target;
    ResourceBundleImpl child = null;
    while (true) {
        ResourceBundleImpl aBundle = loadFrom(bundleBaseName + "_" + loc.getBaseLocale().toFullString());
        if (child != null) {
            child.parent = aBundle;
        }
        loc = loc.getParent();
        if (loc == null) {
            // Locale.ROOT.getParent() returns null
            break;
        }
        child = aBundle;
    }

Do you think we should have such APIs? Also, if you do, do you want to make them public or keep them package local/private?

-Yoshito

From dougfelt at google.com Wed Feb 18 11:31:59 2009
From: dougfelt at google.com (Doug Felt)
Date: Wed, 18 Feb 2009 11:31:59 -0800
Subject: [loc-en-dev] About Locale#toString()
In-Reply-To: <499C3F36.3090906@gmail.com>
References: <499C3F36.3090906@gmail.com>
Message-ID: <146f39a80902181131u2784864cw8333331e39a95445@mail.gmail.com>

What is the motivation for treating the variant field underscore-by-underscore rather than as an entire unit?

Doug

On Wed, Feb 18, 2009 at 9:02 AM, Yoshito Umaoka wrote:

> In the bi-weekly project call, we agreed not to change the behavior of
> toString(). This implies that you won't get any new field information (such
> as script and extensions) returned by toString().
>
> In the current proposed API set, we have toLanguageTag(), which returns
> syntactically valid BCP47 language tag string. However, subtags in a BCP47
> language tag is delimited by hyphen('-') instead of underscore('_'). One of
> the goals in this project is to include script field value involved in the
> resource bundle lookup inheritance. Therefore, I would like to have a method
> creating a locale string delimited by underscore, which can be used for
> resource bundle suffix. (Technically, this can be achieved by composing the
> string by appending getLanguage(), getScript()...)
This is a common > operation and I think it is worth having such API. > > I'm considering following three APIs for the purpose. > > String toFullString() > Locale getBaseLocale() > Locale getParent() > > > toFullString() is a variant of toString() to generate a string > representation of Locale, but also include script and extensions if they are > available. > > getBaseLocale() returns a Locale (proposed implementation is to return a > singleton) without locale extensions. Locale extensions is not used for > resource bundle lookup. > > getParent() returns a parent Locale (proposed implementation is to return a > singleton). A parent locale represent a locale omitting the most right > field of its child locale. For example, Locale("en") is a parent locale of > Locale("en", "US"). If a locale has a variant field and the variant field > contains one or more underscore characters, then its parent still have > variant field, but excluding the substring after the last underscore. For > example, Locale("en", "US", "NYC") is a parent locale of Locale("en", "US", > "NYC_JFK") > > With these 3 APIs, the resource bundle is collecting key-value pairs with > the pseudo code below - > > Locale target; // the resolved Locale > Locale loc = target; > ResourceBundleImpl child = null; > while (true) { > ResourceBundleImpl aBundle = loadFrom(bundleBaseName + "_" + > loc.getBaseLocale().toFullString()); > if (child != null) { > child.parent = aBundle; > } > loc = loc.getParent(); > if (loc == null) { > // Locale.ROOT.getParent() returns null > break; > } > child = aBundle; > } > > Do you think we should have such APIs? Also, if you do, do you want to > make them public or keep them package local/private? > > -Yoshito > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/locale-enhancement-dev/attachments/20090218/2c0f2af2/attachment.html

From y.umaoka at gmail.com Wed Feb 18 12:50:01 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Wed, 18 Feb 2009 15:50:01 -0500
Subject: [loc-en-dev] About Locale#toString()
In-Reply-To: <146f39a80902181131u2784864cw8333331e39a95445@mail.gmail.com>
References: <499C3F36.3090906@gmail.com> <146f39a80902181131u2784864cw8333331e39a95445@mail.gmail.com>
Message-ID: <499C7479.7050505@gmail.com>

Thank you for pointing out the problem. I mixed up the ICU implementation with the JDK's. You're right, and we do not need to support inheritance within the variant component.

Actually, this opens up another question. BCP47 itself supports multiple variant values in its syntax definition. The IANA language tag directory contains the following variants -

    1606nict 1694acad 1901 1959acad 1994 1996 arevela arevmda
    baku1926 biske boont fonipa fonupa kkcor lipaw monoton
    nedis njiva osojs pinyin polyton rozaj scotland scouse
    solba tarask uccor ucrcor valencia wadegile

These values are actually constrained by prefix. For example,

    %%
    Type: variant
    Subtag: scotland
    Description: Scottish Standard English
    Added: 2007-08-31
    Prefix: en
    %%
    Type: variant
    Subtag: scouse
    Description: Scouse
    Added: 2006-09-18
    Prefix: en
    Comments: English Liverpudlian dialect known as 'Scouse'

So "en-scotland" is a valid language tag, "en-scouse" is also a valid language tag, but I'm not sure about "en-scotland-scouse" or "en-scouse-scotland". Practically, such a combination does not make sense, but I could not find any description that says these are invalid. If they are valid, the language range "en-scotland" could match "en-scotland-scouse" under the RFC4647 part of BCP47.

Anyway, I cannot imagine any practical language tag that has multiple IANA-registered variants, so I think it's probably OK to process variant as a single field and not support inheritance within a variant.
I'll check LTRU folks if multiple variant values are currently allowed. -Yoshito Doug Felt wrote: > What is the motivation for treating the variant field > underscore-by-underscore rather than as an entire unit? > > Doug > > On Wed, Feb 18, 2009 at 9:02 AM, Yoshito Umaoka > wrote: > > In the bi-weekly project call, we agreed not to change the behavior > of toString(). This implies that you won't get any new field > information (such as script and extensions) returned by toString(). > > In the current proposed API set, we have toLanguageTag(), which > returns syntactically valid BCP47 language tag string. However, > subtags in a BCP47 language tag is delimited by hyphen('-') instead > of underscore('_'). One of the goals in this project is to include > script field value involved in the resource bundle lookup > inheritance. Therefore, I would like to have a method creating a > locale string delimited by underscore, which can be used for > resource bundle suffix. (Technically, this can be achieved by > composing the string by appending getLanguage(), getScript()...) > This is a common operation and I think it is worth having such API. > > I'm considering following three APIs for the purpose. > > String toFullString() > Locale getBaseLocale() > Locale getParent() > > > toFullString() is a variant of toString() to generate a string > representation of Locale, but also include script and extensions if > they are available. > > getBaseLocale() returns a Locale (proposed implementation is to > return a singleton) without locale extensions. Locale extensions is > not used for resource bundle lookup. > > getParent() returns a parent Locale (proposed implementation is to > return a singleton). A parent locale represent a locale omitting > the most right field of its child locale. For example, Locale("en") > is a parent locale of Locale("en", "US"). 
If a locale has a variant
> field and the variant field contains one or more underscore
> characters, then its parent still have variant field, but excluding
> the substring after the last underscore. For example, Locale("en",
> "US", "NYC") is a parent locale of Locale("en", "US", "NYC_JFK")
>
> With these 3 APIs, the resource bundle is collecting key-value pairs
> with the pseudo code below -
>
>     Locale target; // the resolved Locale
>     Locale loc = target;
>     ResourceBundleImpl child = null;
>     while (true) {
>         ResourceBundleImpl aBundle = loadFrom(bundleBaseName + "_" +
>             loc.getBaseLocale().toFullString());
>         if (child != null) {
>             child.parent = aBundle;
>         }
>         loc = loc.getParent();
>         if (loc == null) {
>             // Locale.ROOT.getParent() returns null
>             break;
>         }
>         child = aBundle;
>     }
>
> Do you think we should have such APIs? Also, if you do, do you want
> to make them public or keep them package local/private?
>
> -Yoshito
>
>

From y.umaoka at gmail.com Thu Feb 26 09:19:24 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Thu, 26 Feb 2009 12:19:24 -0500
Subject: [loc-en-dev] [Fwd: Re: About Locale#toString()]
Message-ID: <49A6CF1C.8080401@gmail.com>

BCP47 allows multiple variants to be used at the same time, and there are actually some valid use cases. For example, sl-si-1994-rozaj-solba is a valid language tag (the variants 1994, rozaj and solba are registered in the IANA registry) and it actually makes sense. According to Mark Davis, the IANA registry data does not enforce the order of variant subtags, and the following language tags are semantically equivalent to the example above -

    sl-si-1994-solba-rozaj
    sl-si-rozaj-1994-solba
    sl-si-rozaj-solba-1994
    sl-si-solba-1994-rozaj
    sl-si-solba-rozaj-1994

For matching, the order of variants matters. I cannot find any description of a canonical ordering for multiple variants in RFC4646bis-20. But when I asked Mark about this, he said that the canonical representation should be in natural alphabetical order, that is, 1994-rozaj-solba.
(Mark, is this correct?)

I think locale inheritance within variants should be compatible with the current JDK's behavior, and that is consistent with the language tag matching part of BCP47.

-Yoshito

-------- Original Message --------
Subject: Re: [loc-en-dev] About Locale#toString()
Date: Wed, 18 Feb 2009 15:50:01 -0500
From: Yoshito Umaoka
To: Doug Felt
CC: locale-enhancement-dev at openjdk.java.net
References: <499C3F36.3090906 at gmail.com> <146f39a80902181131u2784864cw8333331e39a95445 at mail.gmail.com>

Thank you for pointing out the problem. I mixed up the ICU implementation with the JDK's. You're right, and we do not need to support inheritance within the variant component.

Actually, this opens up another question. BCP47 itself supports multiple variant values in its syntax definition. The IANA language tag directory contains the following variants -

    1606nict 1694acad 1901 1959acad 1994 1996 arevela arevmda
    baku1926 biske boont fonipa fonupa kkcor lipaw monoton
    nedis njiva osojs pinyin polyton rozaj scotland scouse
    solba tarask uccor ucrcor valencia wadegile

These values are actually constrained by prefix. For example,

    %%
    Type: variant
    Subtag: scotland
    Description: Scottish Standard English
    Added: 2007-08-31
    Prefix: en
    %%
    Type: variant
    Subtag: scouse
    Description: Scouse
    Added: 2006-09-18
    Prefix: en
    Comments: English Liverpudlian dialect known as 'Scouse'

So "en-scotland" is a valid language tag, "en-scouse" is also a valid language tag, but I'm not sure about "en-scotland-scouse" or "en-scouse-scotland". Practically, such a combination does not make sense, but I could not find any description that says these are invalid. If they are valid, the language range "en-scotland" could match "en-scotland-scouse" under the RFC4647 part of BCP47.

Anyway, I cannot imagine any practical language tag that has multiple IANA-registered variants, so I think it's probably OK to process variant as a single field and not support inheritance within a variant.
I'll check LTRU folks if multiple variant values are currently allowed. -Yoshito Doug Felt wrote: > What is the motivation for treating the variant field > underscore-by-underscore rather than as an entire unit? > > Doug > > On Wed, Feb 18, 2009 at 9:02 AM, Yoshito Umaoka > wrote: > > In the bi-weekly project call, we agreed not to change the behavior > of toString(). This implies that you won't get any new field > information (such as script and extensions) returned by toString(). > > In the current proposed API set, we have toLanguageTag(), which > returns syntactically valid BCP47 language tag string. However, > subtags in a BCP47 language tag is delimited by hyphen('-') instead > of underscore('_'). One of the goals in this project is to include > script field value involved in the resource bundle lookup > inheritance. Therefore, I would like to have a method creating a > locale string delimited by underscore, which can be used for > resource bundle suffix. (Technically, this can be achieved by > composing the string by appending getLanguage(), getScript()...) > This is a common operation and I think it is worth having such API. > > I'm considering following three APIs for the purpose. > > String toFullString() > Locale getBaseLocale() > Locale getParent() > > > toFullString() is a variant of toString() to generate a string > representation of Locale, but also include script and extensions if > they are available. > > getBaseLocale() returns a Locale (proposed implementation is to > return a singleton) without locale extensions. Locale extensions is > not used for resource bundle lookup. > > getParent() returns a parent Locale (proposed implementation is to > return a singleton). A parent locale represent a locale omitting > the most right field of its child locale. For example, Locale("en") > is a parent locale of Locale("en", "US"). 
If a locale has a variant
> field and the variant field contains one or more underscore
> characters, then its parent still have variant field, but excluding
> the substring after the last underscore. For example, Locale("en",
> "US", "NYC") is a parent locale of Locale("en", "US", "NYC_JFK")
>
> With these 3 APIs, the resource bundle is collecting key-value pairs
> with the pseudo code below -
>
>     Locale target; // the resolved Locale
>     Locale loc = target;
>     ResourceBundleImpl child = null;
>     while (true) {
>         ResourceBundleImpl aBundle = loadFrom(bundleBaseName + "_" +
>             loc.getBaseLocale().toFullString());
>         if (child != null) {
>             child.parent = aBundle;
>         }
>         loc = loc.getParent();
>         if (loc == null) {
>             // Locale.ROOT.getParent() returns null
>             break;
>         }
>         child = aBundle;
>     }
>
> Do you think we should have such APIs? Also, if you do, do you want
> to make them public or keep them package local/private?
>
> -Yoshito
>
>

From y.umaoka at gmail.com Fri Feb 27 08:59:08 2009
From: y.umaoka at gmail.com (Yoshito Umaoka)
Date: Fri, 27 Feb 2009 11:59:08 -0500
Subject: [loc-en-dev] Updated JavaDoc
Message-ID: <49A81BDC.30305@gmail.com>

http://sites.google.com/site/openjdklocale/apis

I updated the JavaDoc based on things found during the implementation and some input from Doug.

- Locale.LocaleExtension / Locale.LocaleKeywords are gone.
- Setters in LocaleBuilder no longer throw an exception for malformed input.
- LocaleBuilder#create() - ignores malformed fields.
- LocaleBuilder#createStrict() - returns null when any malformed field is found.

-Yoshito

From y.umaoka at gmail.com Fri Feb 27 14:28:14 2009
From: y.umaoka at gmail.com (y.umaoka at gmail.com)
Date: Fri, 27 Feb 2009 22:28:14 +0000
Subject: [loc-en-dev] Should we createa a Locale instance with language code "he"?
Message-ID: <001636164ad11565670463edfc08@google.com>

The Java Locale constructor always maps the language code "he" to "iw" (the same is done for a couple of other cases).
This implementation forces people to tag resources with the deprecated language code "iw". We discussed this and thought it could be resolved in the locale resource lookup code.

ResourceBundle.Control#getCandidateList represents the suggested lookup order. If we allow a Locale instance to store "he" as its language code, we could create a lookup candidate list like - he_IL -> iw_IL -> he -> iw. Here is the design question -

1. The current JDK implementation returns the same result for new Locale("he") and new Locale("iw"). If we allow people to create an instance of Locale that holds the language code "he" internally (in other words, Locale#getLanguage() returns "he" instead of "iw"), we can resolve the candidate list problem easily. But in this case, new Locale("iw").equals(new Locale("he")) should still return true.

2. Another option is for the JDK's own ResourceBundle lookup code to search both he and iw, while the candidate list only contains Locales with "iw". This implementation has no impact on existing apps. But doing something special in the JDK implementation is somewhat ugly.

3. Yet another option is to keep the Locale constructor's behavior (silently mapping "he" -> "iw"), but let LocaleBuilder create a Locale instance with the language code "he". This has less impact on existing applications, but like option 1 above, we need special handling for testing equality.

I think 1 (changing the Locale constructor's behavior) is problematic. I guess we should go for either 2 or 3. I personally prefer the 3rd option. What do you think?

-Yoshito

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/locale-enhancement-dev/attachments/20090227/b33b79bf/attachment.html

From Naoto.Sato at Sun.COM Fri Feb 27 14:57:15 2009
From: Naoto.Sato at Sun.COM (Naoto Sato)
Date: Fri, 27 Feb 2009 14:57:15 -0800
Subject: [loc-en-dev] Should we createa a Locale instance with language code "he"?
In-Reply-To: <001636164ad11565670463edfc08@google.com>
References: <001636164ad11565670463edfc08@google.com>
Message-ID: <49A86FCB.9010106@Sun.COM>

I prefer option 3, too. Actually, I tried option 1 back in the JDK6 beta, and I got a regression report from a customer at the end of the JDK6 development period, so I had to back it out.

BTW, how do you search for the "he" resource in option 2 without changing the Locale class?

Naoto

y.umaoka at gmail.com wrote:
> Java Locale constructor always maps language code "he" to "iw" (also
> done for a couple of more cases). This implementation forces people to
> tag resources with the deprecated language tag "iw". We discussed about
> this and we thought it could be resolved by the locale resource look up
> code.
>
> ResourceBundle.Control#getCandidateList represents the suggested look up
> order. If we allow a Locale instance to store "he" as language code, we
> could create a lookup candidate list like - he_IL -> iw_IL -> he -> iw.
> Here is the design question -
>
> 1. The current JDK implementation returns the same result for new
> Locale("he") and new Locale("iw"). If we allow people to create an
> instance of Locale with language code "he" internally (in other words,
> Locale#getLanguage() returns "he", instead of "iw"), we can resolve the
> candidate list problem easily. But, in this case, new
> Locale("iw").equals(new Locale("he")) should return true.
>
> 2. Another option is to JDK's own ResourceBundle lookup code to search
> both he and iw, but the candidate list only contains Locales with "iw".
> This implementation has no impacts to the existing apps. But doing
> special thing in JDK implementation is somewhat ugly.
>
> 3. Yet another option is to keep the Locale constructor's behavior
> (silently maps "he" -> "iw"), but LocaleBuilder to allow people to
> create a Locale instance with language code "he". This has less impact
> to existing applications, but like option 1 above, we need special
> handling for testing equality.
>
> I think 1 (change the Locale constructor's behavior) is problematic. I
> guess we should either go for 2 or 3. I personally prefer the 3rd
> option. What do you think?
>
>
> -Yoshito

--
Naoto Sato
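
To make the lookup order discussed in this thread concrete (he_IL -> iw_IL -> he -> iw), here is a minimal Java sketch. It works on plain strings rather than Locale objects, because how Locale should treat "he" and "iw" is exactly the open question. The class name HeIwCandidates and the methods alias() and candidates() are hypothetical names made up for this illustration; they are not part of the proposed API or of the JDK. (In the JDK itself, the hook that supplies the candidate list is ResourceBundle.Control#getCandidateLocales.) Only the he/iw pair is handled here; other deprecated codes would presumably be added to alias() in the same way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: builds resource bundle suffix candidates
// so that the modern code "he" is tried before the deprecated code "iw".
public class HeIwCandidates {

    // Returns the deprecated counterpart of a modern language code,
    // or null when this sketch knows of no alias for it.
    static String alias(String language) {
        switch (language) {
            case "he": return "iw";
            default:   return null;
        }
    }

    // Produces e.g. he_IL -> iw_IL -> he -> iw for ("he", "IL").
    // A request for the deprecated "iw" is normalized to "he" first,
    // so both spellings yield the same candidate list.
    static List<String> candidates(String language, String country) {
        String primary = language.equals("iw") ? "he" : language;
        String secondary = alias(primary);

        List<String> result = new ArrayList<>();
        if (!country.isEmpty()) {
            result.add(primary + "_" + country);
            if (secondary != null) {
                result.add(secondary + "_" + country);
            }
        }
        result.add(primary);
        if (secondary != null) {
            result.add(secondary);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("he", "IL")); // [he_IL, iw_IL, he, iw]
        System.out.println(candidates("en", "US")); // [en_US, en]
    }
}
```

Because both "he" and "iw" produce the same list, a lookup built this way would find resources tagged with either code, which is the effect options 2 and 3 above are both aiming for; where that logic should live (inside the JDK lookup code or behind LocaleBuilder) is left open by this sketch.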