[loc-en-dev] -u- extension API - necessary updates?

Wed Jun 30 13:30:24 PDT 2010

In the Locale Enhancement repository, we have the following proposed
APIs supporting the -u- extension:

In java.util.Locale

public Set<String> getUnicodeLocaleKeys()
public String getUnicodeLocaleType(String key)

In java.util.Locale.Builder

public Builder setUnicodeLocaleKeyword(String key, String type)

The following Unicode locale extension features were not in our scope
last year.

1. type represented by multiple subtags
2. key without type
3. attribute

To support 1, it looks like we do not need any changes to the proposal.
A Unicode locale extension keyword may have a type represented by
multiple subtags. For example, "en-u-vt-0061-0065" is a valid example
defined by the current LDML specification (see
http://www.unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data).

However, this does not mean that a keyword may have multiple types. In
this example, 0061 and 0065 are not two different types - instead,
"0061-0065" is a single type. Thus, getUnicodeLocaleType("vt") can
simply return "0061-0065". To set the type using Builder,
setUnicodeLocaleKeyword("vt", "0061-0065") is sufficient.
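
As a rough sketch of what this looks like to a caller, assuming a JDK
in which the proposed methods are available with the signatures listed
above. Locale.forLanguageTag, Locale.toLanguageTag and the class name
are assumptions used only to parse and print the tag; they are not part
of the APIs discussed in this message.

```java
import java.util.Locale;

class MultiSubtagTypeSketch {
    // Reads the type of the "vt" keyword from a parsed language tag.
    // The whole "0061-0065" sequence is expected back as one type string.
    static String typeFromTag() {
        Locale loc = Locale.forLanguageTag("en-u-vt-0061-0065");
        return loc.getUnicodeLocaleType("vt");
    }

    // Sets the same keyword through Builder, passing the multi-subtag
    // type as a single String argument.
    static String tagFromBuilder() {
        Locale loc = new Locale.Builder()
                .setLanguage("en")
                .setUnicodeLocaleKeyword("vt", "0061-0065")
                .build();
        return loc.toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(typeFromTag());
        System.out.println(tagFromBuilder());
    }
}
```

No new method is needed in either direction: the existing String
parameter and return value carry the hyphen-separated type as is.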

To support 2, there is a minor conflict with the current proposal.
Assume we have a Locale represented by the pseudo language tag
"en-u-aa-bb-ccc". getUnicodeLocaleKeys() will return a set containing
"aa" and "bb". getUnicodeLocaleType(String key) currently returns null
when the input key is not available, and it returns a non-empty type
string when the key is available. We could use the empty string "" to
represent a typeless keyword - that is, getUnicodeLocaleType("aa") would
return "" in this example.
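
A minimal sketch of the three cases a caller would have to distinguish,
assuming the empty-string convention described above is adopted.
Locale.forLanguageTag, the class name and the describe() helper are
illustrative assumptions, not part of the proposal.

```java
import java.util.Locale;
import java.util.Set;

class TypelessKeySketch {
    // "aa" is a typeless keyword; "bb" has the type "ccc".
    static final Locale LOC = Locale.forLanguageTag("en-u-aa-bb-ccc");

    // Maps the three possible results of getUnicodeLocaleType to text:
    // null (key absent), "" (key present, no type), or a real type.
    static String describe(String key) {
        String type = LOC.getUnicodeLocaleType(key);
        if (type == null) {
            return key + ": not present";
        }
        if (type.isEmpty()) {
            return key + ": present, no type";
        }
        return key + ": type " + type;
    }

    public static void main(String[] args) {
        Set<String> keys = LOC.getUnicodeLocaleKeys();
        System.out.println(keys);
        System.out.println(describe("aa"));
        System.out.println(describe("bb"));
        System.out.println(describe("zz"));
    }
}
```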

The remaining question is the Builder API -
setUnicodeLocaleKeyword(String key, String type). For now, an empty
string type indicates that the keyword itself is removed from the
current state, and a null type throws NPE. We could change the API to
use null for deletion instead of the empty string. For example, if a
Builder internally represents "en-u-aa-bb-ccc",
setUnicodeLocaleKeyword("aa", null) will remove the typeless keyword
"aa", and the internal representation will be changed to "en-u-bb-ccc"
after the call. Also, starting from the same "en-u-aa-bb-ccc" state,
setUnicodeLocaleKeyword("dd", "") will append a typeless keyword "dd" to
the internal state (that is, "en-u-aa-bb-ccc-dd").
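
To make the two calls concrete, here is a small sketch. It assumes a
Builder that already follows the changed convention (null removes the
keyword, "" adds a typeless keyword); with the current "empty string
removes" behavior the results would differ. Builder.setLanguageTag and
Locale.toLanguageTag are assumptions used only to seed and print the
state.

```java
import java.util.Locale;

class KeywordRemovalSketch {
    // Starts from "en-u-aa-bb-ccc" and removes the typeless keyword "aa"
    // by passing null as the type.
    static String removeTypeless() {
        return new Locale.Builder()
                .setLanguageTag("en-u-aa-bb-ccc")
                .setUnicodeLocaleKeyword("aa", null)
                .build()
                .toLanguageTag();
    }

    // Starts from the same state and adds a typeless keyword "dd"
    // by passing the empty string as the type.
    static String addTypeless() {
        return new Locale.Builder()
                .setLanguageTag("en-u-aa-bb-ccc")
                .setUnicodeLocaleKeyword("dd", "")
                .build()
                .toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(removeTypeless());
        System.out.println(addTypeless());
    }
}
```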

Note that, by the current design, setXXX with an empty string removes a
field from the Builder. If we really want to change the semantics of
empty string and null in setUnicodeLocaleKeyword, a consistent policy
should be applied to the other setters (for example, setLanguage(null)
to remove the language field, instead of setLanguage("")).

To support 3, we could treat an attribute as a keyless keyword. But that
makes getUnicodeLocaleKeys()/getUnicodeLocaleType(String key) a little
awkward. Technically, we could still design them that way
(getUnicodeLocaleKeys() would include an empty string in the returned
set, and getUnicodeLocaleType("") would return the attribute subtags).
I think adding an extra API dedicated to attributes is cleaner.

public Set<String> getUnicodeLocaleAttributes()

The same idea is applicable to Builder. APIs dedicated to
adding/removing a Unicode locale attribute, like the ones below, may be
added:

public Builder addUnicodeLocaleAttribute(String attribute)
public Builder removeUnicodeLocaleAttribute(String attribute)
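
A short sketch of how the dedicated attribute methods would read in
calling code, assuming they are added with exactly the signatures
above. Locale.toLanguageTag and the class name are illustrative
assumptions.

```java
import java.util.Locale;
import java.util.Set;

class AttributeBuilderSketch {
    // Adds two attributes, then removes one of them again, so that
    // only "def" remains on the built Locale.
    static Locale build() {
        return new Locale.Builder()
                .setLanguage("en")
                .addUnicodeLocaleAttribute("abc")
                .addUnicodeLocaleAttribute("def")
                .removeUnicodeLocaleAttribute("abc")
                .build();
    }

    public static void main(String[] args) {
        Locale loc = build();
        Set<String> attributes = loc.getUnicodeLocaleAttributes();
        System.out.println(attributes);
        System.out.println(loc.toLanguageTag());
    }
}
```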

Another possibility is to set multiple attributes as a whole.

public Builder setUnicodeLocaleAttributes(String attributes)

For example, to set the attributes "abc" and "def", call
setUnicodeLocaleAttributes("abc-def"). If we go for this approach, we do
not need a "remove" method. A tricky part is that the order of
attributes does not matter, so, semantically, "abc-def" and "def-abc"
are the same. Since we do not want to introduce unnecessary variations,
we should clearly state that the order of attributes is not preserved.
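
One way to avoid such variations is to normalize the attribute string
into a sorted set as soon as it is received. The sketch below is purely
hypothetical: parseAttributes() and canonical() are made-up helper
names, sorting alphabetically is an assumed policy, and no subtag
validation is done.

```java
import java.util.SortedSet;
import java.util.TreeSet;

class AttributeNormalizationSketch {
    // Splits a hyphen-separated attribute string into a sorted set, so
    // the caller's ordering (and any duplicates) are discarded.
    static SortedSet<String> parseAttributes(String attributes) {
        SortedSet<String> result = new TreeSet<>();
        if (attributes.isEmpty()) {
            return result;
        }
        for (String attribute : attributes.split("-")) {
            result.add(attribute);
        }
        return result;
    }

    // Joins the sorted attributes back into one canonical string.
    static String canonical(String attributes) {
        return String.join("-", parseAttributes(attributes));
    }

    public static void main(String[] args) {
        System.out.println(canonical("abc-def"));
        System.out.println(canonical("def-abc"));
    }
}
```

With this kind of normalization both inputs end up in the same internal
state, which is why the input order cannot be promised back to the
caller.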

Another question related to this is Set<String> vs. List<String>.
Currently, getUnicodeLocaleKeys() returns Set<String> (actually, an
unmodifiable set). Semantically, the order of keywords does not matter.
"u-ca-japanese-cu-jpy" is equivalent to "u-cu-jpy-ca-japanese". But we
do use a canonical order (alphabetical order of keys) when a Locale is
converted to a language tag. From this point of view, List<String> might
be more appropriate. This also applies to attributes. If we agree to
support Unicode locale attributes with dedicated APIs like the ones
above, we should decide whether the collection of attributes should be
represented by Set or List.
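
The trade-off can be seen in a small sketch. It assumes that
Locale.forLanguageTag and Locale.toLanguageTag are available, that the
keys are emitted in alphabetical order as described above, and that two
Locales differing only in keyword order compare equal. The "en" prefix
and sortedKeys() helper are added for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class KeywordOrderSketch {
    static final Locale A = Locale.forLanguageTag("en-u-ca-japanese-cu-jpy");
    static final Locale B = Locale.forLanguageTag("en-u-cu-jpy-ca-japanese");

    // Keyword order in the input tag is not expected to affect equality.
    static boolean sameLocale() {
        return A.equals(B);
    }

    // Converting back to a tag is expected to use the canonical order.
    static String canonicalTag() {
        return B.toLanguageTag();
    }

    // A caller who wants the canonical order from a Set return type can
    // sort the keys explicitly, as done here with a TreeSet.
    static List<String> sortedKeys() {
        return new ArrayList<>(new TreeSet<>(B.getUnicodeLocaleKeys()));
    }

    public static void main(String[] args) {
        System.out.println(sameLocale());
        System.out.println(canonicalTag());
        System.out.println(sortedKeys());
    }
}
```

Returning a Set keeps the "order does not matter" semantics visible in
the type, at the cost of callers sorting for themselves; a List would
expose the canonical order directly.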

Overall, supporting the full specification of the Unicode locale
extension does not look too bad. Some may ask why we would add APIs
dedicated to things which are not yet used. We could defer adding the
"attribute" APIs, in which case attributes could only be set via
Builder.setExtension('u', "...."). But the necessary API addition is
pretty minimal, and with these APIs the design looks more complete.
Therefore, if we are going to include any 'u' extension specific APIs,
I want to do it completely, including attribute support.
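
For comparison, this is roughly what the deferred option would look
like, where an attribute can only go in through the raw extension
string. It is a sketch under assumptions: Locale.getExtension and
Locale.toLanguageTag are not among the APIs listed in this message, and
"abc" is a made-up attribute.

```java
import java.util.Locale;

class SetExtensionSketch {
    // Sets the whole 'u' extension at once: attribute "abc" followed by
    // the keyword "ca" with type "japanese".
    static Locale build() {
        return new Locale.Builder()
                .setLanguage("en")
                .setExtension('u', "abc-ca-japanese")
                .build();
    }

    public static void main(String[] args) {
        Locale loc = build();
        // Without dedicated attribute APIs, the caller has to pick the
        // attribute out of the raw extension string by hand.
        System.out.println(loc.getExtension('u'));
        System.out.println(loc.getUnicodeLocaleType("ca"));
        System.out.println(loc.toLanguageTag());
    }
}
```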

-Yoshito