[loc-en-dev] -u- extension vs. other extensions

Andy Staudacher staudacher at google.com
Wed Jun 30 23:26:41 PDT 2010


On Wed, Jun 30, 2010 at 10:51 AM, Yoshito Umaoka <y.umaoka at gmail.com> wrote:

> Hi all,
>
> We agreed that we validate the syntax of subtags, but do not validate the
> codes themselves in Java. In other words, the proposed implementation won't
> reject language subtag "xx", although the use of such a code is not valid in
> a BCP 47 language tag.
>
> In BCP47, extension is defined as:
>
> extension = singleton 1*("-" (2*8alphanum))
>
> When the previous proposal was written last year, the Unicode locale
> extension ('u' extension) only allowed key/type subtag pairs.  In BNF,
>
> unicode_locale_extensions = sep "u" 1*(sep keyword)
> keyword = key sep type
> key = 2alphanum
> type = 3*8alphanum
>
> This requires special syntax validation for the 'u' extension.  For example,
>
> 1. extension "a-abc-de" is syntactically valid
> 2. extension "u-abc-de" is syntactically invalid, because it does not
> satisfy the requirements for the 'u' extension: a key (2alphanum) must
> immediately follow the singleton, and each key must have a type
> (3*8alphanum).
>
>
> 'u' extension was updated in the final spec as below:
>
> unicode_locale_extensions = sep "u" (
>                                           1*(sep keyword)
>                                           / 1*(sep attribute) *(sep keyword)
>                                         )
> keyword = key [sep type]
> key = 2alphanum
> type = 3*8alphanum *(sep 3*8alphanum)
> attribute = 3*8alphanum
>
>
> This change - 1. subtags in the form of 3*8alphanum before the first
> occurrence of a key (2alphanum) are interpreted as attributes, 2. a key
> subtag might not be followed by a type, 3. a type might be represented by
> multiple subtags in the form of 3*8alphanum - actually eliminates the special
> syntax requirements for the 'u' extension.  With the updated specification,
> any extension subtags that satisfy the BCP47 extension syntax also satisfy
> the 'u' extension syntax.  For example, "u-abc-de" is interpreted as
> attribute "abc" and typeless key "de". (Note that this specific tag is still
> not valid, because "abc" is not a registered attribute and "de" is not a
> known key.)
>
> With this change, we do not need any special coding for handling the 'u'
> extension in the API - Builder#setExtension.  This also means that we do not
> need to add a special implementation dedicated to the 'u' extension even if
> we do not add the Unicode locale extension APIs (such as
> Builder#setUnicodeLocaleKeyword).
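
To make the updated grammar concrete, here is a minimal sketch of how the subtags after the "u" singleton could be split into attributes and keywords. The class name UnicodeExtensionSketch and its fields are hypothetical and only for illustration; they are not part of the proposed API, and no registry lookup is attempted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the proposed API.
class UnicodeExtensionSketch {
    final List<String> attributes = new ArrayList<>();
    // key -> type; "" for a typeless key, multi-subtag types joined with "-"
    final Map<String, String> keywords = new LinkedHashMap<>();

    // Input is the part after the "u" singleton, e.g. "abc-de" for "u-abc-de".
    static UnicodeExtensionSketch parse(String subtags) {
        UnicodeExtensionSketch result = new UnicodeExtensionSketch();
        String currentKey = null;
        StringBuilder currentType = new StringBuilder();
        for (String s : subtags.split("-", -1)) {
            // Only the generic extension subtag syntax (2*8alphanum) is checked.
            if (!s.matches("[0-9A-Za-z]{2,8}")) {
                throw new IllegalArgumentException("Ill-formed subtag: " + s);
            }
            if (s.length() == 2) {
                // A 2-character subtag always starts a new keyword.
                if (currentKey != null) {
                    result.keywords.put(currentKey, currentType.toString());
                }
                currentKey = s;
                currentType.setLength(0);
            } else if (currentKey == null) {
                // 3*8alphanum before the first key is an attribute.
                result.attributes.add(s);
            } else {
                // 3*8alphanum after a key is part of that key's type.
                if (currentType.length() > 0) {
                    currentType.append('-');
                }
                currentType.append(s);
            }
        }
        if (currentKey != null) {
            result.keywords.put(currentKey, currentType.toString());
        }
        return result;
    }
}
```

With this sketch, parse("abc-de") yields the attribute "abc" and the key "de" with an empty type, matching the interpretation described above. Note that the only rejection path is the generic 2*8alphanum check.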


Indeed. Great insight! The "u" singleton can be followed by alphanum{2} or
alphanum{3,8}, and any alphanum{3,8} can (but doesn't have to) be followed by
an alphanum{3,8} or an alphanum{2}, and vice versa. I.e. "u" must be followed
by 1*("-" (2*8alphanum)), which is the same syntax any BCP 47 extension must
satisfy.
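
A minimal sketch of that single check, assuming a regular expression is an acceptable way to express it. The class and method names are made up for illustration. The singleton character class excludes "x", since BCP 47 reserves it for private use.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ExtensionSyntaxSketch {
    // extension = singleton 1*("-" (2*8alphanum))
    // singleton = any alphanum except "x"/"X" (reserved for private use)
    private static final Pattern EXTENSION =
        Pattern.compile("[0-9A-WY-Za-wy-z](-[0-9A-Za-z]{2,8})+");

    // The same check applies to "u" and to every other singleton.
    static boolean isWellFormedExtension(String ext) {
        return EXTENSION.matcher(ext).matches();
    }
}
```

Under this sketch, "a-abc-de" and "u-abc-de" are treated identically; neither needs a 'u'-specific branch.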

Let's document this as a series of test cases with some comments for the
tests. I could take on this task if you want me to.
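
As a rough starting point, such test cases might look like the sketch below. It assumes the builder is exposed as java.util.Locale.Builder with setExtension(char, String) signalling ill-formed input via IllformedLocaleException (the shape the API took in Java 7). The class SetExtensionCases and the choice of cases are only illustrative, and the expected values reflect my reading of the syntax above.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Illustrative test sketch; class name and cases are assumptions.
class SetExtensionCases {
    // Returns true if Builder#setExtension accepts the value as well-formed.
    static boolean accepts(char singleton, String value) {
        try {
            new Locale.Builder().setExtension(singleton, value);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    static void check(char singleton, String value, boolean expected) {
        boolean actual = accepts(singleton, value);
        System.out.println(singleton + "-" + value + ": "
            + (actual == expected ? "ok" : "MISMATCH, accepted=" + actual));
    }

    public static void main(String[] args) {
        check('a', "abc-de", true);      // generic extension
        check('u', "abc-de", true);      // attribute "abc", typeless key "de"
        check('u', "de-abc", true);      // key "de" with type "abc"
        check('u', "de", true);          // typeless key only
        check('u', "abc", true);         // attribute only
        check('u', "a", false);          // subtag shorter than 2 characters
        check('u', "abcdefghi", false);  // subtag longer than 8 characters
    }
}
```

The failing cases are the same for 'u' as for any other singleton, which is the point we would want the comments in the real tests to spell out.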

Thanks,
 - Andy