[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Stuart Marks stuart.marks at oracle.com
Thu Jul 27 23:50:19 UTC 2017


Hi Ivan,

I think this is an interesting addition to explore for the platform. Sorting 
this way is pretty subtle and the need for it comes up frequently, so it seems 
valuable. There are some issues that warrant further discussion, though. 
Briefly:

1. Should this be in the JDK?
2. What do other platforms do?
3. Does it have the right semantics?

Discussion follows.


--


1. Should this be in the JDK?

I think a case for it can be made. It does appear in other platforms (see below) 
and there are also several third party implementations available in a variety of 
environments. So people do have a need for this feature. It's also complicated 
enough to have generated lots of discussions and articles on the topic. The 
questions are whether this can be specified sufficiently clearly, and whether it 
provides value for the use cases for which it's intended. It's not obvious 
whether this is true, but I believe a case can and should be made.


2. What do other platforms do?

It was a bit difficult to find information about this, since it doesn't seem to 
have a well established name. Words like "natural", "logical", "alphanum", and 
"mixed" tend to be used. I eventually found these:

Windows XP StrCmpLogicalW [1]:

     Compares two Unicode strings. Digits in the strings are considered as
     numerical content rather than text. This test is not case-sensitive.

Windows 7 CompareStringEx SORT_DIGITSASNUMBERS [2]

     Treat digits as numbers during sorting, for example, sort "2" before "10".

     (Note: this API takes a locale parameter.)

Mac NSString localizedStandardCompare [3]:

     This method should be used whenever file names or other strings are
     presented in lists and tables where Finder-like sorting is appropriate.
     The exact sorting behavior of this method is different under different
     locales and may be changed in future releases. This method uses the
     current locale.

     (Note: I observe that the Mac Finder sorting is case insensitive.)

Swift String.localizedStandardCompare [4]

     Compares strings as sorted by the Finder.

There are also third party, open source implementations available for a variety 
of platforms. These aren't too hard to find; this Coding Horror article [5] has 
a discussion of the issues and links to several implementations. Of particular 
note is the short Python implementation embedded in the article.

There is also the Node package javascript-natural-sort [6] which is one of 
several (of course) similar packages on NPM. This one seems popular, with more 
than 200,000 downloads in the past month.

Finally, there is mention of "numericOrdering" in this Unicode TR [7] but it 
seems fairly non-specific, and I don't know how it applies. The point here is 
that the Unicode community is aware of this kind of ordering, and various 
libraries that implement Unicode collation, such as ICU [8], might have 
implementations that can provide guidance.


3. Does it have the right semantics?

I think you can see from the above survey that there is no standard. 
Different implementations are all over the map, and most if not all are 
ill-specified. But what is useful about the survey is that it shows 
what people are actually using, and that there are things that many of them have 
in common. Two items jump out at me:

  - case-insensitive comparison (sometimes optional)
  - locale-specific collation

The obvious (but simplistic) thing to do is to provide variations of this API 
that can use String.CASE_INSENSITIVE_ORDER. Note however that its doc 
specifically states that it provides "unsatisfactory ordering for certain 
locales" and directs the reader to the Collator class, which does take locale 
into account.
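
To make the difference concrete, here is a minimal sketch. The class and 
method names are mine and purely illustrative; only 
String.CASE_INSENSITIVE_ORDER and java.text.Collator come from the JDK. It 
sorts the same three names both ways, and the accented name lands in a 
different place:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative only: contrasts per-char case folding with locale collation.
final class CaseVersusCollation {

    // Returns a sorted copy of the input using the given comparator.
    static List<String> sorted(List<String> in, Comparator<? super String> cmp) {
        return in.stream().sorted(cmp).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "\u00e9mile" is "emile" with an acute accent on the first letter.
        List<String> names = Arrays.asList("Zoe", "\u00e9mile", "Emma");

        // Compares char by char after case folding, so U+00E9 sorts after 'z':
        // Emma, Zoe, then the accented name.
        System.out.println(sorted(names, String.CASE_INSENSITIVE_ORDER));

        // Collator treats the accent as a secondary difference on 'e':
        // the accented name, Emma, then Zoe.
        System.out.println(sorted(names, Collator.getInstance(Locale.FRENCH)));
    }
}
```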

Now, I'm sensitive about making this more complicated than necessary. But the 
point of a "logical" comparator is to provide something that makes sense to humans 
looking at the result, which implies that locale-specific collation needs to be 
applied, as well as case insensitivity (which itself is locale-specific). So I 
think consideration of those is indeed necessary.

I don't know what the API should look like. The java.text.Collator class 
implements Comparator. This suggests the possibility of an API that allows a 
"downstream" comparator to be specified, to which ordering of certain 
subsequences can be delegated.
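
As a rough sketch of that shape, and not the API in the webrev: the 
hypothetical logical() factory below splits each string into runs of ASCII 
digits and non-digits, compares digit runs by numeric value, and hands 
everything else to whatever downstream comparator the caller supplies. All 
names here are assumptions made up for illustration.

```java
import java.math.BigInteger;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a "logical" comparator with a downstream comparator.
final class LogicalOrderSketch {

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';   // ASCII digits only, as an assumption
    }

    // Splits s into maximal digit / non-digit runs: "file10b" -> [file, 10, b].
    static List<String> runs(String s) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            if (i == s.length() || isDigit(s.charAt(i)) != isDigit(s.charAt(i - 1))) {
                out.add(s.substring(start, i));
                start = i;
            }
        }
        return out;
    }

    // Digit runs compare numerically; all other pairs go to the downstream.
    static Comparator<String> logical(Comparator<? super String> downstream) {
        return (a, b) -> {
            List<String> ra = runs(a);
            List<String> rb = runs(b);
            int n = Math.min(ra.size(), rb.size());
            for (int i = 0; i < n; i++) {
                String x = ra.get(i);
                String y = rb.get(i);
                int c = (isDigit(x.charAt(0)) && isDigit(y.charAt(0)))
                        ? new BigInteger(x).compareTo(new BigInteger(y)) // no overflow
                        : downstream.compare(x, y);
                if (c != 0) {
                    return c;
                }
            }
            // All shared runs tie: the string with fewer runs sorts first.
            return Integer.compare(ra.size(), rb.size());
        };
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.SECONDARY);   // ignore case differences
        List<String> files = new ArrayList<>(Arrays.asList("file10", "file2", "File1"));
        files.sort(logical(collator));
        System.out.println(files);
    }
}
```

Because Collator implements Comparator, it can be passed as the downstream 
directly, and String.CASE_INSENSITIVE_ORDER works the same way. Note that this 
sketch treats "a01" and "a1" as equal; how to order leading zeros is one of the 
semantic questions a real specification would have to settle.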

s'marks



[1] https://msdn.microsoft.com/en-us/library/windows/desktop/bb759947(v=vs.85).aspx

[2] https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761(v=vs.85).aspx

[3] https://developer.apple.com/documentation/foundation/nsstring/1409742-localizedstandardcompare?language=objc

[4] https://developer.apple.com/documentation/swift/string/1408384-localizedstandardcompare

[5] https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

[6] https://www.npmjs.com/package/javascript-natural-sort

[7] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options

[8] http://userguide.icu-project.org/collation



On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
> Hello!
>
> It is a proposal to provide a String comparator, which will pay attention to the
> numbers embedded into the strings (should they present).
>
> This proposal was initially discussed back in 2014 and seemed to bring some
> interest from the community:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html
>
> In the latest webrev two methods are added to the public API:
> j.u.Comparator.comparingNumerically() and
> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
> The regression test is extended to exercise this new comparator.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>
> Comments, suggestions are very welcome!
>

