[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator
Stuart Marks
stuart.marks at oracle.com
Thu Jul 27 23:50:19 UTC 2017
Hi Ivan,
I think this is an interesting avenue to explore adding to the platform. The
idea of sorting this way is pretty subtle and it seems to come up frequently, so
it seems valuable. There are some issues that warrant further discussion,
though. Briefly:
1. Should this be in the JDK?
2. What do other platforms do?
3. Does it have the right semantics?
Discussion follows.
--
1. Should this be in the JDK?
I think a case for it can be made. It does appear in other platforms (see below)
and there are also several third party implementations available in a variety of
environments. So people do have a need for this feature. It's also complicated
enough to have generated lots of discussions and articles on the topic. The
questions are whether this can be specified sufficiently clearly, and whether it
provides value for the use cases for which it's intended. It's not obvious
whether this is true, but I believe a case can and should be made.
2. What do other platforms do?
It was a bit difficult to find information about this, since it doesn't seem to
have a well established name. Words like "natural", "logical", "alphanum", and
"mixed" tend to be used. I eventually found these:
Windows XP StrCmpLogicalW [1]:
Compares two Unicode strings. Digits in the strings are considered as
numerical content rather than text. This test is not case-sensitive.
Windows 7 CompareStringEx SORT_DIGITSASNUMBERS [2]:
Treat digits as numbers during sorting, for example, sort "2" before "10".
(Note: this API takes a locale parameter.)
Mac NSString localizedStandardCompare [3]:
This method should be used whenever file names or other strings are
presented in lists and tables where Finder-like sorting is appropriate.
The exact sorting behavior of this method is different under different
locales and may be changed in future releases. This method uses the
current locale.
(Note: I observe that the Mac Finder sorting is case insensitive.)
Swift String.localizedStandardCompare [4]:
Compares strings as sorted by the Finder.
There are also third party, open source implementations available for a variety
of platforms. These aren't too hard to find; this Coding Horror article [5] has
a discussion of the issues and links to several implementations. Of particular
note is the short Python implementation embedded in the article.
There is also the Node package javascript-natural-sort [6] which is one of
several (of course) similar packages on NPM. This one seems popular, with more
than 200,000 downloads in the past month.
Finally, there is mention of "numericOrdering" in this Unicode TR [7] but it
seems fairly non-specific, and I don't know how it applies. The point here is
that the Unicode community is aware of this kind of ordering, and various
libraries that implement Unicode collation, such as ICU [8], might have
implementations that can provide guidance.
3. Does it have the right semantics?
I think you can see from the above survey that there is no standard:
implementations are all over the map, and most if not all are poorly specified.
What is useful about the survey is that it shows what people are actually
using, and what many of the implementations have in common. Two items jump out
at me:
- case-insensitive comparison (sometimes optional)
- locale-specific collation
The obvious (but simplistic) thing to do is to provide variations of this API
that can use String.CASE_INSENSITIVE_ORDER. Note however that its doc
specifically states that it provides "unsatisfactory ordering for certain
locales" and directs the reader to the Collator class, which does take locale
into account.
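To make the difference concrete, here is a minimal sketch that sorts the same list both ways. The class and method names (CaseOrderDemo, sortedIgnoringCase, sortedWithCollator) are made up for illustration; only String.CASE_INSENSITIVE_ORDER and java.text.Collator are real JDK APIs.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of any proposal.
public class CaseOrderDemo {

    // Sorts a copy using String.CASE_INSENSITIVE_ORDER, which folds case
    // character by character and knows nothing about locales.
    static List<String> sortedIgnoringCase(List<String> in) {
        List<String> out = new ArrayList<>(in);
        out.sort(String.CASE_INSENSITIVE_ORDER);
        return out;
    }

    // Sorts a copy using a locale-specific Collator. PRIMARY strength
    // ignores case and accent differences between base letters.
    static List<String> sortedWithCollator(List<String> in, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        List<String> out = new ArrayList<>(in);
        out.sort(collator);
        return out;
    }

    public static void main(String[] args) {
        // "\u00C4rger" is "Aerger" spelled with A-umlaut.
        List<String> words = List.of("zebra", "\u00C4rger", "apple");
        System.out.println(sortedIgnoringCase(words));
        System.out.println(sortedWithCollator(words, Locale.GERMAN));
    }
}
```

With CASE_INSENSITIVE_ORDER the umlauted word lands after "zebra", because the comparison is by char value after case folding. With a German Collator I would expect it to sort next to "apple", which is what a human reader would look for.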
Now, I'm sensitive about making this more complicated than necessary. But the
point of a "logical" comparator is to provide something that makes sense to humans
looking at the result, which implies that locale-specific collation needs to be
applied, as well as case insensitivity (which itself is locale-specific). So I
think consideration of those is indeed necessary.
I don't know what the API should look like. The java.text.Collator class
implements Comparator. This suggests the possibility of an API that allows a
"downstream" comparator to be specified, to which ordering of certain
subsequences can be delegated.
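As a rough illustration of the "downstream" idea, and not a proposal for the actual API, something like the following could work. The class name LogicalOrder and its comparing() factory are hypothetical and are not the methods in the webrev. The sketch also makes several simplifying assumptions: only ASCII digits count as numbers, leading zeros are ignored, and a digit run sorts before a text run when the two meet at the same position.

```java
import java.util.Comparator;

// Hypothetical sketch: digit runs compare by numeric value, all other
// runs are delegated to a caller-supplied downstream comparator.
public final class LogicalOrder {
    private LogicalOrder() {}

    public static Comparator<String> comparing(Comparator<? super String> downstream) {
        return (a, b) -> {
            int i = 0, j = 0;
            while (i < a.length() && j < b.length()) {
                int iEnd = runEnd(a, i), jEnd = runEnd(b, j);
                String x = a.substring(i, iEnd), y = b.substring(j, jEnd);
                boolean xNum = isDigit(x.charAt(0)), yNum = isDigit(y.charAt(0));
                int c;
                if (xNum && yNum) {
                    c = compareDigits(x, y);
                } else if (xNum != yNum) {
                    c = xNum ? -1 : 1;           // assumption: digits before text
                } else {
                    c = downstream.compare(x, y); // text runs go downstream
                }
                if (c != 0) return c;
                i = iEnd;
                j = jEnd;
            }
            // Whichever string still has runs left sorts last.
            return Integer.compare(a.length() - i, b.length() - j);
        };
    }

    // End index (exclusive) of the run of same-kind chars starting at 'start'.
    private static int runEnd(String s, int start) {
        boolean digit = isDigit(s.charAt(start));
        int k = start + 1;
        while (k < s.length() && isDigit(s.charAt(k)) == digit) k++;
        return k;
    }

    // Assumption: ASCII digits only; non-ASCII digits are treated as text.
    private static boolean isDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    // Compares digit strings of any length by value, ignoring leading zeros.
    private static int compareDigits(String x, String y) {
        x = stripZeros(x);
        y = stripZeros(y);
        if (x.length() != y.length()) return Integer.compare(x.length(), y.length());
        return Integer.signum(x.compareTo(y));
    }

    private static String stripZeros(String s) {
        int k = 0;
        while (k < s.length() - 1 && s.charAt(k) == '0') k++;
        return s.substring(k);
    }
}
```

Since Collator implements Comparator<Object>, a caller could pass either LogicalOrder.comparing(String.CASE_INSENSITIVE_ORDER) or LogicalOrder.comparing(Collator.getInstance(locale)). With the former, "file10.txt", "File2.txt", "file1.txt" sort as file1.txt, File2.txt, file10.txt. Note that this sketch treats "01" and "1" as equal, so it is not consistent with equals; a real specification would have to decide how to break such ties.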
s'marks
[1] https://msdn.microsoft.com/en-us/library/windows/desktop/bb759947(v=vs.85).aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761(v=vs.85).aspx
[3] https://developer.apple.com/documentation/foundation/nsstring/1409742-localizedstandardcompare?language=objc
[4] https://developer.apple.com/documentation/swift/string/1408384-localizedstandardcompare
[5] https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
[6] https://www.npmjs.com/package/javascript-natural-sort
[7] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[8] http://userguide.icu-project.org/collation
On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
> Hello!
>
> This is a proposal to provide a String comparator that pays attention to the
> numbers embedded in the strings (should they be present).
>
> This proposal was initially discussed back in 2014 and seemed to bring some
> interest from the community:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html
>
> In the latest webrev two methods are added to the public API:
> j.u.Comparator.comparingNumerically() and
> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
> The regression test is extended to exercise this new comparator.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>
> Comments, suggestions are very welcome!
>