[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Stuart Marks stuart.marks at oracle.com
Wed Aug 9 23:59:25 UTC 2017


On 8/1/17 11:56 PM, Ivan Gerasimov wrote:
> I've tried to go one step further and created even more abstract comparator:  It
> uses a supplied predicate to decompose the input sequences into odd/even
> subsequences (e.g. alpha/numeric) and then uses two separate comparators to
> compare them. Additionally, a comparator for comparing sequences consisting
> only of digits is provided. For example, to build a case-insensitive
> AlphaDecimal comparator one could use: 1) Character::isDigit -- as the predicate
> for decomposing, 2) String::compareToIgnoreCase -- to compare alpha (i.e. odd
> parts); to work with CharSequences one would need to make it
> Comparator.comparing(CharSequence::toString, String::compareToIgnoreCase), 3)
> The special decimal-only comparator, which compares the decimal representation
> of the sequences. Here's the file with all the comparators and a simple test:
> http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java

Hi, a couple follow-up thoughts on this.

1) Supplementary characters

The current code uses Character.isDigit(char), which works only for char values 
in the BMP (basic multilingual plane, values <= U+FFFF). It won't work for 
supplementary characters. There are several blocks of digits in the BMP, but 
there are several more in the supplementary character range.

I don't see any reason not to handle the supplementary characters as well, 
except that it spoils the nice char-by-char technique of processing the string. 
Instead, it'd have to pull in code point values, which might be composed of two 
surrogate chars. There are a variety of methods on Character that help with 
this. Note that there is an overload Character.isDigit(int) which takes any code 
point value, including supplementary characters.
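
To illustrate the difference, here is a minimal sketch (the class and method 
names are made up for this example, not taken from the webrev). It counts 
digits in a string once char-by-char and once by code point, using 
U+1D7D0 (MATHEMATICAL BOLD DIGIT TWO) as a sample supplementary digit.

```java
class CodePointDigits {
    // Hypothetical helper: walks the sequence one char at a time, so the two
    // surrogates of a supplementary digit are tested separately and rejected.
    static int countDigitsByChar(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Hypothetical helper: walks the sequence by code point and uses the
    // Character.isDigit(int) overload, which accepts supplementary characters.
    static int countDigitsByCodePoint(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (Character.isDigit(cp)) {
                n++;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return n;
    }

    public static void main(String[] args) {
        // "a", then U+1D7D0 (stored as two surrogate chars), then "7"
        String s = "a" + new String(Character.toChars(0x1D7D0)) + "7";
        System.out.println(countDigitsByChar(s));      // only the '7' is seen
        System.out.println(countDigitsByCodePoint(s)); // both digits are seen
    }
}
```

The cost is visible in the second loop: the index no longer advances by one, 
which is the "spoils the char-by-char technique" point above.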

2) Too much generality?

This version includes Predicate<Character> for determining whether a character 
is part of the alphabetic or decimal portion of the string. I'm thinking this 
might be overkill. It might be sufficient to "hardwire" the partitioning 
predicate to be Character::isDigit and the value mapping function to use 
Character::digit.

The problem is that adding a predicate opens the door to a lot more complexity, 
while providing diminishing value. First, the predicate would have to handle code 
points (per the above) so it'd need to be an IntPredicate. Second, there would 
also need to be a mapping function from the code point value to a numeric value. 
This might be an IntUnaryOperator. This would allow someone to sort based on 
Roman numerals, using Character::getNumericValue. (Yes, Roman numerals are in 
Unicode.) Or maybe the mapping function should return any Comparable value, not 
an int. ... See where I'm going here?
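
As a rough sketch of what that extra generality would look like (the class and 
constant names below are hypothetical, and pairing a LETTER_NUMBER test with 
getNumericValue is just one assumed way to pick out Roman numerals), compare 
the hardwired decimal pair with a Roman-numeral pair:

```java
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

class NumericMappingSketch {
    // Hardwired decimal case: partition on isDigit(int), map with digit(cp, 10).
    static final IntPredicate IS_DECIMAL = Character::isDigit;
    static final IntUnaryOperator DECIMAL_VALUE = cp -> Character.digit(cp, 10);

    // Assumed "Roman numeral" style pair. Note that testing
    // getNumericValue(cp) >= 0 would not work as the predicate, because
    // getNumericValue also maps the letters a-z to 10..35.
    static final IntPredicate IS_LETTER_NUMBER =
            cp -> Character.getType(cp) == Character.LETTER_NUMBER;
    static final IntUnaryOperator NUMERIC_VALUE = Character::getNumericValue;

    public static void main(String[] args) {
        int romanEight = 0x2167; // ROMAN NUMERAL EIGHT
        int romanFifty = 0x216C; // ROMAN NUMERAL FIFTY

        System.out.println(IS_DECIMAL.test(romanEight));          // not a decimal digit
        System.out.println(DECIMAL_VALUE.applyAsInt(romanEight)); // -1, no decimal value
        System.out.println(IS_LETTER_NUMBER.test(romanEight));    // Roman numerals are Nl
        System.out.println(NUMERIC_VALUE.applyAsInt(romanEight)); // 8
        System.out.println(NUMERIC_VALUE.applyAsInt(romanFifty)); // 50
    }
}
```

Even this small case shows the trouble: the predicate and the mapping function 
have to be chosen as a matching pair, and the mapped values (8, 50) are no 
longer positional digits, so the comparison logic would have to change too.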

Since this kind of sorting is intended to be viewed by people, it's probably 
worth providing full internationalization support (supplementary characters, and 
delegation to sub-comparators, to allow locale-specific collating sequences). 
But I start to question any complexity beyond that.
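
To make that scope concrete, here is a minimal sketch of my own, not the code 
from the webrev. It hardwires Character.isDigit(int) and Character.digit, walks 
the strings by code point, and delegates only the non-digit runs to a supplied 
comparator. All names are hypothetical, and ordering a digit run before a 
non-digit run at the same position is an arbitrary assumption.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class AlphaDecimalSketch {
    // Hypothetical factory: alphaComparator handles the non-digit runs.
    static Comparator<String> alphaDecimal(Comparator<? super String> alphaComparator) {
        return (a, b) -> {
            int i = 0, j = 0;
            while (i < a.length() && j < b.length()) {
                boolean da = Character.isDigit(a.codePointAt(i));
                boolean db = Character.isDigit(b.codePointAt(j));
                int ei = runEnd(a, i, da);
                int ej = runEnd(b, j, db);
                int c;
                if (da && db) {
                    c = compareDigitRuns(a.substring(i, ei), b.substring(j, ej));
                } else if (!da && !db) {
                    c = alphaComparator.compare(a.substring(i, ei), b.substring(j, ej));
                } else {
                    c = da ? -1 : 1; // assumption: digit runs sort first
                }
                if (c != 0) return c;
                i = ei;
                j = ej;
            }
            // Whichever string still has characters left sorts later.
            return Integer.compare(a.length() - i, b.length() - j);
        };
    }

    // End index of the run of digit (or non-digit) code points starting at 'from'.
    static int runEnd(String s, int from, boolean digits) {
        int k = from;
        while (k < s.length()) {
            int cp = s.codePointAt(k);
            if (Character.isDigit(cp) != digits) break;
            k += Character.charCount(cp);
        }
        return k;
    }

    // Compares two digit runs by numeric value, ignoring leading zeros.
    static int compareDigitRuns(String x, String y) {
        int[] dx = significantDigits(x);
        int[] dy = significantDigits(y);
        if (dx.length != dy.length) return Integer.compare(dx.length, dy.length);
        for (int k = 0; k < dx.length; k++) {
            if (dx[k] != dy[k]) return Integer.compare(dx[k], dy[k]);
        }
        return 0;
    }

    static int[] significantDigits(String run) {
        int[] d = run.codePoints().map(cp -> Character.digit(cp, 10)).toArray();
        int start = 0;
        while (start < d.length && d[start] == 0) start++;
        return Arrays.copyOfRange(d, start, d.length);
    }

    public static void main(String[] args) {
        // A Collator is a Comparator<Object>, so it can serve as the sub-comparator.
        Comparator<String> cmp = alphaDecimal(Collator.getInstance(Locale.ENGLISH));
        List<String> names = Arrays.asList("file10", "file9", "file1");
        names.sort(cmp);
        System.out.println(names);
    }
}
```

Note that this sketch treats "007" and "7" as equal, so the resulting order is 
not consistent with equals. A real API would need to decide on a tie-breaker.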

s'marks