[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Mon Sep 25 17:49:07 UTC 2017

Hello!

Could you please review at your convenience?

In the latest webrev I took all suggestions into account (unless I 
missed something.)

http://cr.openjdk.java.net/~igerasim/8134512/04/webrev/

I think that if users find the suggested comparator useful, it may make 
sense to add a String-oriented variant, which can be implemented on top 
of the CharSequence-oriented one as:

class String {
     ...
     @SuppressWarnings("unchecked")
     public static <T extends String> Comparator<T>
     comparingAlphaDecimal(Comparator<? super String> alphaComparator) {
         return (Comparator<T>) (Comparator)
                 new Comparators.AlphaDecimalComparator<>(
                         Objects.requireNonNull(
                                 (Comparator<CharSequence>) alphaComparator),
                         false);
     }
}

This will be safe, since the specification guarantees that 
String.subSequence() returns a String.
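
To illustrate the point about subSequence(), here is a minimal sketch that 
checks the runtime type of the returned sequence. The class and method 
names are made up for this example and are not part of the webrev.

```java
// Hypothetical helper, for illustration only (not part of the webrev).
final class SubSequenceCheck {
    private SubSequenceCheck() {}

    // Returns true if s.subSequence(begin, end) is itself a String instance.
    static boolean subSequenceIsString(String s, int begin, int end) {
        CharSequence cs = s.subSequence(begin, end);
        return cs instanceof String;
    }

    public static void main(String[] args) {
        System.out.println(subSequenceIsString("file10", 4, 6));
    }
}
```

Since every sub-sequence handed to the alpha-comparator is a String, a 
comparator that only accepts Strings never sees another CharSequence type.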

Then in the application code it would be possible to instantiate the 
comparators as

         String.comparingAlphaDecimal(String::compareTo);

         String.comparingAlphaDecimal(String::compareToIgnoreCase);

or, alternatively,

         String.comparingAlphaDecimal(Comparator.naturalOrder());

         String.comparingAlphaDecimal(String.CASE_INSENSITIVE_ORDER);

But this could be deferred for later, of course.
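
For readers who have not opened the webrev, the general idea of an 
alpha-decimal comparison can be sketched as below. This is NOT the code 
from the webrev: the class AlphaDecimalSketch and its helpers are invented 
for this example, the decimal runs are compared through BigInteger for 
brevity, and runs that differ only in leading zeros are treated as equal, 
which is an assumption of the sketch.

```java
import java.math.BigInteger;
import java.util.Comparator;
import java.util.Objects;

// Illustrative sketch only; names and tie-breaking rules are assumptions.
final class AlphaDecimalSketch {
    private AlphaDecimalSketch() {}

    static boolean isDecimalDigit(int cp) {
        return Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER;
    }

    // End index of the run starting at 'from' in which every code point
    // is a decimal digit (digits == true) or a non-digit (digits == false).
    static int runEnd(String s, int from, boolean digits) {
        int i = from;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (isDecimalDigit(cp) != digits) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    // Maps every decimal digit (from any script) to ASCII and parses the value.
    static BigInteger toBigInteger(String digits) {
        StringBuilder sb = new StringBuilder(digits.length());
        digits.codePoints()
              .forEach(cp -> sb.append((char) ('0' + Character.digit(cp, 10))));
        return new BigInteger(sb.toString());
    }

    // Compares two digit runs by numeric value; an empty run sorts first.
    static int compareDecimal(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return Boolean.compare(!a.isEmpty(), !b.isEmpty());
        }
        return toBigInteger(a).compareTo(toBigInteger(b));
    }

    static Comparator<String> comparingAlphaDecimal(
            Comparator<? super String> alphaComparator) {
        Objects.requireNonNull(alphaComparator);
        return (a, b) -> {
            int i = 0, j = 0;
            while (i < a.length() && j < b.length()) {
                // Non-digit runs (possibly empty) go to the alpha-comparator.
                int ie = runEnd(a, i, false), je = runEnd(b, j, false);
                int c = alphaComparator.compare(a.substring(i, ie),
                                                b.substring(j, je));
                if (c != 0) {
                    return c;
                }
                i = ie;
                j = je;
                // Digit runs (possibly empty) are compared numerically.
                ie = runEnd(a, i, true);
                je = runEnd(b, j, true);
                c = compareDecimal(a.substring(i, ie), b.substring(j, je));
                if (c != 0) {
                    return c;
                }
                i = ie;
                j = je;
            }
            // Whichever string still has characters left sorts last.
            return Integer.compare(a.length() - i, b.length() - j);
        };
    }
}
```

With this sketch, sorting "file10", "file2", "file1" using 
AlphaDecimalSketch.comparingAlphaDecimal(String::compareTo) gives 
file1, file2, file10 rather than the plain lexicographic order.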

With kind regards,
Ivan

On 8/27/17 1:38 PM, Ivan Gerasimov wrote:
> Hello everyone!
>
> Here's another iteration of the comparator with suggested improvements.
>
> Now there is only one input argument -- the alpha-comparator for 
> comparing the non-decimal-digit sub-sequences.
>
> For the javadoc I used the text suggested by Peter with some 
> modifications, an additional example and API/implementation notes. 
> Overall, the javadoc looks heavier than it needs to be, so I'd love to 
> hear comments about how to make it shorter and cleaner.
>
> Also, I adopted the name AlphaDecimal, suggested by Peter.  This name 
> is among the more popular variants found in the wild, so users are 
> more likely to find the routine by its name.
>
> For testing if a code point is a decimal digit, I used 
> (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER), which seems 
> to be more appropriate than Character.isDigit().  (The latter is true 
> for things like a digit in a circle, superscript, etc., which do not 
> seem to be a part of a decimal number composed of several digits.)
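
As a small illustration of the general-category test quoted above, the 
sketch below classifies a few code points. The class name is made up for 
this example; the code points were picked by hand and are not taken from 
the webrev or its tests.

```java
// Hypothetical helper, for illustration only.
final class DigitCategoryCheck {
    private DigitCategoryCheck() {}

    // True only for code points in the Unicode general category Nd.
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        int[] samples = {
            '7',     // DIGIT SEVEN (category Nd)
            0x0663,  // ARABIC-INDIC DIGIT THREE (category Nd)
            0x2461,  // CIRCLED DIGIT TWO (category No)
            0x00B2   // SUPERSCRIPT TWO (category No)
        };
        for (int cp : samples) {
            System.out.printf("U+%04X decimal digit: %b%n", cp, isDecimalDigit(cp));
        }
    }
}
```

The circled and superscript forms fall into OTHER_NUMBER, so this test 
does not treat them as part of a decimal run.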
>
> The updated webrev:
> http://cr.openjdk.java.net/~igerasim/8134512/04/webrev/
>
> Please review at your convenience.
>
> With kind regards,
> Ivan
>
> On 8/9/17 4:59 PM, Stuart Marks wrote:
>> On 8/1/17 11:56 PM, Ivan Gerasimov wrote:
>>> I've tried to go one step further and created an even more abstract
>>> comparator. It uses a supplied predicate to decompose the input
>>> sequences into odd/even subsequences (e.g. alpha/numeric) and then
>>> uses two separate comparators to compare them. Additionally, a
>>> comparator for comparing sequences consisting only of digits is
>>> provided. For example, to build a case-insensitive AlphaDecimal
>>> comparator one could use:
>>> 1) Character::isDigit -- as the predicate for decomposing,
>>> 2) String::compareToIgnoreCase -- to compare alpha (i.e. odd)
>>> parts; to work with CharSequences one would need to make it
>>> Comparator.comparing(CharSequence::toString,
>>> String::compareToIgnoreCase),
>>> 3) The special decimal-only comparator, which compares the decimal
>>> representation of the sequences.
>>> Here's the file with all the comparators and a simple test:
>>> http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java
>>
>> Hi, a couple follow-up thoughts on this.
>>
>> 1) Supplementary characters
>>
>> The current code uses Character.isDigit(char), which works only for 
>> char values in the BMP (basic multilingual plane, values <= U+FFFF). 
>> It won't work for supplementary characters. There are several blocks 
>> of digits in the BMP, but there are several more in the supplementary 
>> character range.
>>
>> I don't see any reason not to handle the supplementary characters as 
>> well, except that it spoils the nice char-by-char technique of 
>> processing the string. Instead, it'd have to pull in code point 
>> values, which might be comprised of two surrogate chars. There are a 
>> variety of methods on Character that help with this. Note that there 
>> is an overload Character.isDigit(int) which takes any code point 
>> value, including supplementary characters.
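
To make the difference concrete, here is a minimal sketch that scans a 
sequence both ways. The class and method names are made up for this 
example; U+1D7D3 (MATHEMATICAL BOLD DIGIT FIVE) is used as a sample 
supplementary digit.

```java
// Hypothetical helper, for illustration only.
final class CodePointDigits {
    private CodePointDigits() {}

    // Scans char by char: a supplementary digit is seen as two surrogates,
    // neither of which is a digit on its own.
    static int countDigitsByChar(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Scans code point by code point, stepping over surrogate pairs.
    static int countDigitsByCodePoint(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (Character.isDigit(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "v" + new String(Character.toChars(0x1D7D3)) + "7";
        System.out.println(countDigitsByChar(s));
        System.out.println(countDigitsByCodePoint(s));
    }
}
```

The char-based loop only finds the ASCII '7', while the code-point loop 
also finds the supplementary digit, whose value can then be obtained with 
Character.digit(cp, 10).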
>>
>> 2) Too much generality?
>>
>> This version includes Predicate<Character> for determining whether a 
>> character is part of the alphabetic or decimal portion of the string. 
>> I'm thinking this might be overkill. It might be sufficient to 
>> "hardwire" the partitioning predicate to be Character::isDigit and 
>> the value mapping function to use Character::digit.
>>
>> The problem is that adding a predicate opens the door to a lot more 
>> complexity, while providing diminishing value. First, the predicate 
>> would have to handle code points (per the above) so it'd need to be 
>> an IntPredicate. Second, there would also need to be a mapping 
>> function from the code point value to a numeric value. This might be 
>> an IntUnaryOperator. This would allow someone to sort based on Roman 
>> numerals, using Character::getNumericValue. (Yes, Roman numerals are 
>> in Unicode.) Or maybe the mapping function should return any 
>> Comparable value, not an int. ... See where I'm going here?
>>
>> Since this kind of sorting is intended to be viewed by people, it's 
>> probably worth providing full internationalization support 
>> (supplementary characters, and delegation to sub-comparators, to 
>> allow locale-specific collating sequences). But I start to question 
>> any complexity beyond that.
>>
>> s'marks
>>
>

-- 
With kind regards,
Ivan Gerasimov