[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator
Ivan Gerasimov
ivan.gerasimov at oracle.com
Mon Sep 25 17:49:07 UTC 2017
Hello!
Could you please review at your convenience?
In the latest webrev I took all suggestions into account (unless I
missed something.)
http://cr.openjdk.java.net/~igerasim/8134512/04/webrev/
I think, if the suggested comparator is found useful by the users, then
it may make sense to create the String-oriented variant, which can be
implemented through the CharSequence-oriented one as:
class String {
...
@SuppressWarnings("unchecked")
public static <T extends String> Comparator<T>
comparingAlphaDecimal(Comparator<? super String> alphaComparator) {
return (Comparator<T>) (Comparator)
new
Comparators.AlphaDecimalComparator<>(Objects.requireNonNull(
(Comparator<CharSequence>) alphaComparator),
false);
}
}
This will be safe, since the specification guarantees that
String.subSequence() returns a String.
Then in the application code it would be possible to instantiate the
comparators as
String.comparingAlphaDecimal(String::compareTo);
String.comparingAlphaDecimal(String::compareToIgnoreCase);
or, alternatively,
String.comparingAlphaDecimal(Comparator.naturalOrder());
String.comparingAlphaDecimal(String.CASE_INSENSITIVE_ORDER);
But this could be deferred for later, of course.
With kind regards,
Ivan
On 8/27/17 1:38 PM, Ivan Gerasimov wrote:
> Hello everyone!
>
> Here's another iteration of the comparator with suggested improvements.
>
> Now, there is the only input argument -- the alpha-comparator for
> comparing the non-decimal-digit sub-sequences.
>
> For the javadoc I used the text suggested by Peter with some
> modifications, additional example and API/implementation notes.
> Overall, the javadoc looks heavier than need to me, so I'd love to
> hear comments about how to make it shorter and cleaner.
>
> Also, I adopted the name AlphaDecimal, suggested by Peter. This name
> is one of popular in the list of variants found in the wild. So, there
> are higher chances the users can find the routine by its name.
>
> For testing if a code point is a decimal digit, I used
> (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER), which seem
> to be more appropriate than Character.isDigit(). (The later is true
> for things like a digit in a circle, superscript, etc., which do not
> seem to be a part of a decimal number composed of several digits.)
>
> The updated webrev:
> http://cr.openjdk.java.net/~igerasim/8134512/04/webrev/
>
> Please review at your convenience.
>
> With kind regards,
> Ivan
>
> On 8/9/17 4:59 PM, Stuart Marks wrote:
>> On 8/1/17 11:56 PM, Ivan Gerasimov wrote:
>>> I've tried to go one step further and created even more abstract
>>> comparator: It
>>> uses a supplied predicate to decompose the input sequences into
>>> odd/even
>>> subsequences (e.g. alpha/numeric) and then uses two separate
>>> comparator to
>>> compare them. Additionally, a comparator for comparing sequences,
>>> consisting
>>> only of digits is provided. For example, to build a case-insensitive
>>> AlphaDecimal comparator one could use: 1) Character::isDigit -- as
>>> the predicate
>>> for decomposing, 2) String::compareToIgnoreCase -- to compare alpha
>>> (i.e. odd
>>> parts); to work with CharSequences one would need to make it
>>> Comparator.comparing(CharSequence::toString,
>>> String::compareToIgnoreCase), 3)
>>> The special decimal-only comparator, which compares the decimal
>>> representation
>>> of the sequences. Here's the file with all the comparators and a
>>> simple test:
>>> http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java
>>
>> Hi, a couple follow-up thoughts on this.
>>
>> 1) Supplementary characters
>>
>> The current code uses Character.isDigit(char), which works only for
>> char values in the BMP (basic multilingual plane, values <= U+FFFF).
>> It won't work for supplementary characters. There are several blocks
>> of digits in the BMP, but there are several more in the supplementary
>> character range.
>>
>> I don't see any reason not to handle the supplementary characters as
>> well, except that it spoils the nice char-by-char technique of
>> processing the string. Instead, it'd have to pull in code point
>> values, which might be comprised of two surrogate chars. There are a
>> variety of methods on Character that help with this. Note that there
>> is an overload Character.isDigit(int) which takes any code point
>> value, including supplementary characters.
>>
>> 2) Too much generality?
>>
>> This version includes Predicate<Character> for determining whether a
>> character is part of the alphabetic or decimal portion of the string.
>> I'm thinking this might be overkill. It might be sufficient to
>> "hardwire" the partitioning predicate to be Character::isDigit and
>> the value mapping function to use Character::digit.
>>
>> The problem is that adding a predicate opens the door to a lot more
>> complexity, while providing dimishing value. First, the predicate
>> would have to handle code points (per the above) so it'd need to be
>> an IntPredicate. Second, there would also need to be a mapping
>> function from the code point value to a numeric value. This might be
>> an IntUnaryOperator. This would allow someone to sort based on Roman
>> numerals, using Character::getNumericValue. (Yes, Roman numerals are
>> in Unicode.) Or maybe the mapping function should return any
>> Comparable value, not an int. ... See where I'm going here?
>>
>> Since this kind of sorting is intended to be viewed by people, it's
>> probably worth providing full internationalization support
>> (supplementary characters, and delegation to sub-comparators, to
>> allow locale-specific collating sequences). But I start to question
>> any complexity beyond that.
>>
>> s'marks
>>
>
--
With kind regards,
Ivan Gerasimov
More information about the core-libs-dev
mailing list