[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov ivan.gerasimov at oracle.com
Wed Aug 2 06:56:54 UTC 2017


Thank you Peter for sharing your thoughts and experiments!


On 7/29/17 10:22 AM, Peter Levart wrote:
> Hi Ivan,
>
> On 07/28/2017 06:25 PM, Ivan Gerasimov wrote:
>>
>> Hi Peter!
>>
>> Thank a lot for looking into this!
>>
>> On 7/28/17 7:32 AM, Peter Levart wrote:
>>> Hi Ivan,
>>>
>>> In the light of what Stuart Marks wrote, then what do you think 
>>> about a parameterized comparator (would be sub-optimal, I know) 
>>> where one would supply
>>> 2 Comparator(s) which would be used to compare Ax and Nx 
>>> sub-sequences respectively as described below...
>>>
>> Yes.  Inspired by what Stuart suggested I made a draft of such a 
>> comparator (see below).
>> It works, but as you've said it's not that efficient (mostly due to 
>> expensive substrings) and a bit harder to use in a simple case.
>> Now I need to think about how to combine two approaches.
>
> There is a bug in MyComparator. Comparing "1" with "2" gives 0 
> (equal). You must not return "equal" when both alpha prefixes are empty.
>
Yes, you're right.  Should have checked the length of numerical 
substrings, or just use a separate method for end-of-sequence.

>>
>>> For Nx sub-sequences, one would need a comparator comparing the 
>>> reversed sequence lexicographically,
>> I'm not sure I understand why they need to be reversed.
>
> Scrap that. I got confused.  :-o
>
>>> but for Ax sub-sequences, one could choose from a plethora of 
>>> case-(in)sensitive comparator(s) and collators already available on 
>>> the platform.
>>>
>> Yes. In the example below I used compareToIgnoreCase to compare alpha 
>> subsequences.
> ...
>> Arrays.sort(str,new 
>> MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));
> Substrings are by copy. There is another approach. Originally you 
> created the NumericComparator to compare CharSequence(s) - not 
> String(s) as you did with MyComparator. So why not keeping that? 
Well, it was mostly for illustrative purposes.  Of course the final code 
should deal with CharSequences, not Strings for generality. However, I 
assume that in most cases this comparator will need to work with 
Strings, and it's important to remember that String.subSequence() is as 
expensive as String.substring() [1] [1] 
http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#subSequence-int-int- 

> While talking about types, you don't need to have a generic parameter: 
>     NumericComparator<T extends CharSequence> implements 
> CharSequence<T> simple:     NumericComparator implements 
> Comparator<CharSequence> is suitable for comparing any subtype of 
> CharSequence as use-sites should accept Comparator<? super String> or 
> Comparator<? super CharSequence>. 
But it wouldn't work in a case you need to pass the comparator to a 
function of form  fun(Comparator<String> cmp).
> For example:     Stream<T> sorted(Comparator<? super T> comparator); 
> You can use Comparator<CharSequence> to sort Stream<String> as well as 
> Stream<StringBuilder> or Stream<CharSequence>... I tried an approach 
> where sub-sequences are CharSequence objects by delegation and use 
> sub-comparators that don't convert CharSequence(s) to String(s). 
> Here's what this looks like: 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/AlphadecimalComparator.java 
> with supporting sub-comparator classes: 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/CharSequenceComparator.java 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/DecimalComparator.java 
> I created a JMH  benchmark comparing your 02/webrev 
> (ivan.NumericComparator): 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/NumericComparator.java 
> with a modified variant of your MyComparator that works correctly 
> (ivan.MyComparator): 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/MyComparator.java 
> and my (peter.AlphadecimalComparator). The later two are tested with 
> case-sensitive (CS) and case-insensitive (CIS) variants for comparing 
> alpha sub-sequences. Here are the results: 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_speed.txt 
> With -prof gc, it can be seen that JIT is able to optimize-away 
> allocation of sub-sequence CharSequence(s). When the hot code is 
> JIT-ed, at least with provided helper sub-comparators, no allocation 
> is taking place (not so with MyComparator which creates substrings): 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_gc.txt 

That's a very good news then!
> The performance of AlphadecimalComparator is about 50% slower than 
> your specialized NumericComparator, but it allows custom 
> parametrization. Perhaps the decimal part of variations could be 
> hidden behind a boolean or an enum parameter (there is a 
> leading-zeroes-greater, leading-zeroes-less, but also a 
> leading-zeroes-equal variant that makes sense). I tried to do that by 
> making DecimalComparator an enum and restricting the decimalComparato 
> parameter to that type, but maybe specializing code around a boolean 
> flag might yield better performance and DecimalComparator as 
> Comparator<CharSequence> doesn't make much sense outside the 
> AlphadecimalComparator. So what do you think? Also, what do you think 
> about the name? Regards, Peter 
I've tried to go one step further and created even more abstract 
comparator:  It uses a supplied predicate to decompose the input 
sequences into odd/even subsequences (e.g. alpha/numeric) and then uses 
two separate comparator to compare them. Additionally, a comparator for 
comparing sequences, consisting only of digits is provided. For example, 
to build a case-insensitive AlphaDecimal comparator one could use: 1) 
Character::isDigit -- as the predicate for decomposing, 2) 
String::compareToIgnoreCase -- to compare alpha (i.e. odd parts); to 
work with CharSequences one would need to make it 
Comparator.comparing(CharSequence::toString, 
String::compareToIgnoreCase), 3) The special decimal-only comparator, 
which compares the decimal representation of the sequences. Here's the 
file with all the comparators and a simple test: 
http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java

-- 
With kind regards,
Ivan Gerasimov


More information about the core-libs-dev mailing list