[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Peter Levart peter.levart at gmail.com
Sat Jul 29 17:22:33 UTC 2017


Hi Ivan,

On 07/28/2017 06:25 PM, Ivan Gerasimov wrote:
>
> Hi Peter!
>
> Thank a lot for looking into this!
>
> On 7/28/17 7:32 AM, Peter Levart wrote:
>> Hi Ivan,
>>
>> In the light of what Stuart Marks wrote, then what do you think about 
>> a parameterized comparator (would be sub-optimal, I know) where one 
>> would supply
>> 2 Comparator(s) which would be used to compare Ax and Nx 
>> sub-sequences respectively as described below...
>>
> Yes.  Inspired by what Stuart suggested I made a draft of such a 
> comparator (see below).
> It works, but as you've said it's not that efficient (mostly due to 
> expensive substrings) and a bit harder to use in a simple case.
> Now I need to think about how to combine two approaches.

There is a bug in MyComparator. Comparing "1" with "2" gives 0 (equal). 
You must not return "equal" when both alpha prefixes are empty.

>
>> For Nx sub-sequences, one would need a comparator comparing the 
>> reversed sequence lexicographically,
> I'm not sure I understand why they need to be reversed.

Scrap that. I got confused.  :-o

>> but for Ax sub-sequences, one could choose from a plethora of 
>> case-(in)sensitive comparator(s) and collators already available on 
>> the platform.
>>
> Yes. In the example below I used compareToIgnoreCase to compare alpha 
> subsequences.
...
> Arrays.sort(str,new 
> MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));

Substrings are by copy. There is another approach. Originally you 
created the NumericComparator to compare CharSequence(s) - not String(s) 
as you did with MyComparator. So why not keeping that?

While talking about types, you don't need to have a generic parameter:

     NumericComparator<T extends CharSequence> implements CharSequence<T>

simple:

     NumericComparator implements Comparator<CharSequence>

is suitable for comparing any subtype of CharSequence as use-sites 
should accept Comparator<? super String> or Comparator<? super 
CharSequence>. For example:

     Stream<T> sorted(Comparator<? super T> comparator);

You can use Comparator<CharSequence> to sort Stream<String> as well as 
Stream<StringBuilder> or Stream<CharSequence>...


I tried an approach where sub-sequences are CharSequence objects by 
delegation and use sub-comparators that don't convert CharSequence(s) to 
String(s). Here's what this looks like:

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/AlphadecimalComparator.java

with supporting sub-comparator classes:

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/CharSequenceComparator.java
http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/DecimalComparator.java

I created a JMH  benchmark comparing your 02/webrev 
(ivan.NumericComparator):

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/NumericComparator.java

with a modified variant of your MyComparator that works correctly 
(ivan.MyComparator):

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/MyComparator.java

and my (peter.AlphadecimalComparator). The later two are tested with 
case-sensitive (CS) and case-insensitive (CIS) variants for comparing 
alpha sub-sequences.

Here are the results:

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_speed.txt

With -prof gc, it can be seen that JIT is able to optimize-away 
allocation of sub-sequence CharSequence(s). When the hot code is JIT-ed, 
at least with provided helper sub-comparators, no allocation is taking 
place (not so with MyComparator which creates substrings):

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_gc.txt

The performance of AlphadecimalComparator is about 50% slower than your 
specialized NumericComparator, but it allows custom parametrization. 
Perhaps the decimal part of variations could be hidden behind a boolean 
or an enum parameter (there is a leading-zeroes-greater, 
leading-zeroes-less, but also a leading-zeroes-equal variant that makes 
sense). I tried to do that by making DecimalComparator an enum and 
restricting the decimalComparato parameter to that type, but maybe 
specializing code around a boolean flag might yield better performance 
and DecimalComparator as Comparator<CharSequence> doesn't make much 
sense outside the AlphadecimalComparator.

So what do you think? Also, what do you think about the name?

Regards, Peter



More information about the core-libs-dev mailing list