[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator
Ivan Gerasimov
ivan.gerasimov at oracle.com
Wed Aug 2 06:56:54 UTC 2017
Thank you Peter for sharing your thoughts and experiments!
On 7/29/17 10:22 AM, Peter Levart wrote:
> Hi Ivan,
>
> On 07/28/2017 06:25 PM, Ivan Gerasimov wrote:
>>
>> Hi Peter!
>>
>> Thank a lot for looking into this!
>>
>> On 7/28/17 7:32 AM, Peter Levart wrote:
>>> Hi Ivan,
>>>
>>> In the light of what Stuart Marks wrote, then what do you think
>>> about a parameterized comparator (would be sub-optimal, I know)
>>> where one would supply
>>> 2 Comparator(s) which would be used to compare Ax and Nx
>>> sub-sequences respectively as described below...
>>>
>> Yes. Inspired by what Stuart suggested I made a draft of such a
>> comparator (see below).
>> It works, but as you've said it's not that efficient (mostly due to
>> expensive substrings) and a bit harder to use in a simple case.
>> Now I need to think about how to combine two approaches.
>
> There is a bug in MyComparator. Comparing "1" with "2" gives 0
> (equal). You must not return "equal" when both alpha prefixes are empty.
>
Yes, you're right. Should have checked the length of numerical
substrings, or just use a separate method for end-of-sequence.
>>
>>> For Nx sub-sequences, one would need a comparator comparing the
>>> reversed sequence lexicographically,
>> I'm not sure I understand why they need to be reversed.
>
> Scrap that. I got confused. :-o
>
>>> but for Ax sub-sequences, one could choose from a plethora of
>>> case-(in)sensitive comparator(s) and collators already available on
>>> the platform.
>>>
>> Yes. In the example below I used compareToIgnoreCase to compare alpha
>> subsequences.
> ...
>> Arrays.sort(str,new
>> MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));
> Substrings are by copy. There is another approach. Originally you
> created the NumericComparator to compare CharSequence(s) - not
> String(s) as you did with MyComparator. So why not keeping that?
Well, it was mostly for illustrative purposes. Of course the final code
should deal with CharSequences, not Strings for generality. However, I
assume that in most cases this comparator will need to work with
Strings, and it's important to remember that String.subSequence() is as
expensive as String.substring() [1] [1]
http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#subSequence-int-int-
> While talking about types, you don't need to have a generic parameter:
> NumericComparator<T extends CharSequence> implements
> CharSequence<T> simple: NumericComparator implements
> Comparator<CharSequence> is suitable for comparing any subtype of
> CharSequence as use-sites should accept Comparator<? super String> or
> Comparator<? super CharSequence>.
But it wouldn't work in a case you need to pass the comparator to a
function of form fun(Comparator<String> cmp).
> For example: Stream<T> sorted(Comparator<? super T> comparator);
> You can use Comparator<CharSequence> to sort Stream<String> as well as
> Stream<StringBuilder> or Stream<CharSequence>... I tried an approach
> where sub-sequences are CharSequence objects by delegation and use
> sub-comparators that don't convert CharSequence(s) to String(s).
> Here's what this looks like:
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/AlphadecimalComparator.java
> with supporting sub-comparator classes:
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/CharSequenceComparator.java
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/DecimalComparator.java
> I created a JMH benchmark comparing your 02/webrev
> (ivan.NumericComparator):
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/NumericComparator.java
> with a modified variant of your MyComparator that works correctly
> (ivan.MyComparator):
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/MyComparator.java
> and my (peter.AlphadecimalComparator). The later two are tested with
> case-sensitive (CS) and case-insensitive (CIS) variants for comparing
> alpha sub-sequences. Here are the results:
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_speed.txt
> With -prof gc, it can be seen that JIT is able to optimize-away
> allocation of sub-sequence CharSequence(s). When the hot code is
> JIT-ed, at least with provided helper sub-comparators, no allocation
> is taking place (not so with MyComparator which creates substrings):
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_gc.txt
That's a very good news then!
> The performance of AlphadecimalComparator is about 50% slower than
> your specialized NumericComparator, but it allows custom
> parametrization. Perhaps the decimal part of variations could be
> hidden behind a boolean or an enum parameter (there is a
> leading-zeroes-greater, leading-zeroes-less, but also a
> leading-zeroes-equal variant that makes sense). I tried to do that by
> making DecimalComparator an enum and restricting the decimalComparato
> parameter to that type, but maybe specializing code around a boolean
> flag might yield better performance and DecimalComparator as
> Comparator<CharSequence> doesn't make much sense outside the
> AlphadecimalComparator. So what do you think? Also, what do you think
> about the name? Regards, Peter
I've tried to go one step further and created even more abstract
comparator: It uses a supplied predicate to decompose the input
sequences into odd/even subsequences (e.g. alpha/numeric) and then uses
two separate comparator to compare them. Additionally, a comparator for
comparing sequences, consisting only of digits is provided. For example,
to build a case-insensitive AlphaDecimal comparator one could use: 1)
Character::isDigit -- as the predicate for decomposing, 2)
String::compareToIgnoreCase -- to compare alpha (i.e. odd parts); to
work with CharSequences one would need to make it
Comparator.comparing(CharSequence::toString,
String::compareToIgnoreCase), 3) The special decimal-only comparator,
which compares the decimal representation of the sequences. Here's the
file with all the comparators and a simple test:
http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java
--
With kind regards,
Ivan Gerasimov
More information about the core-libs-dev
mailing list