[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Fri Jul 28 16:03:56 UTC 2017

Hi Roger!

On 7/24/17 7:42 AM, Roger Riggs wrote:
> Hi Ivan,
>
> Thanks for bringing this up again.
>
> Some initial comments, not a full review.
>
> Though the enhancement says that no consideration is given to sign 
> characters they
> may produce puzzling results for strings like "-2000" and "-2001" 
> where the strings appear
> to be signed numbers but the comparison will be for the "-" prefix and 
> the rest unsigned.
> Including the word 'unsigned' in the description might help reinforce 
> the semantics.
>
Yes, it's a good point.  I'll add some words to make it certain we only 
recognize unsigned numbers.
Otherwise, it would become confusing when comparing something like 
"2017-07-28" and "2017-07-29".

> Also, I don't see test cases for strings like: "abc200123" and 
> "abc20000123" which share
> a prefix part of which is numeric but differ in the number of 
> "leading" zeros after the prefix.
>
Sure.  Good addition to the test.

> What about strings that include more than 1 numeric segment: 
> abc2003def0001 and abc02003def001
> in the leading zero case?
>
With the first comparator (which treats the numbers with more leading 
zeroes as greater ones), these strings would be sorted as 
"abc2003def0001" < "abc02003def001".
The logic here is as following:
1) skip common prefix "abc",
2) find the numerical parts: "2003" and "02003",
3) after skipping leading zeroes, they are compared to be equal,
4) then, the string with more leading zeroes is said to be greater.

> Though the test adds the @key randomness, it would be useful to use 
> the test library
> mechanisms to report the seed and be able to run the test with a seed, 
> if case they fail.
> (As might be provided by the test library jdk.test.lib.RandomFactory).
>
Okay, I'll add this Random to the shuffling to make it reproducible.
>
> Comparator.java:
> 576: "the common prefix is skipped" is problematic when considering 
> leading zeros.
>    The common prefix may contain non-zero digits.
Yes, of course.
While scanning the string for the common prefix, we still keep track of 
the numeric part.
Probably "skip" is a wrong word here.

> 585: it is not clear whether the "different number of leading zeros" 
> is applied regardless
>    of the common prefix or only starting after the common prefix.
>
When talking about leading zeroes, then the entire numerical substring 
is meant.
Part of this substring can belong to the common prefix.

> 550, 586: for many readers, it is easier to read 'for example' than 
> "e.g." or "i.e.".
>
Thanks!  I'll change it accordingly.

> 562, 598:  editorial: "is to compare" -> "compares"
>
Thanks!
> Comparators.java:
>
> 192: @param for param o is missing;  (the param name "o" usually 
> refers to an object, not a string).
> 200:   Can skipLeadingZeros be coded to correctly work if cnt == 0; it 
> would be more robust
>      SkipLeading zeros works correctly only if pos is given the first 
> numeric digit in the subsequence
>      so the numStart1 and numStart2 must be first digit in each string.
I don't think that skipLeadingZeros() is very specific that it always 
called for longer string, trying to reduce its size via skipping leading 
zeroes.
It wouldn't make sense to call it with cnt == 0, and so we can avoid one 
initial comparing and save a couple of nanoseconds :)
I added this prerequisite to the javadoc of the method, so hopefully it 
shouldn't cause much confusion.

> compare():
>
> Line 223: if strings typically have non-numeric prefixes, it might 
> perform slightly better
> to detect the end of the common prefix and then scan back to find the 
> beginning of the numeric part.
> (Avoiding checking every char for isDigit).
>
On the other hand, we're saving on not calling String.charAt() for the 
second time :)

> Line 224: If assigned for every digit, it will hold the last equal 
> digit in the common prefix, not the first digit.
It will only be assigned when a non-digit is met.
And since the index was just incremented @ Line 222, numStart1 will be 
set to the index of the first digit inside the common prefix.

For example, if the common prefix is abc123, then the numStart1 will be 
updated to 3 when looking at the char 'c'.  Last three cycles of the 
loop it won't be updated, since all the trailing chars are digits.

>
> 239, 240:  chars at o1(index) and o2(index) are already known to be 
> digits, can it be avoided checking them twice
> by starting at index+1?
>
Not quite necessarily.  We can be here due to numLen1 > 0, and *only 
one* or *none* of the string remainders start with digits.
In the later case we'll end up @ line 279.

Here's the updated webrev:
http://cr.openjdk.java.net/~igerasim/8134512/03/webrev/

Given the further discussion this iteration is probably not the final, 
but it's better to have a checkpoint :)

With kind regards,
Ivan

> $.02, Roger
>
>
> On 7/19/2017 4:41 AM, Ivan Gerasimov wrote:
>> Hello!
>>
>> It is a proposal to provide a String comparator, which will pay 
>> attention to the numbers embedded into the strings (should they 
>> present).
>>
>> This proposal was initially discussed back in 2014 and seemed to 
>> bring some interest from the community:
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>>
>>
>> In the latest webrev two methods are added to the public API:
>> j.u.Comparator.comparingNumerically() and
>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>
>> The regression test is extended to exercise this new comparator.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>
>> Comments, suggestions are very welcome!
>>
>
>

-- 
With kind regards,
Ivan Gerasimov