[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

joe darcy joe.darcy at oracle.com
Fri Jul 28 16:48:09 UTC 2017


Hi Ivan,

A few comments.

I don't have a specific suggestion for a different name, but I don't 
think "comparingNumerically" does an adequate job capturing the 
described behavior of the method. Likewise, the summary of the methods' 
behavior in the javadoc should have a more suggestive description of the 
behavior.

Please add comments describing what the "// Fullwidth characters" values 
are in the test.

Interesting test cases would also include substrings for 
Integer.MAX_VALUE, Integer.MAX_VALUE+1, Long.MAX_VALUE, Long.MAX_VALUE + 
1, etc. Test cases of purely numerical and purely alphabetical inputs 
would be warranted too.

While randomizing the array and sorting is a valid test, a more through 
test would do pair-wise comparisons of each array element. While such 
comparisons are quadratic with array length, the arrays here are small.

Cheers,

-Joe


On 7/28/2017 9:03 AM, Ivan Gerasimov wrote:
> Hi Roger!
>
>
> On 7/24/17 7:42 AM, Roger Riggs wrote:
>> Hi Ivan,
>>
>> Thanks for bringing this up again.
>>
>> Some initial comments, not a full review.
>>
>> Though the enhancement says that no consideration is given to sign 
>> characters they
>> may produce puzzling results for strings like "-2000" and "-2001" 
>> where the strings appear
>> to be signed numbers but the comparison will be for the "-" prefix 
>> and the rest unsigned.
>> Including the word 'unsigned' in the description might help reinforce 
>> the semantics.
>>
> Yes, it's a good point.  I'll add some words to make it certain we 
> only recognize unsigned numbers.
> Otherwise, it would become confusing when comparing something like 
> "2017-07-28" and "2017-07-29".
>
>> Also, I don't see test cases for strings like: "abc200123" and 
>> "abc20000123" which share
>> a prefix part of which is numeric but differ in the number of 
>> "leading" zeros after the prefix.
>>
> Sure.  Good addition to the test.
>
>> What about strings that include more than 1 numeric segment: 
>> abc2003def0001 and abc02003def001
>> in the leading zero case?
>>
> With the first comparator (which treats the numbers with more leading 
> zeroes as greater ones), these strings would be sorted as 
> "abc2003def0001" < "abc02003def001".
> The logic here is as following:
> 1) skip common prefix "abc",
> 2) find the numerical parts: "2003" and "02003",
> 3) after skipping leading zeroes, they are compared to be equal,
> 4) then, the string with more leading zeroes is said to be greater.
>
>> Though the test adds the @key randomness, it would be useful to use 
>> the test library
>> mechanisms to report the seed and be able to run the test with a 
>> seed, if case they fail.
>> (As might be provided by the test library jdk.test.lib.RandomFactory).
>>
> Okay, I'll add this Random to the shuffling to make it reproducible.
>>
>> Comparator.java:
>> 576: "the common prefix is skipped" is problematic when considering 
>> leading zeros.
>>    The common prefix may contain non-zero digits.
> Yes, of course.
> While scanning the string for the common prefix, we still keep track 
> of the numeric part.
> Probably "skip" is a wrong word here.
>
>> 585: it is not clear whether the "different number of leading zeros" 
>> is applied regardless
>>    of the common prefix or only starting after the common prefix.
>>
> When talking about leading zeroes, then the entire numerical substring 
> is meant.
> Part of this substring can belong to the common prefix.
>
>> 550, 586: for many readers, it is easier to read 'for example' than 
>> "e.g." or "i.e.".
>>
> Thanks!  I'll change it accordingly.
>
>> 562, 598:  editorial: "is to compare" -> "compares"
>>
> Thanks!
>> Comparators.java:
>>
>> 192: @param for param o is missing;  (the param name "o" usually 
>> refers to an object, not a string).
>> 200:   Can skipLeadingZeros be coded to correctly work if cnt == 0; 
>> it would be more robust
>>      SkipLeading zeros works correctly only if pos is given the first 
>> numeric digit in the subsequence
>>      so the numStart1 and numStart2 must be first digit in each string.
> I don't think that skipLeadingZeros() is very specific that it always 
> called for longer string, trying to reduce its size via skipping 
> leading zeroes.
> It wouldn't make sense to call it with cnt == 0, and so we can avoid 
> one initial comparing and save a couple of nanoseconds :)
> I added this prerequisite to the javadoc of the method, so hopefully 
> it shouldn't cause much confusion.
>
>> compare():
>>
>> Line 223: if strings typically have non-numeric prefixes, it might 
>> perform slightly better
>> to detect the end of the common prefix and then scan back to find the 
>> beginning of the numeric part.
>> (Avoiding checking every char for isDigit).
>>
> On the other hand, we're saving on not calling String.charAt() for the 
> second time :)
>
>> Line 224: If assigned for every digit, it will hold the last equal 
>> digit in the common prefix, not the first digit.
> It will only be assigned when a non-digit is met.
> And since the index was just incremented @ Line 222, numStart1 will be 
> set to the index of the first digit inside the common prefix.
>
> For example, if the common prefix is abc123, then the numStart1 will 
> be updated to 3 when looking at the char 'c'.  Last three cycles of 
> the loop it won't be updated, since all the trailing chars are digits.
>
>>
>> 239, 240:  chars at o1(index) and o2(index) are already known to be 
>> digits, can it be avoided checking them twice
>> by starting at index+1?
>>
> Not quite necessarily.  We can be here due to numLen1 > 0, and *only 
> one* or *none* of the string remainders start with digits.
> In the later case we'll end up @ line 279.
>
> Here's the updated webrev:
> http://cr.openjdk.java.net/~igerasim/8134512/03/webrev/
>
> Given the further discussion this iteration is probably not the final, 
> but it's better to have a checkpoint :)
>
> With kind regards,
> Ivan
>
>> $.02, Roger
>>
>>
>> On 7/19/2017 4:41 AM, Ivan Gerasimov wrote:
>>> Hello!
>>>
>>> It is a proposal to provide a String comparator, which will pay 
>>> attention to the numbers embedded into the strings (should they 
>>> present).
>>>
>>> This proposal was initially discussed back in 2014 and seemed to 
>>> bring some interest from the community:
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>>>
>>>
>>> In the latest webrev two methods are added to the public API:
>>> j.u.Comparator.comparingNumerically() and
>>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>>
>>> The regression test is extended to exercise this new comparator.
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>>
>>> Comments, suggestions are very welcome!
>>>
>>
>>
>



More information about the core-libs-dev mailing list