Small survey about JDK-8280101: in String.split grouped regex should keep the delimiter

Raffaello Giulietti raffaello.giulietti at oracle.com
Fri Mar 31 14:20:27 UTC 2023


Hi Roger,

I agree that Pattern.splitStream() would be hard to use without contextual 
information, so let's drop it from the (still to be written) CSR.

As for the guarantee that the array contains an alternation of n 
substrings and n-1 delimiters, this is what the overload

     String[] split(String regex, int limit, boolean withDelimiters)

is planned to do when passing a negative value as limit.
This is similar to what split(regex,limit) already does today:

"abc".split("\\n", -1) -> String[1] { "abc" }
"\nabc".split("\\n", -1) -> String[2] { "", "abc" }
"abc\n".split("\\n", -1) -> String[2] { "abc", "" }
"\nabc\n".split("\\n", -1) -> String[3] { "", "abc", "" }

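To make the "n substrings and n-1 delimiters" alternation concrete, here 
is a minimal sketch written outside the JDK. The class SplitSketch and the 
helper splitKeepingDelimiters are hypothetical names used only for 
illustration; this is not the planned overload. It ignores zero-length 
matches and has no limit parameter, so it only approximates the behavior 
discussed above for a negative limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSketch {
    // Hypothetical helper (not a JDK API): returns substrings and matched
    // delimiters in alternation, always starting and ending with a substring.
    static List<String> splitKeepingDelimiters(String input, String regex) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int start = 0;
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // assumption: zero-length matches are skipped
            }
            parts.add(input.substring(start, m.start())); // substring before the delimiter
            parts.add(m.group());                         // the delimiter itself
            start = m.end();
        }
        parts.add(input.substring(start)); // trailing substring, possibly empty
        return parts;
    }
}
```

With this sketch, splitKeepingDelimiters("\nabc\n", "\\n") yields the five 
elements "", "\n", "abc", "\n", "": three substrings and two delimiters, 
with empty strings filling in at both ends.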



On 2023-03-31 15:53, Roger Riggs wrote:
> Hi Raffaello,
> 
> It sounds useful to return the delimiters in a new API.
> It might be interesting to guarantee the array returns n strings and n-1 
> delimiters; filling with an empty string at the beginning and end if the 
> input starts with or ends with a delimiter.
> Similar to the construction of TemplatedStrings (JEP 430) that has a 
> predictable number of strings (n+1) and expression values (n).
> 
> An overload of Pattern.splitStream() that alternates would be hard to 
> use since in the stream, it would be hard to distinguish between the 
> delimiters and the strings. Someday, there might be support in streams 
> for grouping.
> 
> Regards, Roger
> 
> 
> 
> 
> On 3/31/23 8:18 AM, Raffaello Giulietti wrote:
>> Hi,
>>
>> JBS issue JDK-8280101 [0] proposes to add functionality to 
>> String.split() to behave more like the Perl equivalent. Rather than 
>> returning only the substrings resulting from the split, the Perl 
>> implementation can return an alternation of the substrings and the 
>> matched delimiters when the delimiter pattern is grouped. Because of 
>> the non-negligible behavioral change this would imply in the JDK 
>> implementation and the impact on existing client code, it cannot be 
>> done as proposed by the issue reporter.
>>
>> However, since implementing the requested behavior outside the JDK is 
>> rather tricky, it would make sense to add an overload of 
>> String.split() that returns the result described in the JBS issue, 
>> that is, an alternation of substrings and delimiters. As a 
>> consequence, a similar overload would be needed in 
>> java.util.regex.Pattern as well, where the bulk of the implementation 
>> underlying String.split() is located. Further, an overload of 
>> Pattern.splitStream() is probably needed as well. Note that both 
>> String and Pattern are final classes, so the overloads are safe to add.
>>
>> As mentioned, the reason to add these overloads to the JDK is that 
>> it is somewhat complicated to implement that behavior outside class 
>> Pattern. The implementation of the extensions in the JDK, on the 
>> contrary, looks rather simple. But before preparing a PR and a CSR, 
>> I'd like to gather more opinions.
>>
>> WDYT?
>>
>>
>> Greetings
>> Raffaello
>>
>> ----
>>
>> [0] https://bugs.openjdk.org/browse/JDK-8280101
> 


More information about the core-libs-dev mailing list