RFR: 8316704: Regex-free parsing of Formatter and FormatProcessor specifiers
Raffaello Giulietti
rgiulietti at openjdk.org
Mon Oct 16 16:18:17 UTC 2023
On Sun, 17 Sep 2023 16:01:33 GMT, Shaojin Wen <duke at openjdk.org> wrote:
> @cl4es made performance optimizations for the simple specifiers of String.format in PR https://github.com/openjdk/jdk/pull/2830. Based on the same idea, I continued to make improvements. I made patterns like %2d %02d also be optimized.
>
> The following are the test results based on MacBookPro M1 Pro:
>
>
> -Benchmark Mode Cnt Score Error Units
> -StringFormat.complexFormat avgt 15 1862.233 ? 217.479 ns/op
> -StringFormat.int02Format avgt 15 312.491 ? 26.021 ns/op
> -StringFormat.intFormat avgt 15 84.432 ? 4.145 ns/op
> -StringFormat.longFormat avgt 15 87.330 ? 6.111 ns/op
> -StringFormat.stringFormat avgt 15 63.985 ? 11.366 ns/op
> -StringFormat.stringIntFormat avgt 15 87.422 ? 0.147 ns/op
> -StringFormat.widthStringFormat avgt 15 250.740 ? 32.639 ns/op
> -StringFormat.widthStringIntFormat avgt 15 312.474 ? 16.309 ns/op
>
> +Benchmark Mode Cnt Score Error Units
> +StringFormat.complexFormat avgt 15 740.626 ? 66.671 ns/op (+151.45)
> +StringFormat.int02Format avgt 15 131.049 ? 0.432 ns/op (+138.46)
> +StringFormat.intFormat avgt 15 67.229 ? 4.155 ns/op (+25.59)
> +StringFormat.longFormat avgt 15 66.444 ? 0.614 ns/op (+31.44)
> +StringFormat.stringFormat avgt 15 62.619 ? 4.652 ns/op (+2.19)
> +StringFormat.stringIntFormat avgt 15 89.606 ? 13.966 ns/op (-2.44)
> +StringFormat.widthStringFormat avgt 15 52.462 ? 15.649 ns/op (+377.95)
> +StringFormat.widthStringIntFormat avgt 15 101.814 ? 3.147 ns/op (+206.91)
For maintainability purposes, we should strive for simplicity and readability.
The parser looks good (but see the small proposed enhancements).
I'll continue with the rest tomorrow.
You might consider this alternative, which IMO is simpler and more readable.
int parse() {
// %[argument_index$][flags][width][.precision][t]conversion
// %(\d+$)?([-#+ 0,(<]*)?(\d+)?(.\d+)?([tT])?([a-zA-Z%])
parseArgument();
parseFlag();
parseWidth();
int precisionSize = parsePrecision();
if (precisionSize < 0) {
return 0;
}
// ([tT])?([a-zA-Z%])
char t = '\0', conversion = '\0';
if ((c == 't' || c == 'T') && off + 1 < max) {
char c1 = s.charAt(off + 1);
if (isConversion(c1)) {
t = c;
conversion = c1;
off += 2;
}
}
if (conversion == '\0' && isConversion(c)) {
conversion = c;
++off;
}
if (argSize + flagSize + widthSize + precisionSize + t + conversion != 0) {
if (al != null) {
FormatSpecifier formatSpecifier
= new FormatSpecifier(s, start, argSize, flagSize, widthSize, precisionSize, t, conversion);
al.add(formatSpecifier);
}
return off - start;
}
return 0;
}
private void parseArgument() {
// (\d+$)?
int i = off;
for (; i < max && isDigit(c = s.charAt(i)); ++i); // empty body
if (i == max || c != '$') {
c = first;
return;
}
++i; // skip '$'
if (i < max) {
c = s.charAt(i);
}
argSize = i - off;
off = i;
}
private void parseFlag() {
// ([-#+ 0,(<]*)?
int i = off;
for (; i < max && Flags.isFlag(c = s.charAt(i)); ++i); // empty body
flagSize = i - off;
off = i;
}
private void parseWidth() {
// (\d+)?
int i = off;
for (; i < max && isDigit(c = s.charAt(i)); ++i); // empty body
widthSize = i - off;
off = i;
}
private int parsePrecision() {
// (.\d+)?
if (c != '.') {
return 0;
}
int i = off + 1;
for (; i < max && isDigit(c = s.charAt(i)); ++i); // empty body
if (i == off + 1) { // a '.' but no digits
return -1;
}
int size = i - off;
off = i;
return size;
}
Meaningful external reviews take a _lot_ of engineering time on such a popular platform as the JDK.
They make sense only on somehow stable code, otherwise they are a waste of human resources.
So sure, take your time to stabilize your code, make your tests, but please refrain from keep on changing too much once you publish a PR. If reviewers have to face continuous changes, they might become uninterested in reviewing.
src/java.base/share/classes/java/util/Formatter.java line 2899:
> 2897: if (c == '.') {
> 2898: // (\.\d+)?
> 2899: precisionSize = parsePrecision();
`precisionSize == 0` has two different meanings: (1) there's a '.' not followed by digits and (2) there's no precision field at all.
In the proposed [alternative](https://github.com/openjdk/jdk/pull/15776#issuecomment-1737986458), case (1) is indicated by `-1`, while case (2) is indicated by `0`. Besides, the decision is taken in the `parsePrecision()` method, not partly here and partly in the method.
src/java.base/share/classes/java/util/Formatter.java line 2918:
> 2916: conversion = c;
> 2917: ++off;
> 2918: }
Suggestion:
} else if (isConversion(c)) {
conversion = c;
++off;
}
src/java.base/share/classes/java/util/Formatter.java line 2933:
> 2931: private void parseArgument() {
> 2932: // (\d+\$)?
> 2933: for (int size = 0; off < max; c = s.charAt(++off), size++) {
This argument_index parser is incorrect.
Say the invalid "specifier" is `"%12"`.
The parser would throw a `StringIndexOutOfBoundsException` on `s.charAt(++off)`.
src/java.base/share/classes/java/util/Formatter.java line 2934:
> 2932: // (\d+\$)?
> 2933: int i = off;
> 2934: for (; i < max && isDigit(c); c = next(++i)); // empty body
No need for `next()` if done as in the proposed alternative.
src/java.base/share/classes/java/util/Formatter.java line 2947:
> 2945: c = first;
> 2946: }
> 2947: }
I think the logic in the proposed alternative is more readable.
src/java.base/share/classes/java/util/Formatter.java line 2953:
> 2951: // ([-#+ 0,(\<]*)?
> 2952: int i = off;
> 2953: for (; i < max && Flags.isFlag(c); c = next(++i)); // empty body
No need for `next()` if done as in the proposed alternative.
src/java.base/share/classes/java/util/Formatter.java line 2961:
> 2959: // (\d+)?
> 2960: int i = off;
> 2961: for (; i < max && isDigit(c); c = next(++i)); // empty body
You are using `next()` here, but the simpler solution in `parsePrecision()`.
I think consistency simplifies understanding.
src/java.base/share/classes/java/util/Formatter.java line 2966:
> 2964:
> 2965: private int parsePrecision() {
> 2966: // (\.\d+)?
Since processing of `'.'` is done by the caller, the regex pattern in this comment is confusing.
Either prefer handling `'.'` here (see [this](https://github.com/openjdk/jdk/pull/15776#issuecomment-1737986458)), or adjust the regex in the comment.
src/java.base/share/classes/java/util/Formatter.java line 2978:
> 2976: }
> 2977:
> 2978: private char next(int off) {
There's no real need for this more expensive method, compared with a simple `s.charAt(i)`, in the alternative proposed [before](https://github.com/openjdk/jdk/pull/15776#issuecomment-1737986458).
src/java.base/share/classes/java/util/Formatter.java line 3136:
> 3134: int flagEnd = argEnd + flagSize;
> 3135: int widthEnd = flagEnd + widthSize;
> 3136: int precesionEnd = widthEnd + precisionSize;
Suggestion:
int precisionEnd = widthEnd + precisionSize;
-------------
PR Review: https://git.openjdk.org/jdk/pull/15776#pullrequestreview-1647449239
PR Review: https://git.openjdk.org/jdk/pull/15776#pullrequestreview-1660250320
PR Comment: https://git.openjdk.org/jdk/pull/15776#issuecomment-1737986458
PR Comment: https://git.openjdk.org/jdk/pull/15776#issuecomment-1739462969
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339852931
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1347671680
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1338962488
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339856693
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339856725
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339857403
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339844252
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1347679848
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339842414
PR Review Comment: https://git.openjdk.org/jdk/pull/15776#discussion_r1339135086
More information about the core-libs-dev
mailing list