8288298: Resolve parsing ambiguities in UL
John Rose
john.r.rose at oracle.com
Tue Oct 8 19:11:58 UTC 2024
Full disclosure up front: I dislike output formats which are 99%
parseable, but fail to design for full disambiguation of all outputs.
We have some of this sort of technical debt in UL, and we should just
fix it. That way tool vendors will stop bumping into this kind of thing.
I’m going to give a number of opinions toward this goal. As a group,
they are the best way I know (in this moment) to comprehensively fix all
the parsing problems. Many of the opinions align with present UL
realities (happily) and I hope we can adjust remaining UL realities to
reach 100% unambiguous parsing.
A. Decorators must be delimited in such a way that they cannot be
confused with each other or with the following message line.
A1. Therefore, decorator text must never contain the end-decorator
character.
A1a. For most robust, simple parsing, decorator text should not contain
any other relevant delimiter character: Begin decorator, begin message,
newline. (Yes, allowing newline would still allow decorators to be
parsed but imagine the problems. Just forbid any of “[] \n”.)
A1b. To avoid off-by-one problems (and also do a good deed for
multi-line outputs), decorator text must also be non-empty. So no
“[][2s] hello!”
A2. Therefore, message text must never begin with the begin-decorator
character.
A2a. Message text should begin with its own delimiter, not just “any
character other than begin-decorator”. We use space today; good; this
lets simple word-splitting isolate the message (as long as decorators
cannot contain “ ”).
B. Message text should be terminated by a newline, but should not be
subject to any other parsing requirement. Once you split at the first
space, you have your message line, with no further decoding.
B1. Thus, the occasional introduction of doubled backslashes and
backslash-newline is a bad idea. It just introduces more ambiguities.
(“Which grammar was I parsing? Oh, THAT one!?”)
B2. If message text contains embedded newlines, they should be
unambiguously marked, so that the newline that terminates the whole
message can be found.
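A minimal sketch of the split described in points A and B, assuming decorators obey A1 through A1b (non-empty, and free of “[”, “]”, space and newline). The class and method names are hypothetical and exist only for illustration; they are not part of UL or the JDK. Continuation lines are out of scope here, so any line without a well-formed decorator prefix is returned whole as message text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of UL or the JDK.
final class UlLineSplit {
    // A well-formed prefix under A1..A1b: one or more "[text]" groups, where
    // text is non-empty and contains none of '[', ']', ' ', '\n'.
    private static final String PREFIX = "(\\[[^\\[\\] \\n]+\\])+";

    /** Index of the space that ends the decorator prefix, or -1 if there is no prefix. */
    static int prefixEnd(String line) {
        int sp = line.indexOf(' ');
        if (sp > 0 && line.substring(0, sp).matches(PREFIX)) {
            return sp;
        }
        return -1;
    }

    /** Decorator texts without their brackets; empty if the line has no prefix. */
    static List<String> decorators(String line) {
        List<String> out = new ArrayList<>();
        int end = prefixEnd(line);
        if (end < 0) {
            return out;
        }
        // Drop the outermost brackets, then split on the "][" between groups.
        String inner = line.substring(1, end - 1);
        for (String d : inner.split("\\]\\[")) {
            out.add(d);
        }
        return out;
    }

    /** Everything after the first space, with no further decoding (point B). */
    static String message(String line) {
        int end = prefixEnd(line);
        return end < 0 ? line : line.substring(end + 1);
    }
}
```

For “[foo][bar] this is the first line” this yields the decorators “foo” and “bar” and the message “this is the first line”. A line such as “[not a decorator] stands alone” has no valid prefix, because the text before its first space is “[not”.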
C. In the setting of UL, the best way to mark a continuation line is to
vary the syntax at the BEGINNING of the following line, not the END of
the preceding line. This is because UL already has heavy parsing
activity at the beginnings of lines; there is no good reason to add more
parsing activity elsewhere.
C1. The format for a continuation line should be some decorator-like
syntax that is not exactly legal as a real decorator, and so cannot be
confused with it. Something like “[] second line” or “[ ] second
line” or the like. If it were “[ ]” (begin-decorator, space,
end-decorator) then a buggy line-split that was forgetting to look for
continuation lines would produce “] second line” as the message,
which is a good clue about what went wrong.
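To illustrate point C1, here is a small sketch that folds continuation lines into the entry they follow, assuming the marker is “[ ]” followed by the usual space. It also includes a deliberately naive split to show the “] second line” symptom mentioned above. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of UL or the JDK.
final class UlContinuations {
    // Assumed marker: begin-decorator, space, end-decorator, then the message delimiter.
    private static final String MARKER = "[ ] ";

    /** Joins each "[ ] ..." line onto the preceding entry, separated by '\n'. */
    static List<String> fold(List<String> lines) {
        List<String> entries = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(MARKER) && !entries.isEmpty()) {
                int last = entries.size() - 1;
                entries.set(last, entries.get(last) + "\n" + line.substring(MARKER.length()));
            } else {
                // A normal line, or an orphan continuation with nothing before it.
                entries.add(line);
            }
        }
        return entries;
    }

    /** A buggy split that forgets continuation lines: message is whatever follows the first space. */
    static String naiveMessage(String line) {
        int sp = line.indexOf(' ');
        return sp < 0 ? line : line.substring(sp + 1);
    }
}
```

Calling `naiveMessage("[ ] second line")` returns “] second line”, which is the visible clue that the caller skipped the continuation check.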
D. UL lines, along with their associated continuation lines, should
never interrupt each other. Concurrent output should be arranged so
that each line (with its continuation lines) precedes or follows (does
not interrupt) a neighboring line (along with ITS associated
continuation lines).
D1. If continuation lines are very difficult to keep with their leading
UL lines, then we should consider adjusting the syntax to allow
decorations which help match up a line with its continuations. This
seems to require an ID number, which would ideally be given a characteristic
syntax distinct from other decorators. Something like “[#1]” and
with “#” illegal for other decorators (see A1a above).
E. UL is designed for both human readers and mechanical parsers. The above
points support mechanical parsers, including very simple ones, and do
not impair human readers either.
Examples (without ID numbers):
```
[foo][bar] this is the first line
[ ] and this is the second
[ ] and this is the third
[not a decorator] this line has no decorators, and stands alone
this is the first of two, again without decorators
[ ] this is the promised second
```
Note there is no “\n” or “\\”. Those complicate parsers and are
hard to read by humans as well.
With ID numbers (which link together multi-line messages):
```
[foo][bar][#42] this is the first line
[ ][#42] and this is the second
[not a decorator] this line has no decorators, and stands alone
[#99]this is the first of two, again without decorators
[ ][#99] this is the promised second
[ ][#42] and this is the third (for the first line; it got lost in concurrency)
```
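One way a reader of such output might reassemble interleaved messages is to group lines by their sequence ID. The sketch below assumes the “[#N]” syntax from D1 and takes the first such group in a line's prefix as its ID; lines without an ID are skipped. The class name and the regex are illustrative assumptions, not an existing API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of UL or the JDK.
final class UlSequenceGroups {
    // Optional "[ ]" continuation marker, any decorations (lazily), the first
    // "[#N]", an optional space, then the message text.
    private static final Pattern ID_LINE = Pattern.compile(
        "(?:\\[ \\])?(?:\\[[^\\] \\n]+\\])*?\\[#([0-9]+)\\] ?(.*)");

    /** Maps each sequence ID to its message lines, in order of first appearance. */
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ID_LINE.matcher(line);
            if (!m.matches()) {
                continue; // no sequence ID, so the line belongs to no group
            }
            groups.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        return groups;
    }
}
```

Applied to the six lines of the example above, this produces two groups: “42” with three message lines (including the late third one) and “99” with two.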
Here are some possible regexes:
```
// Regexes to recognize and strip decorations.
public static final String DECORATOR_CHAR = "[^] \n]";
public static final String ONE_DECORATION = "\\[(" + DECORATOR_CHAR + "+)\\]";
public static final String DECORATION_PREFIX = "\\A(" + ONE_DECORATION + ")* ?";
public static final String FIRST_DECORATION = "\\A" + ONE_DECORATION;
public static final String SEQUENCE_ID = "\\[#[0-9]+\\]";
public static final String CONTINUATION_PREFIX = "\\A\\[ \\](" + SEQUENCE_ID + ")? ?";
```
For simplicity the syntax allows decorators which look like sequence
IDs, but they should not be emitted, unless they really are sequence
IDs.
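As a rough illustration of how these regexes might be applied, the sketch below copies the constants into a small class and uses `java.util.regex` to strip either prefix from a line. The class and method names are assumptions made for this example and are not taken from the linked test code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper for illustration; only the regex strings come from the message.
final class UlStrip {
    static final String DECORATOR_CHAR = "[^] \n]";
    static final String ONE_DECORATION = "\\[(" + DECORATOR_CHAR + "+)\\]";
    static final String DECORATION_PREFIX = "\\A(" + ONE_DECORATION + ")* ?";
    static final String SEQUENCE_ID = "\\[#[0-9]+\\]";
    static final String CONTINUATION_PREFIX = "\\A\\[ \\](" + SEQUENCE_ID + ")? ?";

    private static final Pattern CONTINUATION = Pattern.compile(CONTINUATION_PREFIX);
    private static final Pattern DECORATION = Pattern.compile(DECORATION_PREFIX);

    /** True if the line starts with the "[ ]" continuation marker. */
    static boolean isContinuation(String line) {
        return CONTINUATION.matcher(line).lookingAt();
    }

    /** Returns the message text with any continuation or decoration prefix removed. */
    static String strip(String line) {
        Matcher c = CONTINUATION.matcher(line);
        if (c.lookingAt()) {
            return line.substring(c.end());
        }
        // DECORATION_PREFIX can match the empty string, so lookingAt() always
        // succeeds; an undecorated line is returned unchanged.
        Matcher d = DECORATION.matcher(line);
        return d.lookingAt() ? line.substring(d.end()) : line;
    }
}
```

The continuation check runs first because “[ ]” is deliberately not a legal decoration, so the decoration pattern alone would leave it in place.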
Test code: https://cr.openjdk.org/~jrose/scripts/LogStripTest.java.txt
I hope this helps. Thanks for working on this stuff, it’s important.
— John
On 7 Oct 2024, at 6:50, Anton Seoane Ampudia wrote:
> Hi all,
>
> During the migration of compiler logs to the UnifiedLogging framework,
> I have observed that multiline logging does not include decorators for
> all the lines, instead only adding them for the first one and leaving
> the rest “dangling”. I have found out that this is already a
> reported issue in
> JDK-8288298 <https://bugs.openjdk.org/browse/JDK-8288298>, and written
> a tentative fix for it.
>
> Some initial testing has been yielding insignificant performance
> changes with normal logging use cases, but before going forward with
> it I would like to request comments and opinions on this. As far
> as I know, it would simplify somewhat “manual reading” of logs, as
> everything starts right now in the same column, as well as automated
> parsing as there would be no line ambiguities. Copying from the JBS
> description:
>
>> log_info(gc)("A\nB"); currently outputs:
>> [0s][gc] A
>> B
>> And after this change will output:
>> [0s][gc] A
>> [1s][gc] B
>>
>> This change allows UL to be parsed by regex. Example for per-line
>> parsing:
>>
>> ^\[ [^\[\]]* \] \[ [^\[\]]* \] (\[ [^\[\]]* \])?
>
> It is worth mentioning that the special case with
> -Xlog:foldmultilines=true is not affected by this (i.e., if
> foldmultilines is set to true we do not carry out the line-by-line
> decorating).
>
> Thanks,
> Antón