8288298: Resolve parsing ambiguities in UL

Tue Oct 8 19:11:58 UTC 2024

Full disclosure up front:  I dislike output formats which are 99% 
parseable, but fail to design for full disambiguation of all outputs.  
We have some of this sort of technical debt in UL, and we should just 
fix it. That way tool vendors will stop bumping into this kind of thing.

I’m going to give a number of opinions toward this goal.  As a group, 
they are the best way I know (in this moment) to comprehensively fix all 
the parsing problems.  Many of the opinions align with present UL 
realities (happily) and I hope we can adjust remaining UL realities to 
reach 100% unambiguous parsing.

A. Decorators must be delimited in such a way that they cannot be 
confused with each other or with the following message line.

A1. Therefore, decorator text must never contain the end-decorator 
character.

A1a. For the most robust, simple parsing, decorator text should not 
contain any other relevant delimiter character either:  begin-decorator, 
begin-message (space), or newline.  (Yes, allowing newline would still 
allow decorators to be parsed, but imagine the problems.  Just forbid 
any of “[] \n”.)

A1b. To avoid off-by-one problems (and also do a good deed for 
multi-line outputs), decorator text must also be non-empty.  So no 
“[][2s] hello!”

A2. Therefore, message text must never begin with the begin-decorator 
character.

A2a. Message text should begin with its own delimiter, not just “any 
character other than begin-decorator”.  We use space today; good; this 
lets simple word-splitting isolate the message (as long as decorators 
cannot contain “ ”).
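
A minimal sketch of rules A1 through A2a on the emitting side, written in Java to match the regex constants later in this message. The class and method names (`UlLineFormat`, `isValidDecoratorText`, `formatLine`) are hypothetical illustrations, not part of HotSpot's UL code.

```java
import java.util.List;

final class UlLineFormat {
    // Rule A1a: characters that decorator text may not contain.
    private static final String FORBIDDEN = "[] \n";

    /** Rules A1, A1a, A1b: non-empty and free of '[', ']', space, newline. */
    static boolean isValidDecoratorText(String text) {
        if (text.isEmpty()) {
            return false;                       // A1b: no "[]"
        }
        for (int i = 0; i < text.length(); i++) {
            if (FORBIDDEN.indexOf(text.charAt(i)) >= 0) {
                return false;                   // A1, A1a
            }
        }
        return true;
    }

    /** Rule A2a: decorators first, then one space, then the raw message. */
    static String formatLine(List<String> decorators, String message) {
        StringBuilder sb = new StringBuilder();
        for (String d : decorators) {
            if (!isValidDecoratorText(d)) {
                throw new IllegalArgumentException("bad decorator text: " + d);
            }
            sb.append('[').append(d).append(']');
        }
        return sb.append(' ').append(message).toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine(List.of("2s", "gc"), "hello!"));
    }
}
```

Because the message always starts with its own space delimiter, a message that itself begins with “[” (rule A2) still cannot be mistaken for a decorator.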

B. Message text should be terminated by a newline, but should not be 
subject to any other parsing requirement.  Once you split at the first 
space, you have your message line, with no further decoding.

B1. Thus, the occasional introduction of doubled backslashes and 
backslash-newline is a bad idea.  It just introduces more ambiguities.  
(“Which grammar was I parsing? Oh, THAT one!?”)

B2. If message text contains embedded newlines, they should be 
unambiguously marked, so that the newline that terminates the whole 
message can be found.
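
Point B can be sketched as follows, assuming decorators never contain a space (A1a) and the line is not a continuation line. The class name `UlSplit` and its method are hypothetical.

```java
final class UlSplit {
    /**
     * Splits a well-formed line at its first space.
     * Returns {decorations, message}; decorations may be empty.
     * The message is returned verbatim, with no unescaping step.
     */
    static String[] splitAtFirstSpace(String line) {
        int sp = line.indexOf(' ');
        if (sp < 0) {
            throw new IllegalArgumentException("no message delimiter: " + line);
        }
        return new String[] { line.substring(0, sp), line.substring(sp + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstSpace("[0s][gc] C:\\temp is not an escape");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Since there is no escape grammar to undo, a backslash in the message is just a backslash.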

C. In the setting of UL, the best way to mark a continuation line is to 
vary the syntax at the BEGINNING of the following line, not the END of 
the preceding line.  This is because UL already has heavy parsing 
activity at the beginnings of lines; there is no good reason to add more 
parsing activity elsewhere.

C1. The format for a continuation line should be some decorator-like 
syntax that is not exactly legal as a real decorator, and so cannot be 
confused with it.  Something like “[] second line” or “[ ] second 
line” or the like.  If it were “[ ]” (begin-decorator, space, 
end-decorator) then a buggy line-split that was forgetting to look for 
continuation lines would produce “] second line” as the message, 
which is a good clue about what went wrong.
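
Here is a rough sketch of C1, assuming the “[ ]” variant is the one chosen as the continuation marker. The names `UlContinuation`, `emit`, and `join` are hypothetical, and the decorations are passed in as an already-formatted string.

```java
import java.util.ArrayList;
import java.util.List;

final class UlContinuation {
    static final String CONTINUATION_MARK = "[ ]";

    /** Emits one physical line per embedded newline in the message. */
    static List<String> emit(String decorations, String message) {
        String[] parts = message.split("\n", -1);   // -1 keeps trailing empty parts
        List<String> out = new ArrayList<>();
        out.add(decorations + " " + parts[0]);
        for (int i = 1; i < parts.length; i++) {
            out.add(CONTINUATION_MARK + " " + parts[i]);
        }
        return out;
    }

    /** Folds continuation lines back into the line that precedes them. */
    static List<String> join(List<String> lines) {
        String prefix = CONTINUATION_MARK + " ";
        List<String> messages = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(prefix) && !messages.isEmpty()) {
                int last = messages.size() - 1;
                messages.set(last, messages.get(last) + "\n"
                        + line.substring(prefix.length()));
            } else {
                // A first line, or an orphaned continuation kept as-is.
                messages.add(line);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        emit("[0s][gc]", "A\nB").forEach(System.out::println);
    }
}
```

All of the parsing stays at the beginning of each line, as point C asks; nothing at the end of a line needs to be inspected.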

D. UL lines, along with their associated continuation lines, should 
never interrupt each other.  Concurrent output should be arranged so 
that each line (with its continuation lines) precedes or follows (does 
not interrupt) its neighboring lines (along with THEIR associated 
continuation lines).

D1. If continuation lines are very difficult to keep with their leading 
UL lines, then we should consider adjusting the syntax to allow 
decorations which help match up a line with its continuations.  This 
seems to require an ID number, which would ideally be given a 
characteristic syntax distinct from other decorators.  Something like 
“[#1]”, with “#” illegal for other decorators (see A1a above).
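
To illustrate D1, the sketch below regroups interleaved lines by their ID. It assumes the “[#N]” decorator comes last, directly before the message's space, and that continuation lines start with “[ ]”. The class `UlRegroup` and its regex are hypothetical, not an existing UL facility.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UlRegroup {
    // Either "[ ]" (continuation) or zero or more ordinary decorators,
    // then the "[#N]" ID, one space, and the message text.
    private static final Pattern ID_LINE = Pattern.compile(
        "(?:\\[ \\]|(?:\\[[^\\]\\[ \\n]+\\])*)\\[#([0-9]+)\\] (.*)");

    /** Maps each ID to its message, continuation lines joined by '\n'. */
    static Map<String, String> regroup(List<String> lines) {
        Map<String, String> byId = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ID_LINE.matcher(line);
            if (m.matches()) {
                // Append to the text already collected for this ID, if any.
                byId.merge(m.group(1), m.group(2), (a, b) -> a + "\n" + b);
            }
            // Lines without an ID are ignored in this sketch.
        }
        return byId;
    }

    public static void main(String[] args) {
        System.out.println(regroup(List.of(
            "[foo][#42] first", "[#99] other", "[ ][#42] second")));
    }
}
```

With IDs, a continuation line that arrives late because of concurrency can still be attached to the right message.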

E. UL is designed for both human readers and mechanical parsers.  The above 
points support mechanical parsers, including very simple ones, and do 
not impair human readers either.

Examples (without ID numbers):

```
[foo][bar] this is the first line
[ ] and this is the second
[ ] and this is the third
  [not a decorator] this line has no decorators, and stands alone
  this is the first of two, again without decorators
[ ] this is the promised second
```

Note there is no “\n” or “\\”.  Those complicate parsers and are 
hard for humans to read as well.

With ID numbers (which link together multi-line messages):

```
[foo][bar][#42] this is the first line
[ ][#42] and this is the second
  [not a decorator] this line has no decorators, and stands alone
  [#99]this is the first of two, again without decorators
[ ][#99] this is the promised second
[ ][#42] and this is the third (for the first line; it got lost in concurrency)
```

Here are some possible regexes:

```
     // Regexes to recognize and strip decorations.
     public static final String DECORATOR_CHAR = "[^] \n]";
     public static final String ONE_DECORATION = "\\[(" + DECORATOR_CHAR + "+)\\]";
     public static final String DECORATION_PREFIX = "\\A(" + ONE_DECORATION + ")* ?";
     public static final String FIRST_DECORATION = "\\A" + ONE_DECORATION;
     public static final String SEQUENCE_ID = "\\[#[0-9]+\\]";
     public static final String CONTINUATION_PREFIX = "\\A\\[ \\](" + SEQUENCE_ID + ")? ?";
```

For simplicity the syntax allows decorators which look like sequence 
IDs, but they should not be emitted, unless they really are sequence 
IDs.
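
As a rough illustration of how these constants might be applied (separate from the linked test code), the sketch below copies a subset of them and strips the prefix from a line. The class `LogStripSketch` and its two helper methods are hypothetical. The continuation check has to come first, because DECORATION_PREFIX also matches the empty string at the start of a “[ ]” line.

```java
import java.util.regex.Pattern;

final class LogStripSketch {
    // Constants copied from the message above.
    static final String DECORATOR_CHAR = "[^] \n]";
    static final String ONE_DECORATION = "\\[(" + DECORATOR_CHAR + "+)\\]";
    static final String DECORATION_PREFIX = "\\A(" + ONE_DECORATION + ")* ?";
    static final String SEQUENCE_ID = "\\[#[0-9]+\\]";
    static final String CONTINUATION_PREFIX =
        "\\A\\[ \\](" + SEQUENCE_ID + ")? ?";

    private static final Pattern CONTINUATION =
        Pattern.compile(CONTINUATION_PREFIX);
    private static final Pattern DECORATION =
        Pattern.compile(DECORATION_PREFIX);

    /** True if the line starts with the "[ ]" continuation marker. */
    static boolean isContinuation(String line) {
        return CONTINUATION.matcher(line).find();
    }

    /** Removes the decoration or continuation prefix, leaving the message. */
    static String stripPrefix(String line) {
        Pattern p = isContinuation(line) ? CONTINUATION : DECORATION;
        return p.matcher(line).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("[foo][bar] this is the first line"));
        System.out.println(stripPrefix("[ ][#42] and this is the second"));
    }
}
```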

Test code: https://cr.openjdk.org/~jrose/scripts/LogStripTest.java.txt

I hope this helps.  Thanks for working on this stuff, it’s important.

— John

On 7 Oct 2024, at 6:50, Anton Seoane Ampudia wrote:

> Hi all,
>
> During the migration of compiler logs to the UnifiedLogging framework, 
> I have observed that multiline logging does not include decorators for 
> all the lines, instead only adding them for the first one and leaving 
> the rest “dangling”. I have found out that this is already a 
> reported issue in 
> JDK-8288298 <https://bugs.openjdk.org/browse/JDK-8288298>, and written 
> a tentative fix for it.
>
> Some initial testing has been yielding insignificant performance 
> changes with normal logging use cases, but before going forward with 
> it I would like to request for comments and opinions on this. As far 
> as I know, it would simplify somewhat “manual reading” of logs, as 
> everything starts right now in the same column, as well as automated 
> parsing as there would be no line ambiguities. Copying from the JBS 
> description:
>
>> log_info(gc)("A\nB"); currently outputs:
>> [0s][gc] A
>> B
>> And after this change will output:
>> [0s][gc] A
>> [1s][gc] B
>>
>> This change allows UL to be parsed by regex. Example for per-line 
>> parsing:
>>
>> ^\[ [^\[\]]* \] \[ [^\[\]]* \] (\[ [^\[\]]* \])?
>
> It is worth mentioning that the special case with 
> -Xlog:foldmultilines=true is not affected by this (i.e., if 
> foldmultilines is set to true we do not carry out the line-by-line 
> decorating).
>
> Thanks,
> Antón