RFR: 8288298: Resolve multiline message parsing ambiguities in UL

Johan Sjölen jsjolen at openjdk.org
Thu Oct 24 09:14:42 UTC 2024


On Mon, 14 Oct 2024 14:15:46 GMT, Antón Seoane <duke at openjdk.org> wrote:

> Tools such as JITWatch parse OpenJDK logs (e.g. logs generated from `LogCompilation`) in order to present interesting data to users. In order for these tools to work reliably, UnifiedLogging (UL) needs to have a consistent output scheme.
> 
> Currently, a UL log message can be formatted in such a way that it looks like a UL decorator prefix, causing issues with parsing. This is because logging functions in UL do not prepend decorators (or any kind of prefix) to newlines in a log message. For example, `log_info(gc)("A\nB");` currently outputs
> 
> [0s][gc] A
> B
> 
> and we could mistakenly interpret `B` as a decorator. Additionally, developers may introduce pseudo-decorators (something that looks like a decorator but is actually part of the log message), yielding an incorrect parse. As a side remark, but also relevant, we hinder human readability when logs appear suddenly skewed at some points.
> 
> The UL framework should offer a robust way to (a) distinguish decorators from messages, and (b) unambiguously group multiline output. This PR aims to achieve both goals through a subset of the changes proposed in [this mail](https://mail.openjdk.org/pipermail/hotspot-dev/2024-October/095284.html), from which the rest of the ideas can be done as future work. This means:
> 
> - Decorators cannot contain the symbols `[` or `]`. The special decorator `[ ]` (with a variable number of whitespace between the brackets) is reserved.
> - We separate decorators from log messages via the first space after a closing bracket in the line.
> - A log message can contain any kind of symbols, and ends with a newline (except in the case for multiline messages).
> - We prepend multiline logging (such as in the example above) with the invalid decorator `[ ]`. The invalid decorator is as wide as the indentation of the rest of the log for easy visual reading. For example:
> 
> [0s][gc] single-line message
> [1s][gc] another single line message
> [2s][gc] first line of a multiline message
> [      ] second and last line of a multiline message
> [3s][gc] another single line message
> 
> Note how this is both unambiguously parseable and human readable.
> 
> For the case where decorators have been disabled, the aforementioned points do not apply (i.e., behaviour remains the same as before). This means no multiline logical connection (such as the one presented above) and no way to separate decorators (the empty set, in this case) from messages. This is intentional as users specifying no decorators expect "raw" output. Additionally, it i...

The proposal itself seems OK. Do we have a regex or something to parse this? I need to think about its unambiguity a bit closer.

It would technically be OK for decorators to have spaces inside of them (and even newlines?) as long as it's not *only* spaces or newlines, right? Anyway, it doesn't particularly matter, the restrictions as set out are fine.

A few code questions, I'll get back to you later on.

The performance degradation is caused by three things:

1. Walking the string with `strlen`
2. Copying the string with `strdup` (or `memcopy` soon)
3. Walk the string to switch out newlines with nul bytes

I happen to know that both steps 1 and 2 are unnecessary, and that re-architecting UL slightly let's us avoid these steps here entirely. We already know the length up the call chain and these strings are actually not `const`.

The third step is necessary, but is also amenable to vectorization.

src/hotspot/share/logging/logFileStreamOutput.cpp line 168:

> 166: 
> 167: int LogFileStreamOutput::write_internal(const LogDecorations& decorations, const char* msg) {
> 168:   int msg_len = (int)strlen(msg);

Not OK with casting this. Add some sort of overflow check instead and only then `static_cast`.

src/hotspot/share/logging/logFileStreamOutput.cpp line 170:

> 168:   int msg_len = (int)strlen(msg);
> 169: 
> 170:   // Do not do anything if foldmultilines has been specified

But we do something, right? Maybe "don't alter the string if _fold_multilines has been specified". Please add a corresponding explanation to the other case, probably something mentioning "canonical representation" or something of the sort.

src/hotspot/share/logging/logFileStreamOutput.cpp line 175:

> 173:   // and then force write_internal_line to print all of them (i.e. not stopping at the
> 174:   // first null but until msg_len bytes are printed)
> 175:   char* dupstr = os::strdup_check_oom(msg, mtLogging);

Don't `strdup`. You have the length, so use `os::malloc` and `msg_len`, check for null yourself. If we are OOM, then drop the message (maybe, or just log it anyway).

src/hotspot/share/logging/logFileStreamOutput.cpp line 178:

> 176:   char* tmp = dupstr;
> 177: 
> 178:   while (*tmp != '\0') {

Use `msg_len` to loop instead of checking for nul byte.

src/hotspot/share/logging/logFileStreamOutput.cpp line 186:

> 184:   os::free(dupstr);
> 185: 
> 186:   return written;

I don't think we need to dupstr this, do we? Can't we do the changes in the original string instead?

-------------

PR Review: https://git.openjdk.org/jdk/pull/21501#pullrequestreview-2367413843
PR Review: https://git.openjdk.org/jdk/pull/21501#pullrequestreview-2373248750
PR Review Comment: https://git.openjdk.org/jdk/pull/21501#discussion_r1799960883
PR Review Comment: https://git.openjdk.org/jdk/pull/21501#discussion_r1799962607
PR Review Comment: https://git.openjdk.org/jdk/pull/21501#discussion_r1803569676
PR Review Comment: https://git.openjdk.org/jdk/pull/21501#discussion_r1803557325
PR Review Comment: https://git.openjdk.org/jdk/pull/21501#discussion_r1799963226


More information about the hotspot-runtime-dev mailing list