RFR: 8325002: Exceptions::fthrow needs to ensure it truncates to a valid utf8 string

Fri Jul 26 21:46:41 UTC 2024

On Fri, 26 Jul 2024 05:36:42 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> Exceptions::fthrow uses a 1024 byte buffer to format the incoming exception message string, but this may not be large enough, leading to truncation. However, we should ensure we truncate to a valid UTF8 sequence.
>> 
>> The process is explained in the code. Thanks to @RogerRiggs and @djelinski for their suggestions on how to tackle this.
>> 
>> Testing:
>>  - new gtest exercises the truncation code with the different possibilities for bad truncation
>>  - tiers 1-3 sanity testing
>> 
>> Thanks.
>
> This is a neat technique, but it won't work for very short strings (e.g. consisting of just one or two multi-byte sequences, the latter being invalid). Reason is that you need a minimal buffer length to do the check safely.
> 
> What you could do alternatively is to allocate the `msg` buffer with 5 leading bytes that you don't use, just zero-initialize. So the string start would be at msg+5. But that way, you can safely overstep the array with e.g. index - 5 without corruption.

Thanks for looking at this @tstuefe  and @djelinski

> src/hotspot/share/utilities/exceptions.cpp line 275:
> 
>> 273:   // we may also have a truncated UTF-8 sequence. In such cases we need to fix the buffer so the UTF-8
>> 274:   // sequence is valid.
>> 275:   if ((ret == -1 || ret >= max_msg_size) && strlen(msg) > 0) {
> 
> You should test for length >= 5 since it is the farthest you could access in `UTF8::truncate_to_legal_utf8` later:
> 
> 
>   for (int index = length - 2; index > 0; index--) {
> ...
>     assert(buffer[index - 3] == 0xED, "malformed sequence");

I will update `truncate_to_legal_utf` to ensure we check for small buffers - though of course we would never expect to pass such a buffer to it in the first place.

> src/hotspot/share/utilities/exceptions.cpp line 277:
> 
>> 275:   if ((ret == -1 || ret >= max_msg_size) && strlen(msg) > 0) {
>> 276:     assert(msg[max_msg_size - 1] == '\0', "should be null terminated");
>> 277:     UTF8::truncate_to_legal_utf8((unsigned char*)msg, max_msg_size);
> 
> Ah, I misread your patch and thought you pass in the strlen of the message to the truncation function, when in fact you pass in the hard coded message buffer size. 
> 
> But that begs the question of why you test strlen above, and more importantly, whether all cases where snprintf returns an error are truncation problems. It could have detected an invalid UTF8 sequence and aborted in the middle of it.

The `strlen` check is to skip the empty buffer you can get on Windows if vsnprintf returns -1 due to overflow of INT_MAX.

We are assuming/requiring that we start with a valid UTF8 sequence and the worst that will happen is that vsnprintf will truncate it.

If we actually got -1 for a conversion error (no way to tell the difference in the two cases) then we would unnecessarily truncate, but we do not expect any such conversion errors - in part because we type check the format specifiers and args and so should never get a mismatch.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20345#issuecomment-2253549010
PR Review Comment: https://git.openjdk.org/jdk/pull/20345#discussion_r1693627660
PR Review Comment: https://git.openjdk.org/jdk/pull/20345#discussion_r1693627154