RFR: 8338257: UTF8 lengths should be size_t not int [v7]

Thomas Stuefe stuefe at openjdk.org
Thu Aug 29 13:09:22 UTC 2024


On Thu, 29 Aug 2024 12:37:47 GMT, David Holmes <dholmes at openjdk.org> wrote:

> > > > Many of these translations seem awkward, since they convert to size_t only to then convert back to int.
> > > 
> > > 
> > > Can you be more specific here please?
> > 
> > 
> > Certainly. For example, this construct:
> > ```
> >     size_t utf8_len = static_cast<size_t>(length);
> >     const char* base = UNICODE::as_utf8(position, utf8_len);
> >     Symbol* sym = SymbolTable::new_symbol(base, checked_cast<int>(utf8_len));
> > ```
> > 
> > 
> >     
> >       
> >     
> > 
> >       
> >     
> > 
> >     
> >   
> > We introduce `utf8_len` as a `size_t` synonym for len, but since it originates from an `int`, its length must be <= INT_MIN. We also assume it is >=0, but we don't check. We feed `utf8_len` into both `UNICODE::as_utf8` and `SymbolTable::new_symbol`. The former takes a size_t, but since we rely on `length` >= 0, we could just as well give it `length`. For `SymbolTable::new_symbol`, we translate `utf8_len` back to `int`, with a check. The check feels superfluous since `utf8_len` came from an int.
> > I assume this verbosity is for the benefit of the code reader, to make intent clear. Otherwise, we could have just continued to use length, and just cast it on the fly to unsigned or to size_t when calling `UNICODE::as_utf8`.
> 
> This is exactly the case I was referring to. The declaration here is:
> 
> ```
>  template<typename T> static char* as_utf8(const T* base, size_t& length);
> ```
> 
> whether length is an IN/OUT parameter that is the int array length going in (hence >= 0 and <= INT_MAX), and the size_t utf8 sequence length coming out. The out coming utf8 length can theoretically by > INT_MAX but if that were the case in this code (which expects to be dealing with names that can be symbols hence < 64K) then that would be a programming error which the checked_cast would catch. And of course new_symbol checks for < 64K.

Oh, I completely missed that!

I wish we would use pointers instead of references in cases like these since that would make the intent immediately clear when looking at the call site.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20560#issuecomment-2317607555


More information about the serviceability-dev mailing list