RFR(xs): 8251255: [linux] Add process-memory information to hs-err and VM.info

Mon Aug 10 09:04:47 UTC 2020

Hi David,

On Mon, Aug 10, 2020 at 10:11 AM David Holmes <david.holmes at oracle.com>
wrote:

> Hi Thomas,
>
> On 10/08/2020 4:13 pm, Thomas Stüfe wrote:
> > On Mon, Aug 10, 2020 at 6:50 AM David Holmes <david.holmes at oracle.com
> > <mailto:david.holmes at oracle.com>> wrote:
> >
> >     Hi Thomas,
> >
> >     On 7/08/2020 12:01 am, Thomas Stüfe wrote:
> >      > Hi,
> >      >
> >      > may I please have reviews for this addition:
> >      >
> >      > JBS: https://bugs.openjdk.java.net/browse/JDK-8251255
> >      > Webrev:
> >      >
> >
> http://cr.openjdk.java.net/~stuefe/webrevs/8251255-add-rss-to-hs-err/webrev.00/webrev/
> >      >
> >      > When analysing a customer memory leak recently I found for all
> >      > the information we have in the error files we miss some really
> >     fundamental
> >      > statistics like process virtual size and rss. Either that or I am
> >     blind :)
> >
> >     This doesn't seem unreasonable, though I don't like to see continued
> >     divergence between what gets reported on Linux versus other
> platforms.
> >
> >
> > Neither do I but Linux is kind of the most important one so I do not
> > feel bad for enhancing only this; and people interested in other
> > platforms can always step in.
>
> Ok.
>
> > My problem here is while adding similar printouts for other platforms is
> > trivial, I do not like printing out stuff I do not have investigated and
> > fully understood / tested before. Which would take more time than I have
> > right now to do correctly for all platforms. See also below, my thoughts
> > about mallinfo.
>
> Sure I understand.
>
> >     Also is this something that is reasonable to always do or should it
> be
> >     added to the set of things under ExtensiveErrorReports?
> >
> >
> > Always, IMHO. There is a lot of other stuff less important which we may
> > shove into ExtensiveErrorReports, but this is very basic.
>
> Ok.
>
> >
> >      > In case I am not blind this patch adds those statistics. I
> >     decided against
> >      > dumping proc files wholesale to the hs-err file, I just print out
> >     what I
> >      > really find useful.
> >
> >     +   FILE* f = ::fopen("/proc/self/status", "r");
> >     +   char buf[256];
> >     +   while (::fgets(buf, sizeof(buf), f) != NULL && num_found <
> >     num_values) {
> >
> >     But without counting the characters (and knowing what is fixed-width
> >     and
> >     what is variable) doesn't this risk getting a partial entry at the
> end
> >     of the buffer such that you won't see it when trying to parse?
> >
> >
> > Not sure if I understand your question. If you mean what happens should
> > the proc file contain a line longer than 256, that is not a problem:
> > fgets stops either on newline or when the buffer is full. We would read
> > the first 256 characters, scan, then the next characters (needlessly),
> > scan (needlessly, not finding anything) and so forth, until we hit a
> > newline. The subsequent call to fgets will start at the next line.
>
> Sorry I was thinking fgets would read 256 chars at a time, not a line of
> up to 256 chars.
>
> > Also, the proc file content is no mystery: it is clearly documented in
> > the kernel man pages (see section about /proc/[pid]/status). So I do not
> > have to assume an arbitrary input.
> >
> >
> >      > Note that among other things I print out the total size of
> >     outstanding heap
> >      > allocations. Note that still the best way to do this is via
> >     mallinfo, and
> >      > that means the returned value is int and may wrap, see coding and
> >     comment.
> >      > Even with this caveat though this info is still very useful.
> >
> >     I am less keen on this bit. Now I understand your query in another
> RFR
> >     about how to flag such glibc'isms so they don't impact building
> against
> >     different libs like musl. :)
> >
> >
> > Funny enough, I read somewhere that mallinfo is a Posix function; I see
> > it mentioned in very old documents from the 80s. But OpenGroup has no
> > mention of it.
>
> I read (Linux manpage) it is from System V Interface Definition but I
> don't see it in the 4th edition PDF.
>
> > About muslc, it is highly probable that it would implement mallinfo(),
>
> No it doesn't and won't as it is considered fundamentally broken due to
> the int issue.
>
> https://www.openwall.com/lists/musl/2018/01/17/2

Good catch, I'll try to work around that.

>
>
> > but the meaning of the content may be different, so the printout should
> > be adjusted for muslc. Same goes probably for all other *nix platforms -
> > we could print out all members, but one should investigate beforehand if
> > their content makes sense and what it means.
> >
> >     Is mallinfo safe to call from a signal handling context? I mean that
> in
> >     a practical sense (ie it doesn't try to acquire internal malloc
> locks)
> >     rather than being designated as async-signal-safe.
> >
> >
> > completely harmless, see int_malloc() in glibc malloc/malloc.c:
> > https://sources.debian.org/src/glibc/2.28-10/malloc/malloc.c/
> >
> > Does no locks, no calldowns to other functions (check_malloc_state is
> > empty in glibc release builds);
>
> Saw your update - ok.
>
> > Also similar coding has been active in our proprietary VM since about
> > five years so I think this is safe.
> >
> > --
> >
> > One other question just occurred to me, I wonder whether long is 64bit
> > on all our 64bit linux builds. Or whether I should use an explicit 64bit
> > type.
>
> "long" will be 32-bit on 32-bit Linux (ILP32) and 64-bit on 64-bit Linux
> (LP64). Does /proc/self/status specify what ranges the values may have?
> Perhaps size_t would be better (even though same size as long) ?
>
>
I will change the variables to size_t (even though the code will not as
compact as it is now).

> Thanks,
> David
> -----
>
>
I will post a new webrev when this is done and tested.

..Thomas

> >
> > Thanks, Thomas
> >
> >
> >     Thanks,
> >     David
> >     -----
> >
> >      > I did a number of manual tests, the numbers seem legit.
> >      >
> >      > Looks like this:
> >      >
> >      > Process Memory:
> >      > Virtual Size: 7494372K (peak: 7559908K)
> >      > Resident Set Size: 271264K (peak: 571348K) (anon: 226460K, file:
> >     44804K,
> >      > shmem: 0K)
> >      > Swapped out: 0K
> >      > C-Heap outstanding allocations: 37845K
> >      >
> >      >
> >      > Here I simulate a C-Heap memory leak:
> >      >
> >      > Process Memory:
> >      > Virtual Size: 9062260K (peak: 9118772K)
> >      > Resident Set Size: 1941776K (peak: 1941776K) (anon: 1896872K,
> >     file: 44904K,
> >      > shmem: 0K)
> >      > Swapped out: 0K
> >      > C-Heap outstanding allocations: 1291984K
> >      >
> >      >
> >      > Same leak, with the mallinfo value wrapped around:
> >      >
> >      > Process Memory:
> >      > Virtual Size: 12343352K (peak: 12445556K)
> >      > Resident Set Size: 5232084K (peak: 5281512K) (anon: 5187052K,
> >     file: 45032K,
> >      > shmem: 0K)
> >      > Swapped out: 0K
> >      > C-Heap outstanding allocations: 15454K (may have wrapped)
> >      >
> >      > Thanks, Thomas
> >      >
> >
>