RFR(M): 8210754: print_location is not reliable enough (printing register info)

Mon Sep 24 14:26:54 UTC 2018

Hi Martin,

Thank you, that looks like a lot of work!

Some remarks:

http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/classfile/classLoaderData.cpp.udiff.html

Can we put all that analysis stuff - e.g.
ClassLoaderDataGraph::is_valid - into DEBUG_ONLY ?

+  if (loader_data == ClassLoaderData::the_null_class_loader_data()) {
+    return true;
+  }

Init dependency here. the_null_class_loader_data() can be NULL early
on, so better test for NULL too.

==========

http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/memory/metaspace.cpp.udiff.html
http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/memory/metaspace.hpp.udiff.html

I feel these functions could be better put into an own file. We
slimmed down metaspace.cpp a lot and removed most into own files.
Either metaDebug.hpp/cpp or a new file under metaspace/...

--

I am not happy with exposing VirtualSpaceNode via metaspace.hpp. I
feel that is not necessary to solve to problem you have. All which is
needed, from the outside, are:

MetaspaceUtils::is_valid(ptr)
MetaspaceUtils::is_valid_range(from, to)

I even would avoid "_in_committed" as a wording, just use "_is_valid",
since this presumes less of the underlying implementation.

I also would make those two functions public, but DEBUG_ONLY, and
remove the vmErrorhelper friendship, but that is just my personal
taste.

--

In all case you could weed out obvious bogus values before actually
iterating the CLDG:

MetaspaceUtils::is_in_committed(void* p)

if (p != NULL)
or even
if (p < (reasonable low value)

may catch a whole range of cases where we have invalid pointers.

In fact, a function like this:

bool looks_like_a_real_pointer(void* p) {

}

could already weed out 80% of all cases, by checking reasonable low
and high limits and checking for alignment.

---

Do we really need the range check? How probable is it that
MetaspaceUtils::is_in_committed(ptr1) and
MetaspaceUtils::is_in_committed(ptr2) are ok but the range is invalid?
(It can happen, sure, but how probable is this?)

---

I cannot comment on the CDS sections. Looks fine to me, but I'm no expert here.

---

http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/memory/metaspace/virtualSpaceList.hpp.udiff.html

I would not expose a "find_enclosing_space" but instead one - possibly
two - functions like this:

VistualSpaceList::is_valid(ptr)
(if we need range checking): VistualSpaceList::is_valid_range(from, to)

And all the way down to VistualSpaceNode.

Reason is that I would not expose functionality to the outside which
is not needed.

==========

http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/utilities/vmErrorHelper.hpp.html

http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/utilities/vmErrorHelper.cpp.html

Style nit: vmErrorHelper -> VMErrorHelper. Or maybe VMErrorUtils?

---

Style: for brevity, I wonder whether we can do away with all the
"_in_committed" and just assume that is implied?

e.g.:

is_range_in_committed_java_heap -> is_range_in_java_heap

or even shorter "is_in_java_heap" since two argument pointers imply we
do range checking.

Same argument goes for the metaspace functions too.

--

  46 bool vmErrorHelper::is_range_in_committed_java_heap(const void*
from, const void* to) {
  47   for (uintptr_t i = (uintptr_t)from; i != (uintptr_t)to; ++i) {
  48     if (!Universe::heap()->is_in((const void*) i)) {
  49       return false;
  50     }
  51   }
  52
  53   return true;
  54 }

Why the cast to uintptr_t?
Also, why bytewise iteration? Why not pagewise like in
vmErrorHelper::is_readable_range?

---

vmErrorHelper::is_range_in_committed_c_heap()

This function name is a blatant lie :-) - to the intent is to know if
a region is readable but not one of the known big JVM regions. I would
split this function in two more basic operations:

vmErrorHelper::is_known_range() - which would return true if a given
range is either heap, metaspace, ...

and the already existing is_readable_range().

Also, I see we use this only in is_valid_symbol(). Is it really
necessary to exclude the remote possibility that a given range is
valid but points to java heap or metaspace?

I would change is_valid_symbol() to:

- is symbol obvious garbage pointer?
- is [symbol, symbol+sizeof(Symbol)) readable?
- extract length. Is length obvious garbage? <0 or very large?
- is byte array readable?
- and if you are really thorough, check symbol byte array for
non-printable chars or zeros.

I would omit a metaspace-contains check or a java-heap-contains check
just to exclude the remote possibility that the range points there.

When in doubt, we may add a "safe_print" function to print out a
suspect char array like the one we extract from a suspect Symbol*,
where we could check each char for "is_ascii" before printing. But
this may be overdoing it.

---

  71 void* vmErrorHelper::load_klass_raw(oop obj) {

Name is misleading. How about "get_klass_pointer()" os similar?

Just dereferencing the given oop without prior checks?

---

  81 void* vmErrorHelper::load_oop_raw(oop obj, int offset) {

same remarks here.

---

bool vmErrorHelper::is_valid_klass(Klass* k) {

- better entry checks.
- alignment checks for sizeof(MetaWord), not for intptr_t
- You can loose the "is_valid_symbol". Symbols live in C heap anyway.
- I feel that the CLDG walk is overdoing it a bit. If the given
pointer points to metaspace, I would say chance is high that thing is
valid.

---

 133 bool vmErrorHelper::is_valid_oop(oop obj) {

- why not using is_valid_klass? to check the extracted klass pointer?

---

oop vmErrorHelper::oop_or_null(address addr) {
  HeapWord* p = Universe::heap()->block_start(addr);
  // If we couldn't find it it just may mean that heap wasn't parsable
  // See if we were just given an oop directly
  if (p != NULL && Universe::heap()->block_is_obj(p)) {
    return oop(p);
  } else if (p == NULL && is_valid_oop(oop(addr))) {
    return oop(addr);
  }
  return NULL;
}

Can you move the comment into the else branch please?

Also, a short comment would be helpful describing what this function
really does (true for the other functions too).

e.g.:

"Given an address which may point at or into a java object, returns
the oop for the java object. NULL otherwise."

Would it make sense to revert the order of the check? First
is_valid_oop, then heap()->block_start()? Since more frequently one
has an oop, not an address into an object?

---

 166 void vmErrorHelper::dump_code_blob(address addr, CodeBlob* b,
outputStream* st, bool verbose) {

I really do not like the use of ResourceMarks here! This assumes a
whole lot of things work which may be broken already.
Thread::current(), arena allocation etc.

===

http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/src/hotspot/share/runtime/os.cpp.udiff.html

I do not understand why the compressed-oop and non-compressed-oop case
are so different?

In the latter case we do vmErrorHelper::oop_or_null(addr); so we
handle pointer into objects. In the former case we dont. Why?

+      st->print(UINT32_FORMAT " is a compressed pointer to object: ",
narrow_oop);

I would probably print out the decompressed oop too, before calling
oop->print_on.

1127   if (UseCompressedClassPointers && ((uintptr_t)addr &~
(uintptr_t)max_juint) == 0) {

We print klass in case the addr is a compressed klass pointer, but I
did not find a corresponding function if addr is just an uncompressed
klass pointer?

========================================

Bottomline, this all feels a bit on the heavy side for just debug
printing. I would generally prefer all that code being DEBUG_ONLY, if
only for readibility. Also, I think in many cases we could do less
in-depth checking (e.g. walking the CLDG) and instead do some generic
sanity checks for a given value which claims to be a valid
pointer/oop/MetaWord* etc.

E.g. just a simple

vmErrorHelper::looks_like_a_reasonable_(heap|metaspace|whatever)_pointer(addr) {
 - check for NULL
 - check sensible lower and upper limits
 - check valid alignment
}

One even could maintain outer limits in the components: e.g. it would
not be hard to keep track of the highest and lowest ever-allocated
ClassLoaderData*. That way, is_valid_classloaderdata() could just do a
range check between these areas and probably catch 98% of all cases
with garbage input. That way, we would not have to walk data
structures in error handling. (Still I would combine this always with
"is_readable_range").

We do not aim for 100%, just reasonable enough heuristics. Most of the
invalid-data cases are not almost-correct-pointers but complete
garbage, like NULL or very small or large values.

Just my 5 cents

Cheers,

Thomas

On Fri, Sep 14, 2018 at 4:03 PM, Doerr, Martin <martin.doerr at sap.com> wrote:
> Hi,
>
> I'd like to make os::print_location more reliable which is used in error reporting step "printing register info". Oops and Klasses should get inspected more carefully.
> I have seen errors like "[error occurred during error reporting (printing register info), id 0xe0000000, Internal Error (/usr/work/d056149/openjdk/jdk/src/hotspot/share/oops/klass.inline.hpp:63)]" in many hs_err files.
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8210754
>
> Webrev:
> http://cr.openjdk.java.net/~mdoerr/8210754_print_location/webrev.00/
>
> Sometimes, I get such errors when using -XX:+CrashGCForDumpingJavaThread, sometimes when injecting crashing code into compiled methods which I did by the following code:
> http://cr.openjdk.java.net/~mdoerr/crash_C2_method/webrev.00/
> I can also contribute this if it's desired. Automatic tests would certainly be nice to have. Maybe I can find some time for that.
>
> Please review.
>
> Best regards,
> Martin
>