Usage of global scope by Linker is unsafe for pointers to `static` data in loaded library
some-java-user-99206970363698485155 at vodafonemail.de
some-java-user-99206970363698485155 at vodafonemail.de
Thu Sep 26 23:19:29 UTC 2024
Thanks a lot for all this information! Initially I thought this could
really just be solved by changing the default Linker downcall behavior,
but you are right that this would not help in all cases, and might break
existing use cases where the downcall result is newly allocated memory,
and therefore usage of the global scope is fine.
> Can you please share more about your use case?
The use case is https://github.com/tree-sitter/java-tree-sitter
(jtreesitter), but note that I am just a user of it, not a maintainer.
And that I am just starting to get more familiar with it.
In general Tree-sitter works like this (I hope I am noting saying
something wrong):
* You have the library shared by all parsers, see
https://github.com/tree-sitter/tree-sitter/blob/master/lib/include/tree_sitter/api.h
(and corresponding C code)
* You have specific parser implementations (whose C code is generated
by the tree-sitter CLI), for example
https://github.com/tree-sitter/tree-sitter-json/blob/master/src/parser.c
The parser implementations actually only have a single public function
`tree_sitter_...`, which returns a pointer to a TSLanguage struct with a
fixed binary format expected by the tree-sitter library. For example for
the JSON parser that is `const TSLanguage *tree_sitter_json(void) { ...
}`. The connection between the library and a specific parser happens
through this `tree_sitter_...` function: You call it to load the
TSLanguage struct, then pass it to the library and afterwards can start
using the library with that specific parser. See also
https://tree-sitter.github.io/tree-sitter/using-parsers
The jtreesitter project provides a wrapper around the tree-sitter
library. In jtreesitter this connection between general tree-sitter
library and parser happens with a MemorySegment in the Language
class[1], representing the TSLanguage pointer.
However, manually having to lookup the `tree_sitter_...` function,
obtain the Linker, create the function descriptor and perform the
downcall seemed a bit error-prone (especially for users new to the
foreign API), and repetitive because it is the same for all parser
implementations, I think (`tree_sitter_json`, `tree_sitter_python`,
...). Therefore I suggested to create a convenience method which only
takes a SymbolLookup, which was then implemented by the maintainer[2].
But only while using jtreesitter more I noticed this issue, that the
TSLanguage pointer actually points to static data, and if you unload the
library while you are still using it, the JVM crashes. It turns out this
is partially due to an incorrect (?) `reinterpret` call, but also due to
the Linker using the global scope for the result, see also [3]. I am
suggesting there now to not only take a SymbolLookup but also the
related Arena, to be able to set the Arena on the TSLanguage pointer to
fix its scope.
So maybe this is really a niche use case, and it can be solved by also
separately providing the Arena (though having to provide it separately
is a bit inconvenient).
As for why I am not simply using `Arena.global()` when loading the
parser library: I was thinking more about the case where you
accidentally not use the global Arena, and was hoping for it to still be
(relatively) safe when the Arena is closed too early, throwing an
exception in the worst case, but not crashing the JVM. And also a bit
similar to my previous message in
https://mail.openjdk.org/pipermail/jextract-dev/2024-August/001919.html,
I am still exploring ways to use a temporary file for the library and be
able to unload and delete it before JVM exit (therefore not use
`Arena.global()`).
But you are right, some of my concerns were also a bit theoretical,
extending from the issues I actually encountered.
I hope this topic did not cause too much distraction, and it would
likely be good to see if other use cases actually benefit from
JDK-8340641 (reinterpret with Scope) and JDK-8340642 (scope from
SymbolLookup) as well. Maybe they are really not needed, and users of
the foreign API can easily work around this or adjust their API slightly
to accommodate for this.
Thanks nonetheless for investigating this and thinking about possible
solutions.
Kind regards
[1]:
https://tree-sitter.github.io/java-tree-sitter/io/github/treesitter/jtreesitter/Language.html#%3Cinit%3E(java.lang.foreign.MemorySegment)
[2]:
https://github.com/tree-sitter/java-tree-sitter/commit/078a1c2b7c7ad823850444c3ba35c47379c5bd46
[3]: https://github.com/tree-sitter/java-tree-sitter/pull/37
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20240927/038be45d/attachment.htm>
More information about the panama-dev
mailing list