Usage of global scope by Linker is unsafe for pointers to `static` data in loaded library

some-java-user-99206970363698485155 at vodafonemail.de some-java-user-99206970363698485155 at vodafonemail.de
Thu Sep 26 23:19:29 UTC 2024


Thanks a lot for all this information! Initially I thought this could 
really just be solved by changing the default Linker downcall behavior, 
but you are right that this would not help in all cases, and might break 
existing use cases where the downcall result is newly allocated memory, 
and therefore usage of the global scope is fine.

> Can you please share more about your use case?
The use case is https://github.com/tree-sitter/java-tree-sitter 
(jtreesitter), but note that I am just a user of it, not a maintainer. 
And that I am just starting to get more familiar with it.
In general Tree-sitter works like this (I hope I am noting saying 
something wrong):

  * You have the library shared by all parsers, see
    https://github.com/tree-sitter/tree-sitter/blob/master/lib/include/tree_sitter/api.h
    (and corresponding C code)
  * You have specific parser implementations (whose C code is generated
    by the tree-sitter CLI), for example
    https://github.com/tree-sitter/tree-sitter-json/blob/master/src/parser.c

The parser implementations actually only have a single public function 
`tree_sitter_...`, which returns a pointer to a TSLanguage struct with a 
fixed binary format expected by the tree-sitter library. For example for 
the JSON parser that is `const TSLanguage *tree_sitter_json(void) { ... 
}`. The connection between the library and a specific parser happens 
through this `tree_sitter_...` function: You call it to load the 
TSLanguage struct, then pass it to the library and afterwards can start 
using the library with that specific parser. See also 
https://tree-sitter.github.io/tree-sitter/using-parsers

The jtreesitter project provides a wrapper around the tree-sitter 
library. In jtreesitter this connection between general tree-sitter 
library and parser happens with a MemorySegment in the Language 
class[1], representing the TSLanguage pointer.
However, manually having to lookup the `tree_sitter_...` function, 
obtain the Linker, create the function descriptor and perform the 
downcall seemed a bit error-prone (especially for users new to the 
foreign API), and repetitive because it is the same for all parser 
implementations, I think (`tree_sitter_json`, `tree_sitter_python`, 
...). Therefore I suggested to create a convenience method which only 
takes a SymbolLookup, which was then implemented by the maintainer[2].

But only while using jtreesitter more I noticed this issue, that the 
TSLanguage pointer actually points to static data, and if you unload the 
library while you are still using it, the JVM crashes. It turns out this 
is partially due to an incorrect (?) `reinterpret` call, but also due to 
the Linker using the global scope for the result, see also [3]. I am 
suggesting there now to not only take a SymbolLookup but also the 
related Arena, to be able to set the Arena on the TSLanguage pointer to 
fix its scope.

So maybe this is really a niche use case, and it can be solved by also 
separately providing the Arena (though having to provide it separately 
is a bit inconvenient).

As for why I am not simply using `Arena.global()` when loading the 
parser library: I was thinking more about the case where you 
accidentally not use the global Arena, and was hoping for it to still be 
(relatively) safe when the Arena is closed too early, throwing an 
exception in the worst case, but not crashing the JVM. And also a bit 
similar to my previous message in 
https://mail.openjdk.org/pipermail/jextract-dev/2024-August/001919.html, 
I am still exploring ways to use a temporary file for the library and be 
able to unload and delete it before JVM exit (therefore not use 
`Arena.global()`).

But you are right, some of my concerns were also a bit theoretical, 
extending from the issues I actually encountered.


I hope this topic did not cause too much distraction, and it would 
likely be good to see if other use cases actually benefit from 
JDK-8340641 (reinterpret with Scope) and JDK-8340642 (scope from 
SymbolLookup) as well. Maybe they are really not needed, and users of 
the foreign API can easily work around this or adjust their API slightly 
to accommodate for this.
Thanks nonetheless for investigating this and thinking about possible 
solutions.


Kind regards


[1]: 
https://tree-sitter.github.io/java-tree-sitter/io/github/treesitter/jtreesitter/Language.html#%3Cinit%3E(java.lang.foreign.MemorySegment)
[2]: 
https://github.com/tree-sitter/java-tree-sitter/commit/078a1c2b7c7ad823850444c3ba35c47379c5bd46
[3]: https://github.com/tree-sitter/java-tree-sitter/pull/37

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20240927/038be45d/attachment.htm>


More information about the panama-dev mailing list