Source.baseURL() overhead when using custom URL protocol scheme / stream handler

Tue Jun 28 09:21:41 UTC 2016

Axel,

I’ve filed this bug and looked at various options for fixing it:

https://bugs.openjdk.java.net/browse/JDK-8160435

The simplest solution seems to be to use java.net.URI instead of java.net.URL. It provides an isOpaque() method which will properly recognize your URIs as non-hierarchical. It also provides a resolve() method to get the base URI and is not tied to I/O handlers.
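
As a rough illustration of that idea (a minimal sketch, not the actual Nashorn patch), the base of a script URI can be computed with java.net.URI alone. The class name UriBaseSketch, the helper baseOf(), and the opaque example "webscript:my/module/id" are hypothetical; only URI.create(), isOpaque() and resolve() are real JDK calls.

```java
import java.net.URI;

final class UriBaseSketch {
    // Returns the "directory" part of a script URI as a string,
    // or null when the URI is opaque (non-hierarchical) and has no base.
    static String baseOf(String source) {
        URI uri = URI.create(source);
        if (uri.isOpaque()) {
            return null;
        }
        // Resolving "." against a hierarchical URI strips the last path segment.
        // No URLStreamHandler lookup is involved at any point.
        return uri.resolve(".").toString();
    }

    public static void main(String[] args) {
        System.out.println(baseOf("webscript:///my/module/id")); // hierarchical
        System.out.println(baseOf("webscript:my/module/id"));    // opaque, prints null
    }
}
```

The returned string may not preserve the exact slash form of the input, since java.net.URI rebuilds it from its parsed components, so callers should probably not compare it character by character against the original source URL.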

I’ll be posting a request for review soon.

Hannes

> On 28.06.2016 at 09:02, Hannes Wallnöfer <hannes.wallnoefer at oracle.com> wrote:
> 
> Hi Axel,
> 
> Thanks for the explanation and code to reproduce the problem.
> 
> I’m looking at it right now.
> 
> Hannes
> 
> 
>> On 27.06.2016 at 23:53, Axel Faust <axel.faust.g at googlemail.com> wrote:
>> 
>> Hello,
>> 
>> TL;DR : I use custom URL protocol schemes and stream handlers that are not
>> globally registered. This causes excessive handler resolution overhead in
>> URL.getURLStreamHandler() called implicitly in Source.baseURL(). I can't
>> find a way to avoid this overhead (in JDK 1.8.0_71) without two impossible
>> choices: complete refactoring or registering a JVM global
>> URLStreamHandlerFactory.
>> A test case for sampling the overhead is provided in
>> https://gist.github.com/AFaust/04ec0c65a560e306b6b547dcaf38fd21
>> 
>> 
>> 
>> This is a follow-up to a tweet of mine from yesterday:
>> https://twitter.com/ReluctantBird83/status/747145726703075328
>> In this tweet I was commenting on an observation I made from CPU sampling
>> the current state of my Nashorn-based script engine for the open source ECM
>> platform Alfresco (https://github.com/AFaust/alfresco-nashorn-script-engine
>> ).
>> 
>> What prompted the comment were the following hot spot methods from my
>> jvisualvm sampling session, when I was testing a trivial ReST endpoint
>> backed by a Nashorn-executed script:
>> 
>> "Hot Spots - Method","Self Time [%]","Self Time","Self Time (CPU)","Total Time","Total Time (CPU)","Samples"
>> "java.lang.invoke.LambdaForm$MH.771977685.linkToCallSite()","15.152575","793.365 ms","793.365 ms","1126.483 ms","1126.483 ms","63"
>> "java.net.URL.<init>()","11.350913","594.316 ms","594.316 ms","594.316 ms","594.316 ms","33"
>> "java.lang.Throwable.<init>()","7.248728","379.532 ms","379.532 ms","379.532 ms","379.532 ms","21"
>> [...]
>> "jdk.nashorn.internal.runtime.Source.baseURL()","0.0","0.0 ms","0.0 ms","594.316 ms","594.316 ms","33"
>> [...]
>> [...]
>> 
>> The 1st and 3rd hot spots are directly related to frequently called code in
>> my scripts / my utilities and somewhat expected, but I was not expecting
>> the URL constructor to be up there.
>> The backtraces view of the snapshot showed Source.baseURL() as the
>> immediate and only caller of the URL constructor, even though I have other
>> calls in my code which apparently don't trigger the sampling threshold.
>> The total time per execution of the script is around 50-60ms with few
>> outliers up to 90-100ms (sampling started only after reasonably stable
>> state was reached). Sampling was limited specifically to the jdk.nashorn.*,
>> jdk.internal.* and de.* packages.
>> 
>> A bit of background on my Alfresco Nashorn engine:
>> - embedded into a web application that may potentially run in Tomcat or JEE
>> servers (JBoss, WebSphere...)
>> - JavaScript in Alfresco is extensively used for embedded rules, policies
>> (event handling), ReST API endpoints and server-side UI pre-composition
>> - use of an AMD-like module system allowing flexible extension of script
>> API by 3rd party developers of Alfresco "addons"
>> - one file per module, lazily loaded when required by other module or
>> executed script
>> - frequently used "core" modules will be pre-loaded and cached on startup
>> - scripts are referenced via "logical" URLs using custom protocol schemes
>> to denote different script resolution and load scopes/mechanisms (example:
>> "webscript:///my/module/id" for a module in the lookup scope for ReST
>> endpoint scripts; some scripts may be user-managed within the content
>> repository / database itself)
>> - custom protocol schemes are handled by custom URL stream handlers *NOT*
>> globally registered (to avoid interfering with other web applications or
>> other URL-related functionality in the same JVM)
>> 
>> 
>> It turns out that the last two points are essential. I created a
>> generalised test case in a GitHub gist:
>> https://gist.github.com/AFaust/04ec0c65a560e306b6b547dcaf38fd21
>> Essentially it is URL.getURLStreamHandler() which is responsible for the
>> overhead. The Source.baseURL() creates a "base" name from the source URL
>> and if the protocol is not "file://" then a new URL will be created. Since
>> I use custom URL stream handlers and have not registered a global stream
>> handler factory (and won't ever do so), the new URL will try to resolve the
>> handler via URL.getURLStreamHandler(), go through all the hoops and always
>> fail in the end. A failed resolution is never cached, so every time
>> Source.baseURL() is called this whole process / overhead is repeated.
>> 
>> 
>> I am currently trying to reduce all global overheads of my script engine
>> setup, but can't find a way to avoid this overhead without registering a
>> global URL stream factory, which is out of the question for various reasons
>> (web application; 3rd party loaders; engine-specific semantics) or
>> completely refactoring the engine so all scripts are copied to simple
>> "file://" before execution (requiring constant sync-checking with original
>> script in source storage location).
>> 
>> Ideally, I would like to see options to provide both a base URL myself as
>> pre-resolved information via URLReader/Global.load() and register a custom
>> stream handler factory with my Nashorn engine instance. This would allow
>> "simple" loaders to use simple URL-Strings instead of real URL instances to
>> load script files via Global.load(), as well as "complex" loaders to
>> continue using state-ful custom URL stream handlers where necessary. And it
>> would allow Nashorn to resolve a potential custom URL stream handler before
>> relegating to default JVM global handling if no handler is found.
>> 
>> I am sure I am not aware of all the implications - and certainly I am aware
>> that such a change in a core class might be impossible - but
>> URL.getURLStreamHandler() should really cache failed stream handler
>> resolutions and avoid repeating the entire lookup routine...
>> 
>> 
>> Kind regards, and sorry for this overly long "summary"
>> Axel Faust
>
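
For readers without access to the gist, the failed handler resolution described in the quoted message can be approximated as follows. This is a minimal sketch under assumptions: HandlerLookupSketch, LOCAL_HANDLER and rebuildWithoutHandler() are hypothetical names, and the method only mimics the non-"file" branch of Source.baseURL() as described above (derive a base path, then build a new URL from protocol, host, port and path). It does not measure the overhead; it only shows that the rebuilt URL cannot find a handler.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

final class HandlerLookupSketch {
    // Hypothetical stand-in for a custom stream handler that is passed to
    // individual URL instances and is NOT registered globally.
    static final URLStreamHandler LOCAL_HANDLER = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("connections are not needed for this sketch");
        }
    };

    // Rebuilds a "base" URL from the parts of the source URL without passing
    // a handler, so java.net.URL has to resolve one for the protocol itself.
    // Returns true if that resolution succeeded, false if it failed.
    static boolean rebuildWithoutHandler(URL source) {
        String path = source.getPath();
        String basePath = path.substring(0, path.lastIndexOf('/') + 1);
        try {
            new URL(source.getProtocol(), source.getHost(), source.getPort(), basePath);
            return true;
        } catch (MalformedURLException e) {
            // Unknown protocol: the lookup ran in full and found nothing.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Succeeds because the handler is supplied explicitly.
        URL source = new URL(null, "webscript:///my/module/id", LOCAL_HANDLER);
        System.out.println(rebuildWithoutHandler(source));
    }
}
```

Calling rebuildWithoutHandler() in a loop and sampling it should surface the same URL constructor and Throwable constructor hot spots, assuming no global URLStreamHandlerFactory is installed for the "webscript" scheme.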