[crac] RFR: Correct System.nanotime() value after restore

Radim Vansa duke at openjdk.org
Tue Mar 28 14:33:52 UTC 2023


On Tue, 28 Mar 2023 14:21:20 GMT, Ashutosh Mehra <duke at openjdk.org> wrote:

>> @ashu-mehra The main point of this change is *not* about whether the time being suspended should be observed or not; I am rather worried about moving the process to another system and getting totally nonsense results from nanotime diffs, and broken code.
>> 
>> I understand that observing the suspended time can be subject to further discussion, though I lean towards making that interval visible, as implemented here. Since this fixes some use cases and does not change what wouldn't be broken (on a single system the paused time with the system running would be observed anyway, unless the whole machine was suspended), I suggest merging this as-is without considering the topic resolved forever.
>> 
>> The fact that some timers use this as the time source rather than wall clock time is an implementation detail. Applications performing checkpoint and restore will require some tweaks to perform correctly and I intend to work on ways to deal with timers.
>
>> I am rather worried about moving the process to another system and getting totally nonsense results from nanotime diffs, and broken code.
> 
> @rvansa  I am assuming you are seeing time going backwards with System.nanoTime() calls after restore. Is that correct?
> It's interesting if you are seeing such absurd results after restore, because IIUC criu is using time namespace to update the clock offsets in `/proc/<pid>/timens_offsets`, so I wouldn't expect System.nanoTime() to give absurd results on restore.
> I did some tests with a C program that calls `clock_gettime(CLOCK_MONOTONIC)` before and after restore. In between I restarted my system. I didn't see time going backwards; all I could observe is that the elapsed time between checkpoint and restore was not taken into account.
> 
> Can you please provide more details about your observation of nonsense results? Does the system where the absurd behavior is observed support time namespaces? Which version of criu did you use?

@ashu-mehra To be honest I didn't know about this being handled in CRIU - I thought that the offsets can be set only once in the newly created namespace. Does CRIU create another ns for the process?

I have not observed the problem in practice; rather, I anticipated it because some time ago I ran into a similar problem with GraalVM inlining nanoTime in a static final variable (it caused 100% CPU usage on some machines and normal behaviour on others). I shall rerun the enclosed test without the fix applied, but I think it was failing.
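
To illustrate the failure mode I am worried about, here is a minimal sketch with made-up clock readings. It is not the code in this PR: the class `NanoTimeOffsetSketch` and the helpers `restoreOffset` and `adjusted` are hypothetical names, and it assumes the correction is a single offset derived from the wall-clock time that passed between checkpoint and restore.

```java
public class NanoTimeOffsetSketch {

    // Hypothetical helper: offset to add to raw monotonic readings after restore,
    // so that the adjusted clock continues from the checkpoint value plus the
    // wall-clock time that passed while the process was suspended.
    public static long restoreOffset(long checkpointNanos, long checkpointWallMillis,
                                     long restoreRawNanos, long restoreWallMillis) {
        // Assumption: clamp at zero in case the wall clocks of the two hosts disagree.
        long suspendedNanos =
                Math.max(0L, restoreWallMillis - checkpointWallMillis) * 1_000_000L;
        return checkpointNanos + suspendedNanos - restoreRawNanos;
    }

    // Hypothetical helper: apply the offset to a raw monotonic reading.
    public static long adjusted(long rawNanos, long offset) {
        return rawNanos + offset;
    }

    public static void main(String[] args) {
        // Made-up readings: host A had a large monotonic value at checkpoint,
        // host B booted recently, and 30 s of wall-clock time passed in between.
        long checkpointNanos = 5_000_000_000_000L;
        long checkpointWallMillis = 1_000_000L;
        long restoreRawNanos = 20_000_000_000L;
        long restoreWallMillis = 1_030_000L;

        // Without any correction the diff is negative.
        long naiveDiff = restoreRawNanos - checkpointNanos;

        long offset = restoreOffset(checkpointNanos, checkpointWallMillis,
                                    restoreRawNanos, restoreWallMillis);
        long correctedDiff = adjusted(restoreRawNanos, offset) - checkpointNanos;

        System.out.println("naive diff (ns):     " + naiveDiff);
        System.out.println("corrected diff (ns): " + correctedDiff);
    }
}
```

With these numbers the uncorrected diff is negative, while the corrected diff equals the 30 seconds spent suspended. Whether that interval should be visible at all is the open question discussed above.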

-------------

PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1487007327

