Skip to content

Conversation

@agabrys
Copy link

@agabrys agabrys commented Apr 20, 2022

The activity timeout quite often stops the builds on our servers. Base on the log messages, it shouldn't do it. I analyzed the plugin source code and I figured out that the problem is caused by "lost" killer tasks. When multiple threads operate on the same TimeoutStepExecution object, it is possible that killers are not stopped when should be. I wanted to fix the current implementation, but it is very confusing, for example method interactions ("→" = calls):
resetTimersetupTimer → sets timer with: cancelresetTimersetupTimercancel

We are testing the new implementation on our server and we haven't hit any issues yet.

The introduced changes also made the timer more precise. The current one allowed extending the time by 1/10 time or more.

I'm aware that for some time we will have to use a forked version, so instead of overwriting the class, I introduced a new one - TimeoutStepExecutionThreadSafe. It is used instead of the original one when the org.jenkinsci.plugins.workflow.steps.TimeoutStep.threadsafe is set to true (false by default). It should also make the first review cycle easier, when diff it simple and the original class is easily available to compare with the new one.

I executed the TimeoutStepTest tests with the new implementation and all tests finished successfully.

I attached the built plugin, so you may test it - workflow-basic-steps.zip (941.vdfe1b_a_132c64 + fix) 🙂


git message:

When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues:

  • the implementation of the activity and absolute timeouts have been separated to improve the code readability
  • the tasks executed after a delay are always created in synchronized sections, to prevent losing tasks which should be canceled
  • the timer logic of the activity timeout is changed from always stopping to verifying if the logic should be stopped or continued, so the number of timers is always under control (less instances)
  • the Tick class is replaced by a listener which notifies the step less frequently about the changes. The behavior could be controlled by setting the org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatio property. It informs when the earliest the information about new activities should be sent to the timeout (time * ratio). When there were no activities in that time, then the next activity will be announced right after it has been reported

There are additional changes introduced in this commit:

  • the timeout id is always printed in the log messages which improves debugging (e.g. when timeouts are nested)
  • the logic is more precise. The previous implementation allowed exceeding the timeout by 1/10 of the time (JENKINS-63696). The new stops the process right after the timeout is reached. There is a property to allow exceeding it a little bit - org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision. It is necessary to not abort the logic due to delay in the notification process.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@agabrys
Copy link
Author

agabrys commented Apr 21, 2022

When the review is done and the code is accepted - I'm going to:

  • move all changes from TimeoutStepExecutionThreadSafe to TimeoutStepExecution
  • revert changes from TimeoutStep

@agabrys agabrys force-pushed the bugfix/JENKINS-58752-JENKINS-63696 branch from db22812 to 283c9cd Compare April 21, 2022 09:23
When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues:
- the implementation of the activity and absolute timeouts have been separated to improve the code readability
- the tasks executed after a delay are always created in synchronized sections, to prevent losing tasks which should be canceled
- the timer logic of the activity timeout is changed from always stopping to verifying if the logic should be stopped or continued, so the number of timers is always under control (less instances)
- the `Tick` class is replaced by a listener which notifies the step less frequently about the changes. The behavior could be controlled by setting the `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatio` property. It informs when the earliest the information about new activities should be sent to the timeout (`time * ratio`). When there were no activities in that time, then the next activity will be announced right after it has been reported

There are additional changes introduced in this commit:
- the timeout id is always printed in the log messages which improves debugging (e.g. when timeouts are nested)
- the logic is more precise. The previous implementation allowed exceeding the timeout by 1/10 of the time (JENKINS-63696). The new stops the process right after the timeout is reached. There is a property to allow exceeding it a little bit - `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision`. It is necessary to not abort the logic due to delay in the notification process.
@agabrys agabrys force-pushed the bugfix/JENKINS-58752-JENKINS-63696 branch from 283c9cd to c44488d Compare April 21, 2022 11:24
@dsteiert
Copy link

The colleague that opened this PR no longer is in the department. To keep this going and keep it up-to-date, I opened a new PR to which I have access and can implement change requests in: #231.

This PR will therefore be closed.

@agabrys agabrys closed this Aug 22, 2022
@agabrys agabrys deleted the bugfix/JENKINS-58752-JENKINS-63696 branch August 22, 2022 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants