-
Notifications
You must be signed in to change notification settings - Fork 15
Add job to cancel workflow on failure #139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Since HPE rdma tests are quickest (only 15-20 min) if they fail, then fail the entire workflow run, all jobs to save some free tier runners
mhae
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
mhae
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I remembered that this probably not going to work.
See https://github.com/orgs/community/discussions/26311
I tried implementing the cancel early on when I did the runners and didn't get it to work.
we cancelling the jobs like this today, see https://github.com/spdk/spdk-ci/blob/main/.github/scripts/parse_gerrit_webhook.sh |
| with: | ||
| path: ./output | ||
| name: hpe-job-nvmf-rdma | ||
| - name: Cancel all jobs enitre workflow run if this job failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the same be added to common tests workflow? In this one common tests part failed after a minute or so, but HPE runner was still busy anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It rarely happens. Since how runner only executes a single test. It was true only when how was offline for a bit…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rarely, but still - why keep it busy in such a case? Honestly it was the first failed build I clicked on, so maybe it happens not as rarely :)
| - name: Cancel all jobs enitre workflow run if this job failed | ||
| if: failure() | ||
| env: | ||
| GH_TOKEN: ${{ github.token }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing the token to a job that is 1) executing scripts from SPDK repo and 2) running on self-hosted runner, allows either to use that token.
Please correct me here, but my understanding is that passing the token or secrets to such job is always risky and should be avoided.
Correct way to add this cancelation would be to add a job on GH hosted runner instead of a step, that could get that token. Considering the purpose of this patch is to decrease time to cancel, I'll leave it up to you if you think it is worth do queue up another job.
@mhae I'm still unsure which part of the linked discussion was concerning, could you clarify ? |
My main concern is whether forwarding a cancel to a currently running job works correctly. I had some issues early on where not all parts of a job running on the self-hosted runner were canceled. Reading through the description again, it seems we're not doing a cancel of the job of the self-hosted runner, is this correct? |
That's correct. This PR is intended to cancel other jobs from the workflow, which at this time is only github-hosted runners. Unfortunately the cancels we already have in parse_gerrit_webhook.sh won't help to alleviate any of those concerns, since it cancels the only running job in workflow from that job. Besides the concern here I think it might be worth giving it a shot, if it handles cancelation of all the running and pending jobs from common_tests. |
mhae
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing my -1
Since HPE rdma tests are quickest (only 15-20 min) if they fail, then fail the entire workflow run, all jobs
to save some free tier runners