Check archived jobs for last known job number before creating cluster#738
Open
timmartin-stripe wants to merge 1 commit intoNetflix:masterfrom
Conversation
We ran into a bug where we deleted a job cluster and then recreated the job cluster with the same name. The old job cluster had 4 stages and the new one had two. When a job was completed, it would write to the archived tables. However, there already existed a job cluster there with the same ID. The KV provider only overwrote the rows for stages 1 and 2. It did not delete the values for stages 3 and 4. When Mantis tried to load the archived job, it would see job metadata indicating 2 stages, but then would receive 4 stages (two from the new job and 4 from the old job). This would lead to the Mantis not loading the job. We could probably consider this a bug in the Dynamo KV Provider, _but_ it felt like we don't want to overwrite archived jobs in any scenario since we'd like to maintain a record of those jobs. Instead, the problem is further upstream. When we create a job, we should be reasonably confident that the Job ID is globally unique. However, when creating a job cluster, the `lastJobCount` value is always set to 0. We should instead check if there are any archived jobs with the same cluster name. If so, we should grab the last value and set that as the last known job number. We desire the following scenario 1. Create a job cluster "MyJob" 2. Create a job "MyJob-1" 3. Delete the job and job cluster 4. Create another job cluster MyJob 5. Create a job "MyJob-2" instead of "MyJob-1" Previously, we would have an archived job "MyJob-1" and an active job "MyJob-1" that are distinct. Stopping the active one would overwrite the archived one.
Andyz26
requested changes
Jan 3, 2025
| // e.g. If a job cluster with the name "MyJobCluster" was deleted, but had run a job, then | ||
| // we'd have a Job ID of `MyJobCluster-1` in the archived jobs table. When the new job cluster | ||
| // came online it may partially overwrite the old archived job if we started a new job with ID MyJobCluster-1. | ||
| List<CompletedJob> completedJobs = jobStore.loadCompletedJobsForCluster(request.getJobClusterDefinition().getName(), 1, null); |
Collaborator
There was a problem hiding this comment.
my main concern here is that this loadCompletedJobsForCluster call's logic is essentially doing a getAll (the limit is applied after getAll rows) while this creation logic is on the api response path with a default 1 sec timeout. Thus the costly call to getAll rows can cause timeout and other error here.
What about moving this to use a different call to only load the single row needed here instead of the full partition?
Another alternative would be to add a flag to the create request to make "global unique job id" an optional argument so we can still disable this call if needed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
We ran into a bug where we deleted a job cluster and then recreated the job cluster with the same name. The old job cluster had 4 stages and the new one had two. When a job was completed, it would write to the archived tables. However, there already existed a job cluster there with the same ID. The KV provider only overwrote the rows for stages 1 and 2. It did not delete the values for stages 3 and 4.
When Mantis tried to load the archived job, it would see job metadata indicating 2 stages, but then would receive 4 stages (two from the new job and 4 from the old job). This would lead to the Mantis not loading the job.
We could probably consider this a bug in the Dynamo KV Provider, but it felt like we don't want to overwrite archived jobs in any scenario since we'd like to maintain a record of those jobs. Instead, the problem is further upstream. When we create a job, we should be reasonably confident that the Job ID is globally unique. However, when creating a job cluster, the
lastJobCountvalue is always set to 0. We should instead check if there are any archived jobs with the same cluster name. If so, we should grab the last value and set that as the last known job number.We desire the following scenario
Previously, we would have an archived job "MyJob-1" and an active job "MyJob-1" that are distinct. Stopping the active one would overwrite the archived one.
Context
Explain context and other details for this pull request.
Checklist
./gradlew buildcompiles code correctly./gradlew testpasses all tests