@smklein commented Jan 15, 2026

Summary

Implements automatic deletion of support bundles to maintain a buffer of free debug datasets for new allocations.

  • Config: Added target_free_datasets and min_bundles_to_keep (Option<NonZeroU32>)
  • DB Query: Added support_bundle_auto_delete() CTE + unit tests
  • Background Task: Added auto_delete_bundles() phase before existing phases
  • Schema: Added index on (state, time_created) for efficient queries
  • omdb: Display auto-deletion report

How it works

The SupportBundleCollector background task now runs three phases:

  1. Auto-delete (new): Marks oldest Active bundles as Destroying when free datasets < target
  2. Cleanup (existing): Cleans up storage and DB for Destroying bundles
  3. Collect (existing): Collects pending bundles

Deletion respects min_bundles_to_keep to protect small systems from aggressive cleanup.
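
As a rough illustration of how the two config values might interact, here is a minimal sketch of the deletion-count arithmetic. The function name `deletions_needed` and its plain `u32` parameters are assumptions made for this example; the actual logic lives in the `support_bundle_auto_delete()` CTE, and how an unset (`None`) config value is treated is not modelled here.

```rust
/// Hypothetical helper: how many Active bundles to mark as Destroying.
///
/// Assumes the count is capped both by the shortfall of free datasets
/// and by the number of Active bundles above `min_bundles_to_keep`.
fn deletions_needed(
    total_datasets: u32,
    used_datasets: u32,
    active_bundles: u32,
    target_free_datasets: u32,
    min_bundles_to_keep: u32,
) -> u32 {
    // Datasets not currently occupied by a bundle.
    let free = total_datasets.saturating_sub(used_datasets);
    // How far below the target we are (zero if the target is already met).
    let shortfall = target_free_datasets.saturating_sub(free);
    // How many Active bundles may be removed without going below the floor.
    let deletable = active_bundles.saturating_sub(min_bundles_to_keep);
    shortfall.min(deletable)
}
```

Under these assumptions, `deletions_needed(10, 9, 5, 3, 2)` yields 2: one dataset is free, the target is three, and three of the five Active bundles sit above the floor of two.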

Tests

  • test_auto_deletion_no_bundles - No deletion when there are enough free datasets and no bundles exist
  • test_auto_deletion_enough_free_datasets - No deletion when free datasets already meet the target
  • test_auto_deletion_deletes_oldest_first - Verifies oldest bundles (by time_created) are selected for deletion
  • test_auto_deletion_respects_min_bundles_to_keep - Deletion is limited when it would leave fewer than min_bundles_to_keep bundles
  • test_auto_deletion_min_bundles_prevents_all_deletion - No deletion when min_bundles_to_keep exceeds active bundle count
  • test_auto_deletion_only_selects_active_bundles - Only Active bundles are deleted; Collecting/Destroying bundles are skipped but still count as occupying datasets
  • test_auto_deletion_verifies_state_transition - Verifies bundles are actually transitioned to Destroying state in the database
  • test_auto_deletion_failed_bundles_dont_occupy_datasets - Failed bundles don't count toward used dataset count (their dataset was expunged)
  • test_auto_delete_query_explains - Validates the CTE is valid SQL via EXPLAIN
  • expectorate_auto_delete_query - Captures SQL output for inspection/change detection
  • Integration test in test_support_bundle_auto_deletion()
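
The dataset accounting these tests describe can be sketched as a single pass over bundle states. This is only an in-memory model: the `BundleState` enum below lists just the four states named in this description, and `Counts` and `count_bundles` are hypothetical names, not the types used in the PR.

```rust
/// Simplified stand-in for the bundle states mentioned in the tests.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BundleState {
    Collecting,
    Active,
    Destroying,
    Failed,
}

/// Hypothetical result of the accounting pass.
#[derive(Debug, Default, PartialEq, Eq)]
struct Counts {
    used_datasets: u32,
    active_bundles: u32,
}

/// Counts occupied datasets and deletion candidates in one pass.
fn count_bundles(states: &[BundleState]) -> Counts {
    states.iter().fold(Counts::default(), |mut acc, state| {
        // Failed bundles no longer occupy a dataset; every other state does.
        if *state != BundleState::Failed {
            acc.used_datasets += 1;
        }
        // Only Active bundles are candidates for auto-deletion.
        if *state == BundleState::Active {
            acc.active_bundles += 1;
        }
        acc
    })
}
```

For two Active bundles plus one each of Collecting, Destroying, and Failed, this model reports four used datasets and two active bundles.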

Fixes #9660

@smklein marked this pull request as draft on January 15, 2026 21:19
@smklein force-pushed the support-bundle-auto-delete branch from 5bc38cc to c78018e on January 15, 2026 21:26
@smklein force-pushed the support-bundle-auto-delete branch from c78018e to 1c94bc0 on January 15, 2026 21:38

- Use COUNT queries instead of loading all dataset/bundle rows
- Use GROUP BY to get used_datasets and active_bundles in a single query
- Add LIMIT to the deletion candidates query
- Exclude Failed bundles from used_datasets (their dataset was expunged)

These changes allow the auto-deletion query to scale to systems with
thousands of datasets without loading all rows into memory.

Replace the two-step find-then-update approach with a single atomic CTE
query that calculates how many deletions are needed AND performs them
in one database operation.

The previous approach had a time-of-check to time-of-use (TOCTTOU) issue
where multiple Nexuses running concurrently could over-delete bundles:
1. Nexus A queries: free=2, needs 1 deletion -> gets B1
2. Nexus A transitions B1: Active -> Destroying
3. Nexus B queries: free=2 (unchanged, Destroying still occupies), needs 1 -> gets B2
4. Nexus B transitions B2: Active -> Destroying
5. Result: 2 bundles deleted when only 1 was needed

The new atomic query:
- Calculates free datasets and needed deletions
- Respects min_bundles_to_keep constraint
- Finds the N oldest Active bundles
- Transitions them to Destroying state atomically
- Returns the IDs of deleted bundles
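
The select-and-transition part of the steps above can be modelled in memory as follows. This is a sketch under stated assumptions, not the query itself: `Bundle`, `State`, and `mark_oldest_active` are hypothetical names, timestamps are plain integers, and ties on `time_created` are broken by `id` purely for determinism. The real atomicity comes from running everything as one database statement, which an in-memory function cannot demonstrate.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Collecting,
    Active,
    Destroying,
}

#[derive(Clone, Debug)]
struct Bundle {
    id: u32,
    state: State,
    time_created: u64,
}

/// Marks the `n` oldest Active bundles as Destroying and returns their IDs,
/// oldest first. Bundles in any other state are left untouched.
fn mark_oldest_active(bundles: &mut [Bundle], n: usize) -> Vec<u32> {
    // Collect the positions of all Active bundles.
    let mut candidates: Vec<usize> = bundles
        .iter()
        .enumerate()
        .filter(|(_, b)| b.state == State::Active)
        .map(|(i, _)| i)
        .collect();

    // Oldest first; the id tiebreak is an assumption of this sketch.
    candidates.sort_by_key(|&i| (bundles[i].time_created, bundles[i].id));
    candidates.truncate(n);

    // Transition each selected bundle and record its ID.
    let mut deleted = Vec::with_capacity(candidates.len());
    for i in candidates {
        bundles[i].state = State::Destroying;
        deleted.push(bundles[i].id);
    }
    deleted
}
```

Given Active bundles created at times 30, 10, and 20 (IDs 1, 2, and 4) and a Collecting bundle (ID 3), asking for two deletions returns IDs 2 and 4 and leaves bundles 1 and 3 unchanged.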

Also adds tests:
- Verifying that bundles are actually transitioned to Destroying state
- Verifying that Failed bundles don't count as occupying datasets
- An EXPLAIN test for SQL validity
- An expectorate test for SQL output