Skip to content

Comments

status: use hive jobs for probes#4

Open
MitchLewis930 wants to merge 1 commit intopr_044_beforefrom
pr_044_after
Open

status: use hive jobs for probes#4
MitchLewis930 wants to merge 1 commit intopr_044_beforefrom
pr_044_after

Conversation

@MitchLewis930
Copy link

@MitchLewis930 MitchLewis930 commented Jan 30, 2026

PR_044


Note

Medium Risk
Changes the startup/shutdown and readiness/health-reporting mechanics for status probes, which could affect agent health signals or probe lifetime under cancellation/timeouts.

Overview
Status probe startup is now driven by Hive jobs instead of lifecycle hooks. newStatusCollector registers a job.OneShot("probes") that starts probes, waits (with timeout) for the first successful run, marks allProbesInitialized, and closes probes when the job exits.

The previous startStatusCollector goroutine-based startup (and its cell.Health scope reporting) is removed, along with explicit probe shutdown from the OnStop hook; OnStop now only logs a KVStore-not-OK hint.

Written by Cursor Bugbot for commit 829f56e. This will update automatically on new commits. Configure here.

This commit refactores the probe initialization to use a Hive Job instead
of a plain lifecycle start hook. This way we can also get rid of the raw
Go routine to execute the check if every probe successfully executed at
least once before exposing the status.

Note: the kvstore "shutdown check" is still part of its own lifecycle
stop hook. Probably better to eventually move this to the kvstore module.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.


<-ctx.Done()
return nil
}))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probes stopped prematurely on initial timeout error

High Severity

When WaitForFirstRun times out, returning an error causes the defer collector.statusCollector.Close() to execute immediately, stopping all running probes. The previous implementation kept probes running even when the initial check failed, only closing them during OnStop. Now, if any probe takes longer than StatusCollectorProbeCheckTimeout (default 5 minutes) for its first run, the entire status collector becomes non-functional because all probes are terminated.

Fix in Cursor Fix in Web


<-ctx.Done()
return nil
}))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Health parameter received but never used for reporting

Medium Severity

The health cell.Health parameter passed to the job.OneShot function is never used. The old code explicitly called probeCheckHealth.Degraded() on failure and probeCheckHealth.OK() on success to report probe initialization status. The new code receives the health parameter but doesn't call any health methods, so the probe check health is never reported to the hive health system. The comment "Report health whether all probes have been executed at least once" remains, but the actual health reporting was omitted.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants