status: use hive jobs for probes #4

Open
MitchLewis930 wants to merge 1 commit into pr_044_before from pr_044_after


@MitchLewis930 MitchLewis930 commented Jan 30, 2026

User description

PR_044


PR Type

Enhancement


Description

  • Refactor probe initialization to use Hive Job instead of lifecycle hook

  • Replace raw Go routine with managed job execution for probe checks

  • Simplify health reporting by removing Health scope dependency

  • Improve context management with proper timeout handling


Diagram Walkthrough

flowchart LR
  A["Lifecycle Hook"] -->|"replaced by"| B["Hive Job OneShot"]
  B -->|"manages"| C["Probe Initialization"]
  C -->|"waits for"| D["First Probe Run"]
  D -->|"sets flag"| E["allProbesInitialized"]
  F["Raw Go Routine"] -->|"removed"| B
  G["Health Scope"] -->|"removed"| B

File Walkthrough

Relevant files
Enhancement
cell.go
Convert probe initialization to Hive Job pattern                 

pkg/status/cell.go

  • Added imports for context, fmt, and job packages
  • Replaced cell.Health dependency with job.Group in statusParams
  • Implemented probe initialization as a Hive Job OneShot instead of
    lifecycle hook
  • Job handles probe startup, timeout management, and health reporting
  • Removed OnStart hook and simplified OnStop hook by removing Close()
    call
+27/-7   
status_collector.go
Remove legacy probe startup implementation                             

pkg/status/status_collector.go

  • Removed startStatusCollector() method entirely
  • Eliminated raw Go routine for probe health checking
  • Removed Health scope creation and reporting logic
  • Probe initialization logic now managed by Hive Job in cell.go
+0/-25   

This commit refactors the probe initialization to use a Hive Job instead
of a plain lifecycle start hook. This also lets us get rid of the raw
goroutine that checked whether every probe had executed successfully at
least once before exposing the status.

Note: the kvstore "shutdown check" is still part of its own lifecycle
stop hook. Probably better to eventually move this to the kvstore module.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
@qodo-code-review

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified: No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit context: The new probe-start flow adds debug logs but does not include any audit-relevant context
(e.g., actor/user identity) which may be required if probe/job execution is considered a
critical action in this system.

Referred Code
params.JobGroup.Add(job.OneShot("probes", func(ctx context.Context, health cell.Health) error {
	params.Logger.Debug("Starting probes")
	collector.statusCollector.StartProbes(collector.getProbes())
	defer collector.statusCollector.Close()
	params.Logger.Debug("Successfully started probes")

	waitCtx, cancelWait := context.WithTimeout(ctx, params.Config.StatusCollectorProbeCheckTimeout)
	defer cancelWait()

	// Report health whether all probes have been executed at least once.
	if err := collector.statusCollector.WaitForFirstRun(waitCtx); err != nil {
		params.Logger.Debug("Not all probes successfully executed at least once")
		return fmt.Errorf("not all probes successfully executed at least once: %w", err)
	}

	collector.allProbesInitialized = true

	params.Logger.Debug("All probes executed at least once")


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Error exposure unclear: The job returns a detailed internal error (not all probes successfully executed at least
once: %w) and without visibility into how job errors are surfaced, it is unclear whether
this message could reach user-facing outputs.

Referred Code
if err := collector.statusCollector.WaitForFirstRun(waitCtx); err != nil {
	params.Logger.Debug("Not all probes successfully executed at least once")
	return fmt.Errorf("not all probes successfully executed at least once: %w", err)
}


Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
Possible issue
Return context error on shutdown

In the OneShot job, return ctx.Err() after the context is done to correctly
propagate the shutdown cause, instead of returning nil.

pkg/status/cell.go [137-138]

 		<-ctx.Done()
-		return nil
+		return ctx.Err()
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that returning ctx.Err() instead of nil aligns with the hive/job contract, ensuring proper error propagation during shutdown.

Medium
General
Report health on probe init

Use the injected health parameter to report the status of probe initialization,
calling health.Degraded() on failure and health.OK() on success.

pkg/status/cell.go [128-135]

 if err := collector.statusCollector.WaitForFirstRun(waitCtx); err != nil {
     params.Logger.Debug("Not all probes successfully executed at least once")
+    health.Degraded("Probe initialization failed", err)
     return fmt.Errorf("not all probes successfully executed at least once: %w", err)
 }
+health.OK("All probes executed at least once")
 collector.allProbesInitialized = true
 params.Logger.Debug("All probes executed at least once")
Suggestion importance[1-10]: 7


Why: This suggestion correctly identifies that health reporting was omitted in the refactoring and proposes re-adding it, which restores important observability functionality.

Medium