Skip to content

Comments

Add node neighbor link updater controller#9

Open
MitchLewis930 wants to merge 1 commit intopr_049_beforefrom
pr_049_after
Open

Add node neighbor link updater controller#9
MitchLewis930 wants to merge 1 commit intopr_049_beforefrom
pr_049_after

Conversation

@MitchLewis930
Copy link

PR_049

This controller is responsible for monitoring nodes that request a neighbor
link update.
It provides a single place in the application where all node neighbor link
are updates are performed as insertNeighbor are sprinkled in the code base.
Additionally given the central location we can provide additional
module health signals should neighbor link updates failed.

Signed-off-by: Fernand Galiana <fernand.galiana@gmail.com>
@greptile-apps
Copy link

greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

This PR introduces a centralized queue-based controller for managing node neighbor link updates. The implementation replaces direct asynchronous insertNeighbor() calls with a queue system that processes node neighbor updates at a 10-second interval, providing better error reporting and health monitoring through the Hive cell health system.

Key Changes:

  • Added generic queue implementation (pkg/node/manager/queue.go) for managing node neighbor link update requests
  • Created StartNodeNeighborLinkUpdater() controller that processes queued nodes every 10 seconds
  • Refactored NodeNeighborRefresh() to accept a refresh parameter for better control over neighbor updates
  • Modified StartNeighborRefresh() to enqueue nodes instead of directly calling insertNeighbor()
  • Improved error handling in insertNeighbor() family of functions to return accumulated errors

Critical Issues Found:

  • Race condition in queue.pop(): The check for isEmpty() and subsequent pop() are not atomic, which can cause a panic when accessing an empty slice
  • Missing return in Enqueue(): When a nil node is detected, the function logs a warning but continues to enqueue it, leading to errors in the controller
  • Inefficient queue processing: The controller breaks on the first nil entry instead of draining all valid items from the queue, potentially creating a backlog with the 10-second interval

Confidence Score: 2/5

  • This PR has critical bugs that will cause runtime panics and poor performance
  • The race condition in queue.pop() can cause panics in production. The missing return in Enqueue() allows nil nodes to be queued, and the controller's break logic means only one node is processed per 10-second interval, creating significant performance issues. These must be fixed before merging.
  • pkg/node/manager/queue.go and pkg/node/manager/manager.go require immediate attention for critical bugs

Important Files Changed

Filename Overview
pkg/node/manager/queue.go New generic queue implementation with critical race condition in pop() method
pkg/node/manager/manager.go Added node neighbor link updater controller with queue-based processing; missing return in Enqueue() and inefficient loop logic in controller
pkg/node/manager/queue_test.go Added basic queue tests covering push/pop operations; tests don't cover concurrent access scenarios
pkg/datapath/linux/node.go Refactored neighbor insertion to use queue-based approach; improved error handling and reporting

Sequence Diagram

sequenceDiagram
    participant Daemon
    participant NodeManager
    participant Queue
    participant Controller
    participant NodeNeighbors
    participant NetLink

    Note over Daemon: Daemon startup
    Daemon->>NodeManager: StartNodeNeighborLinkUpdater(nh)
    NodeManager->>Controller: Create controller with RunInterval=10s
    
    Note over NodeManager: Node update triggered
    NodeManager->>NodeManager: nodeUpdate() or StartNeighborRefresh()
    NodeManager->>Queue: Enqueue(node, refresh)
    Queue-->>Queue: push(nodeQueueEntry)
    
    Note over Controller: Every 10 seconds
    Controller->>Queue: isEmpty()
    alt Queue has items
        loop Process all nodes
            Controller->>Queue: pop()
            Queue-->>Controller: nodeQueueEntry
            Controller->>NodeNeighbors: NodeNeighborRefresh(ctx, node, refresh)
            NodeNeighbors->>NetLink: insertNeighbor4/6()
            NetLink-->>NodeNeighbors: result
            alt Success
                NodeNeighbors-->>Controller: nil
                Controller->>Controller: Report health: OK
            else Error
                NodeNeighbors-->>Controller: error
                Controller->>Controller: Report health: Degraded
            end
        end
    else Queue empty
        Controller-->>Controller: return nil
    end
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +143 to +149
func (m *manager) Enqueue(n *nodeTypes.Node, refresh bool) {
if n == nil {
log.WithFields(logrus.Fields{
logfields.LogSubsys: "enqueue",
}).Warn("Skipping nodeNeighbor insert: No node given")
}
m.nodeNeighborQueue.push(&nodeQueueEntry{node: n, refresh: refresh})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing return after warning - when n == nil, the function logs a warning but continues to enqueue a nil node, which will cause errors in the controller.

Suggested change
func (m *manager) Enqueue(n *nodeTypes.Node, refresh bool) {
if n == nil {
log.WithFields(logrus.Fields{
logfields.LogSubsys: "enqueue",
}).Warn("Skipping nodeNeighbor insert: No node given")
}
m.nodeNeighborQueue.push(&nodeQueueEntry{node: n, refresh: refresh})
func (m *manager) Enqueue(n *nodeTypes.Node, refresh bool) {
if n == nil {
log.WithFields(logrus.Fields{
logfields.LogSubsys: "enqueue",
}).Warn("Skipping nodeNeighbor insert: No node given")
return
}
m.nodeNeighborQueue.push(&nodeQueueEntry{node: n, refresh: refresh})
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: pkg/node/manager/manager.go
Line: 143:149

Comment:
Missing `return` after warning - when `n == nil`, the function logs a warning but continues to enqueue a nil node, which will cause errors in the controller.

```suggestion
func (m *manager) Enqueue(n *nodeTypes.Node, refresh bool) {
	if n == nil {
		log.WithFields(logrus.Fields{
			logfields.LogSubsys: "enqueue",
		}).Warn("Skipping nodeNeighbor insert: No node given")
		return
	}
	m.nodeNeighborQueue.push(&nodeQueueEntry{node: n, refresh: refresh})
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +875 to +880
for {
e := m.nodeNeighborQueue.pop()
if e == nil || e.node == nil {
errs = errors.Join(errs, fmt.Errorf("invalid node spec found in queue: %#v", e))
break
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller processes only one item per interval - the loop breaks immediately when pop() returns nil, but this should drain all items in the queue. With defaultNodeUpdateInterval of 10 seconds, this will only process one node every 10 seconds, creating a backlog.

Change break to continue to skip invalid entries without stopping queue processing:

Suggested change
for {
e := m.nodeNeighborQueue.pop()
if e == nil || e.node == nil {
errs = errors.Join(errs, fmt.Errorf("invalid node spec found in queue: %#v", e))
break
}
for {
e := m.nodeNeighborQueue.pop()
if e == nil {
break
}
if e.node == nil {
errs = errors.Join(errs, fmt.Errorf("invalid node spec found in queue: %#v", e))
continue
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: pkg/node/manager/manager.go
Line: 875:880

Comment:
Controller processes only one item per interval - the loop breaks immediately when `pop()` returns `nil`, but this should drain all items in the queue. With `defaultNodeUpdateInterval` of 10 seconds, this will only process one node every 10 seconds, creating a backlog.

Change `break` to `continue` to skip invalid entries without stopping queue processing:

```suggestion
				for {
					e := m.nodeNeighborQueue.pop()
					if e == nil {
						break
					}
					if e.node == nil {
						errs = errors.Join(errs, fmt.Errorf("invalid node spec found in queue: %#v", e))
						continue
					}
```

How can I resolve this? If you propose a fix, please make it concise.

@greptile-apps
Copy link

greptile-apps bot commented Jan 30, 2026

Additional Comments (1)

pkg/node/manager/queue.go
Race condition: isEmpty() check at line 28 and lock acquisition at line 32 are not atomic. Between checking isEmpty() (which releases the read lock) and acquiring the write lock, another goroutine could pop the last item, causing a panic when accessing q.items[0] on an empty slice.

func (q *queue[T]) pop() *T {
	q.mx.Lock()
	defer q.mx.Unlock()
	
	if len(q.items) == 0 {
		return nil
	}
	
	t := q.items[0]
	q.items = q.items[1:]

	return t
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: pkg/node/manager/queue.go
Line: 27:38

Comment:
Race condition: `isEmpty()` check at line 28 and lock acquisition at line 32 are not atomic. Between checking `isEmpty()` (which releases the read lock) and acquiring the write lock, another goroutine could pop the last item, causing a panic when accessing `q.items[0]` on an empty slice.

```suggestion
func (q *queue[T]) pop() *T {
	q.mx.Lock()
	defer q.mx.Unlock()
	
	if len(q.items) == 0 {
		return nil
	}
	
	t := q.items[0]
	q.items = q.items[1:]

	return t
}
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants