-
Notifications
You must be signed in to change notification settings - Fork 297
nfd-master: prevent label removal on incomplete cache sync #2415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
nfd-master: prevent label removal on incomplete cache sync #2415
Conversation
This fixes an issue where nfd-master would remove all NFD-managed labels from nodes when NodeFeature objects were missing from the informer cache. This commonly occurs in large clusters (1k+ nodes) during startup due to pagination failures (410 Expired errors) when syncing the informer cache. The fix introduces a pending delete tracking mechanism: - When a NodeFeature is explicitly deleted (DeleteFunc), the node is marked as having a pending delete - When processing a node update, if no NodeFeature is found in cache: - If there's a pending delete marker: proceed with label removal - If no pending delete marker: skip label removal (cache miss) Additionally, a 'processedOnce' mechanism ensures that label removal is only skipped for nodes that haven't been successfully processed before. This prevents regressions where labels from deleted NodeFeatureRules wouldn't be cleaned up. Also includes: - Fallback to object name in DeleteFunc when node-name label is missing - Cleanup of stale pendingDeletes/processedOnce when nodes are deleted - Comprehensive unit and E2E tests Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ArangoGutierrez The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✅ Deploy Preview for kubernetes-sigs-nfd ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical issue where nfd-master would incorrectly remove all NFD-managed labels from nodes when NodeFeature objects were missing from the informer cache during startup, particularly in large clusters experiencing pagination failures (410 Expired errors).
Changes:
- Introduces a pending delete tracking mechanism to distinguish between explicit NodeFeature deletions and cache misses
- Adds a processedOnce tracking to ensure proper label cleanup for deleted NodeFeatureRules after initial processing
- Implements cleanup of tracking data when nodes are deleted to prevent memory leaks
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/nfd-master/nfd-master.go | Implements core tracking mechanisms (pendingDeletes, processedOnce) and logic to skip label removal on cache misses during first processing |
| pkg/nfd-master/nfd-api-controller.go | Adds onDelete callback to mark nodes for pending delete when NodeFeature is deleted, with fallback to object name |
| pkg/nfd-master/updater-pool.go | Adds cleanup of tracking data when nodes are not found to prevent memory leaks |
| pkg/nfd-master/nfd-master-internal_test.go | Adds comprehensive unit tests for pending delete tracking and skipLabelRemoval functionality |
| test/e2e/node_feature_discovery_test.go | Adds E2E test verifying labels are properly removed when NodeFeature is deleted via pending delete mechanism |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| hasFeatures := len(nodeFeatures.Spec.Labels) > 0 || | ||
| len(nodeFeatures.Spec.Features.Attributes) > 0 || | ||
| len(nodeFeatures.Spec.Features.Flags) > 0 || | ||
| len(nodeFeatures.Spec.Features.Instances) > 0 |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hasFeatures check doesn't account for taints or annotations. If a NodeFeature only has taints or annotations but no other features, hasFeatures will be false, which may incorrectly trigger the skipLabelRemoval logic. Consider including checks for len(nodeFeatures.Spec.Taints) and len(nodeFeatures.Spec.Annotations) in the hasFeatures condition.
| len(nodeFeatures.Spec.Features.Instances) > 0 | |
| len(nodeFeatures.Spec.Features.Instances) > 0 || | |
| len(nodeFeatures.Spec.Taints) > 0 || | |
| len(nodeFeatures.Spec.Annotations) > 0 |
| } | ||
| } | ||
|
|
||
| func TestPendingDeleteTracking(t *testing.T) { |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TestPendingDeleteTracking test does not verify thread-safety of the pending delete operations. Since pendingDeletesMu is used to protect concurrent access in the actual code, consider adding a test case that calls markPendingDelete and consumePendingDelete concurrently from multiple goroutines to verify the mutex protection works correctly.
| }) | ||
| } | ||
|
|
||
| func TestUpdateNodeObjectSkipLabelRemoval(t *testing.T) { |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test does not verify the behavior of skipLabelRemoval=true with annotations. The code in updateNodeObject skips removal of both labels and annotations when skipLabelRemoval is true, but the test only checks labels. Add test cases to verify that existing annotations are also not removed when skipLabelRemoval=true.
| // By convention, nfd-worker creates NodeFeatures with name == nodeName. | ||
| // This may mark a wrong node for pending delete (for third-party | ||
| // NodeFeatures), but that's harmless—it just gets consumed without effect. | ||
| klog.V(2).InfoS("failed to get node name from label, falling back to object name", |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback logic assumes that marking the wrong node for pending delete is 'harmless', but this could lead to unexpected behavior if a NodeFeature deletion coincides with processing a different node with the same name. Consider logging this at a higher severity level (V(1) or InfoS) since this represents a potential inconsistency in the tracking mechanism that operators should be aware of.
| klog.V(2).InfoS("failed to get node name from label, falling back to object name", | |
| klog.InfoS("failed to get node name from label, falling back to object name", |
|
/test pull-node-feature-discovery-build-image-cross-generic |
|
/assign @marquiz |
|
/test pull-node-feature-discovery-build-image-cross-generic |
2 similar comments
|
/test pull-node-feature-discovery-build-image-cross-generic |
|
/test pull-node-feature-discovery-build-image-cross-generic |
This fixes an issue where nfd-master would remove all NFD-managed labels from nodes when NodeFeature objects were missing from the informer cache. This commonly occurs in large clusters (1k+ nodes) during startup due to pagination failures (410 Expired errors) when syncing the informer cache.
The fix introduces a pending delete tracking mechanism:
Additionally, a 'processedOnce' mechanism ensures that label removal is only skipped for nodes that haven't been successfully processed before. This prevents regressions where labels from deleted NodeFeatureRules wouldn't be cleaned up.
Also includes:
Fixes: #2408