Skip to content

Conversation

@ArangoGutierrez
Copy link
Contributor

This fixes an issue where nfd-master would remove all NFD-managed labels from nodes when NodeFeature objects were missing from the informer cache. This commonly occurs in large clusters (1k+ nodes) during startup due to pagination failures (410 Expired errors) when syncing the informer cache.

The fix introduces a pending delete tracking mechanism:

  • When a NodeFeature is explicitly deleted (DeleteFunc), the node is marked as having a pending delete
  • When processing a node update, if no NodeFeature is found in cache:
    • If there's a pending delete marker: proceed with label removal
    • If no pending delete marker: skip label removal (cache miss)

Additionally, a 'processedOnce' mechanism ensures that label removal is only skipped for nodes that haven't been successfully processed before. This prevents regressions where labels from deleted NodeFeatureRules wouldn't be cleaned up.

Also includes:

  • Fallback to object name in DeleteFunc when node-name label is missing
  • Cleanup of stale pendingDeletes/processedOnce when nodes are deleted
  • Comprehensive unit and E2E tests

Fixes: #2408

This fixes an issue where nfd-master would remove all NFD-managed labels
from nodes when NodeFeature objects were missing from the informer cache.
This commonly occurs in large clusters (1k+ nodes) during startup due to
pagination failures (410 Expired errors) when syncing the informer cache.

The fix introduces a pending delete tracking mechanism:
- When a NodeFeature is explicitly deleted (DeleteFunc), the node is
  marked as having a pending delete
- When processing a node update, if no NodeFeature is found in cache:
  - If there's a pending delete marker: proceed with label removal
  - If no pending delete marker: skip label removal (cache miss)

Additionally, a 'processedOnce' mechanism ensures that label removal is
only skipped for nodes that haven't been successfully processed before.
This prevents regressions where labels from deleted NodeFeatureRules
wouldn't be cleaned up.

Also includes:
- Fallback to object name in DeleteFunc when node-name label is missing
- Cleanup of stale pendingDeletes/processedOnce when nodes are deleted
- Comprehensive unit and E2E tests

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ArangoGutierrez

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify
Copy link

netlify bot commented Jan 15, 2026

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
🔨 Latest commit 66535cc
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-nfd/deploys/6969667de634fc000827f976
😎 Deploy Preview https://deploy-preview-2415--kubernetes-sigs-nfd.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 15, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical issue where nfd-master would incorrectly remove all NFD-managed labels from nodes when NodeFeature objects were missing from the informer cache during startup, particularly in large clusters experiencing pagination failures (410 Expired errors).

Changes:

  • Introduces a pending delete tracking mechanism to distinguish between explicit NodeFeature deletions and cache misses
  • Adds a processedOnce tracking to ensure proper label cleanup for deleted NodeFeatureRules after initial processing
  • Implements cleanup of tracking data when nodes are deleted to prevent memory leaks

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/nfd-master/nfd-master.go Implements core tracking mechanisms (pendingDeletes, processedOnce) and logic to skip label removal on cache misses during first processing
pkg/nfd-master/nfd-api-controller.go Adds onDelete callback to mark nodes for pending delete when NodeFeature is deleted, with fallback to object name
pkg/nfd-master/updater-pool.go Adds cleanup of tracking data when nodes are not found to prevent memory leaks
pkg/nfd-master/nfd-master-internal_test.go Adds comprehensive unit tests for pending delete tracking and skipLabelRemoval functionality
test/e2e/node_feature_discovery_test.go Adds E2E test verifying labels are properly removed when NodeFeature is deleted via pending delete mechanism

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

hasFeatures := len(nodeFeatures.Spec.Labels) > 0 ||
len(nodeFeatures.Spec.Features.Attributes) > 0 ||
len(nodeFeatures.Spec.Features.Flags) > 0 ||
len(nodeFeatures.Spec.Features.Instances) > 0
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hasFeatures check doesn't account for taints or annotations. If a NodeFeature only has taints or annotations but no other features, hasFeatures will be false, which may incorrectly trigger the skipLabelRemoval logic. Consider including checks for len(nodeFeatures.Spec.Taints) and len(nodeFeatures.Spec.Annotations) in the hasFeatures condition.

Suggested change
len(nodeFeatures.Spec.Features.Instances) > 0
len(nodeFeatures.Spec.Features.Instances) > 0 ||
len(nodeFeatures.Spec.Taints) > 0 ||
len(nodeFeatures.Spec.Annotations) > 0

Copilot uses AI. Check for mistakes.
}
}

func TestPendingDeleteTracking(t *testing.T) {
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TestPendingDeleteTracking test does not verify thread-safety of the pending delete operations. Since pendingDeletesMu is used to protect concurrent access in the actual code, consider adding a test case that calls markPendingDelete and consumePendingDelete concurrently from multiple goroutines to verify the mutex protection works correctly.

Copilot uses AI. Check for mistakes.
})
}

func TestUpdateNodeObjectSkipLabelRemoval(t *testing.T) {
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test does not verify the behavior of skipLabelRemoval=true with annotations. The code in updateNodeObject skips removal of both labels and annotations when skipLabelRemoval is true, but the test only checks labels. Add test cases to verify that existing annotations are also not removed when skipLabelRemoval=true.

Copilot uses AI. Check for mistakes.
// By convention, nfd-worker creates NodeFeatures with name == nodeName.
// This may mark a wrong node for pending delete (for third-party
// NodeFeatures), but that's harmless—it just gets consumed without effect.
klog.V(2).InfoS("failed to get node name from label, falling back to object name",
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback logic assumes that marking the wrong node for pending delete is 'harmless', but this could lead to unexpected behavior if a NodeFeature deletion coincides with processing a different node with the same name. Consider logging this at a higher severity level (V(1) or InfoS) since this represents a potential inconsistency in the tracking mechanism that operators should be aware of.

Suggested change
klog.V(2).InfoS("failed to get node name from label, falling back to object name",
klog.InfoS("failed to get node name from label, falling back to object name",

Copilot uses AI. Check for mistakes.
@ArangoGutierrez
Copy link
Contributor Author

/test pull-node-feature-discovery-build-image-cross-generic

@ArangoGutierrez
Copy link
Contributor Author

/assign @marquiz

@ArangoGutierrez
Copy link
Contributor Author

/test pull-node-feature-discovery-build-image-cross-generic

2 similar comments
@ArangoGutierrez
Copy link
Contributor Author

/test pull-node-feature-discovery-build-image-cross-generic

@ArangoGutierrez
Copy link
Contributor Author

/test pull-node-feature-discovery-build-image-cross-generic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Node labels removed and re-added on nfd-master restart

3 participants