topology-updater: Track new resources not present during startup #1897
base: master
Conversation
Hi @ozhuraki. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
Could you please take a look at whether this is the right place to track new resources?
I don't think that's the right place to fix it, as a later call to …
Force-pushed 001c2e7 to 230fc67
Thanks for the help. I looked into it more; would something like this account for updateAvailable()?
Signed-off-by: Oleg Zhurakivskyy <[email protected]>
Force-pushed 230fc67 to 0c8a8fa
Apologies for the delay. Thanks, updated; could you please take another look?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@PiotrProkop could you take a look?
/ok-to-test
@ozhuraki please fix the linter error(s)
@ozhuraki please add a unit test for this functionality. I'll try to review more in depth today/tomorrow.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
/reopen
@marquiz: Reopened this PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ozhuraki. Needs approval from an approver in each of the affected OWNERS files; approvers can indicate their approval by writing /approve in a comment.
@ozhuraki: The following tests failed; say /retest to rerun all failed tests. Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
@ozhuraki any update on this one?
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Pull request overview
This PR aims to fix an issue where the NFD topology-updater doesn't track new resources from device plugins that register with kubelet after the updater has already started. The change adds logic to dynamically track and map new resources to NUMA nodes when they are first encountered during the available resource update process.
Changes:
- Adds dynamic resource tracking in updateAvailable to handle device plugins that register after startup
- Creates resource-to-NUMA-node mappings on the fly when new resources are detected (a rough sketch follows below)
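As a rough illustration of the approach described above: when updateAvailable sees a resource name with no entry yet in resourceID2NUMAID, a mapping can be built on the fly from the reported device IDs and NUMA node IDs. This is a minimal, self-contained sketch, not the PR's actual code; the resourceInfo and nodeResources types are simplified stand-ins, and only the names resourceID2NUMAID, Data, NumaNodeIds, resMap, and resID come from the diff context quoted in the review.

```go
package main

import "fmt"

// resourceInfo mimics the per-resource data reported for a running pod.
type resourceInfo struct {
	Name        string
	Data        []string // device IDs, e.g. "dev0", "dev1"
	NumaNodeIds []int
}

// nodeResources mimics the aggregator's cached state.
type nodeResources struct {
	// resourceID2NUMAID maps resource name -> device ID -> NUMA node ID.
	resourceID2NUMAID map[string]map[string]int
}

// trackNewResource registers devices of a resource that was not seen at startup.
func (nr *nodeResources) trackNewResource(ri resourceInfo) {
	resMap, ok := nr.resourceID2NUMAID[ri.Name]
	if !ok {
		resMap = make(map[string]int)
		nr.resourceID2NUMAID[ri.Name] = resMap
	}
	for _, numaNodeID := range ri.NumaNodeIds {
		for _, resID := range ri.Data {
			// Only record a device the first time it is seen; see the
			// review discussion below about device-to-node pairing.
			if _, exists := resMap[resID]; !exists {
				resMap[resID] = numaNodeID
			}
		}
	}
}

func main() {
	nr := nodeResources{resourceID2NUMAID: map[string]map[string]int{}}
	nr.trackNewResource(resourceInfo{
		Name:        "vendor.com/device",
		Data:        []string{"dev0", "dev1"},
		NumaNodeIds: []int{0},
	})
	fmt.Println(nr.resourceID2NUMAID)
}
```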
Comments suppressed due to low confidence (1)
pkg/resourcemonitor/noderesourcesaggregator.go:281
- This line will panic with a nil pointer dereference if numaData[nodeID][ri.Name] doesn't exist, which is the exact scenario this PR is trying to handle (new resources not present during startup). Even if the resourceID2NUMAID mapping is successfully created in the new code block, the corresponding entry in numaData must also exist before decrementing available. The new resource tracking logic is incomplete without also initializing the resource entry in the numaData structure.
numaData[nodeID][ri.Name].available--
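For illustration, a minimal sketch of the guard this comment asks for, assuming a simplified resourceData type (the real NFD structure may differ): the per-NUMA entry is created before available is decremented, so a resource first seen after startup does not trigger a nil-pointer panic.

```go
package main

import "fmt"

type resourceData struct {
	allocatable int64
	available   int64
	capacity    int64
}

func main() {
	// numaData mirrors the shape implied by the quoted line:
	// NUMA node ID -> resource name -> counters.
	numaData := map[int]map[string]*resourceData{0: {}}
	nodeID, resName := 0, "vendor.com/device"

	// Without this initialization, numaData[nodeID][resName] is nil for a
	// resource that appeared after startup, and the decrement would panic.
	if _, ok := numaData[nodeID][resName]; !ok {
		numaData[nodeID][resName] = &resourceData{allocatable: 1, available: 1, capacity: 1}
	}
	numaData[nodeID][resName].available--

	fmt.Printf("%+v\n", *numaData[nodeID][resName])
}
```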
    continue
}

for _, resID := range ri.Data {
Copilot AI · Jan 15, 2026
The nested loop structure incorrectly maps all resource IDs to each NUMA node in sequence, causing the last NUMA node to overwrite previous mappings. If ri.NumaNodeIds contains multiple nodes [0, 1] and ri.Data contains ['dev0', 'dev1'], both devices will incorrectly be mapped only to the last NUMA node (1). The ResourceInfo structure doesn't provide device-to-node pairing, so this approach cannot correctly establish the mapping. Consider whether the ResourceInfo needs to be enhanced to include explicit device-to-NUMA-node pairing, or if the mapping logic needs to rely on the original topology information from the Pod Resources API.
Suggested change:
for _, resID := range ri.Data {
    // Only set the mapping if this resource ID has not been mapped yet.
    if _, exists := resMap[resID]; exists {
        continue
    }
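A standalone toy example of the overwrite described above (all names here are invented for illustration): iterating the NUMA node IDs in the outer loop means the last node wins for every device ID, so per-device placement is lost.

```go
package main

import "fmt"

func main() {
	numaNodeIDs := []int{0, 1}            // plays the role of ri.NumaNodeIds
	deviceIDs := []string{"dev0", "dev1"} // plays the role of ri.Data

	mapping := map[string]int{}
	for _, node := range numaNodeIDs {
		for _, dev := range deviceIDs {
			mapping[dev] = node // each outer pass overwrites the previous node
		}
	}
	// Both devices end up mapped to NUMA node 1, regardless of where they live.
	fmt.Println(mapping) // map[dev0:1 dev1:1]
}
```

Note that even with the "first mapping wins" guard from the suggestion, both devices would simply land on the first listed NUMA node, which is why the comment questions whether ResourceInfo can express the device-to-node pairing at all.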
resMap, ok := noderesourceData.resourceID2NUMAID[resName]
if !ok {
Copilot AI · Jan 15, 2026
This lookup is redundant since resMap was already retrieved (and potentially populated) at lines 243-263. The variable resMap is still in scope and can be reused directly, eliminating the need for this duplicate lookup.
if _, ok := numaData[numaNodeID][ri.Name]; !ok {
    klog.InfoS("failed to find resource under the node topology", "resourceName", ri.Name)
    continue
}
Copilot AI · Jan 15, 2026
This check verifies if the resource exists in numaData before adding it to the resourceID2NUMAID map, but the resource may legitimately not exist yet in numaData if this is a newly registered device plugin. The check at line 252 will cause the function to skip adding new resources to the map, which defeats the purpose of this PR. The perNuma/numaData structure is built from perNUMAAllocatable which is initialized only at startup from GetAllocatableResources, so new resources won't be present there until the aggregator is recreated. This condition should be reconsidered or removed.
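A rough sketch of the direction this comment points at, using simplified stand-in types rather than the PR's or NFD's actual code: when a resource is missing from numaData, initialize both the per-NUMA accounting entry and the ID-to-node mapping instead of skipping it, so that devices from late-registering plugins are accounted for.

```go
package main

import "fmt"

type resourceData struct {
	allocatable int64
	available   int64
	capacity    int64
}

func main() {
	numaData := map[int]map[string]*resourceData{0: {}}
	resourceID2NUMAID := map[string]map[string]int{}

	numaNodeID, resName, devID := 0, "vendor.com/device", "dev0"

	// Instead of `continue`, create the NUMA-level accounting entry for the
	// newly seen resource...
	if _, ok := numaData[numaNodeID][resName]; !ok {
		numaData[numaNodeID][resName] = &resourceData{allocatable: 1, available: 1, capacity: 1}
	}
	// ...and record which NUMA node the device ID belongs to.
	if _, ok := resourceID2NUMAID[resName]; !ok {
		resourceID2NUMAID[resName] = map[string]int{}
	}
	resourceID2NUMAID[resName][devID] = numaNodeID

	fmt.Println(resourceID2NUMAID, *numaData[numaNodeID][resName])
}
```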
If a device plugin registers itself with kubelet after the NFD topology-updater is already running, the NRT will never be updated with the new resources.
Closes: #1424