@ArangoGutierrez
Contributor

What this PR does / why we need it

This fixes an issue where nfd-topology-updater pods crashed repeatedly (CrashLoopBackOff) with an unhelpful error message when the NodeResourceTopology CRD was not installed.

Changes

  • Add waitForNodeResourceTopologyCRD() that waits with exponential backoff (5s to 60s max) for the CRD to become available
  • Move HTTP server startup earlier so health probes pass during CRD wait
  • Add RBAC permission for topology-updater to read CRDs
  • Handle Forbidden errors gracefully (skip wait if no permission)
  • Improve error message when NRT creation fails due to missing CRD
  • Update documentation with CRD requirements section
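
For reviewers who want a feel for the retry timing, here is a minimal sketch of a capped exponential backoff wait. It is not the code in this PR: `backoffSchedule`, `waitForCondition`, `sleepCtx` and `simulateWait` are hypothetical names, the doubling factor is inferred from the 5s→10s→...→60s progression, and the CRD lookup is replaced by a plain callback.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backoffSchedule returns the first n delays, doubling from initial and
// capped at max. Hypothetical helper, used only to show the timing.
func backoffSchedule(initial, max time.Duration, n int) []time.Duration {
	delays := make([]time.Duration, 0, n)
	d := initial
	for i := 0; i < n; i++ {
		delays = append(delays, d)
		d *= 2
		if d > max {
			d = max
		}
	}
	return delays
}

// waitForCondition calls check until it reports true, check returns an
// error, or sleep fails (for example because the context was cancelled).
// In the real updater, check would ask the API server for the CRD.
func waitForCondition(ctx context.Context, initial, max time.Duration,
	check func() (bool, error),
	sleep func(context.Context, time.Duration) error) error {
	delay := initial
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if err := sleep(ctx, delay); err != nil {
			return err
		}
		delay *= 2
		if delay > max {
			delay = max
		}
	}
}

// sleepCtx is a context-aware sleep suitable for passing to waitForCondition.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// simulateWait runs the loop with a fake sleep that only records delays.
// The simulated CRD becomes available on check number readyOn.
func simulateWait(readyOn int) (int, []time.Duration, error) {
	checks := 0
	var slept []time.Duration
	check := func() (bool, error) {
		checks++
		return checks >= readyOn, nil
	}
	fakeSleep := func(_ context.Context, d time.Duration) error {
		slept = append(slept, d)
		return nil
	}
	err := waitForCondition(context.Background(), 5*time.Second, 60*time.Second, check, fakeSleep)
	return checks, slept, err
}

func main() {
	fmt.Println(backoffSchedule(5*time.Second, 60*time.Second, 6))
	checks, slept, err := simulateWait(3)
	fmt.Println(checks, slept, err)
}
```

Passing the sleep function as a parameter keeps the loop testable without real delays; `sleepCtx` is what a production caller would plausibly pass so that shutdown cancels the wait.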

How it works

Pod starts → HTTP server starts (health OK) → Check CRD
                                                   ↓
                                            CRD missing?
                                            ↙         ↘
                                         Yes           No
                                          ↓             ↓
                                   Wait & retry    Continue normal operation
                                   (5s→10s→...→60s)

This handles:

  • Race conditions during deployment where the CRD is created after topology-updater starts
  • Setups where a cluster administrator installs the CRD separately
  • Graceful degradation when RBAC permissions are missing
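
The ordering in the flow above (serve health checks first, then block on the CRD) can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: the `/healthz` path, the function names `startHealthServer` and `probeDuringWait`, and the 200ms stand-in for the CRD wait are all hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// startHealthServer binds addr, serves a health endpoint in the background
// and returns the server plus the address it actually bound to.
// The "/healthz" path is an assumption made for this sketch.
func startHealthServer(addr string) (*http.Server, string, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, "", err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := &http.Server{Handler: mux}
	go func() { _ = srv.Serve(ln) }()
	return srv, ln.Addr().String(), nil
}

// probeDuringWait starts the health server, begins a simulated CRD wait,
// and probes the health endpoint while that wait is still in progress.
func probeDuringWait() (int, error) {
	srv, addr, err := startHealthServer("127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer srv.Close()

	// Stand-in for the CRD wait: the channel closes once the "CRD" exists.
	crdReady := make(chan struct{})
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(crdReady)
	}()

	// A probe issued now is answered even though initialization is blocked.
	resp, err := http.Get("http://" + addr + "/healthz")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	<-crdReady // normal operation would continue from here
	return resp.StatusCode, nil
}

func main() {
	status, err := probeDuringWait()
	fmt.Println(status, err)
}
```

Because the listener is bound before the wait begins, the kubelet's probes get a response during the whole backoff period, which is what keeps the pod out of CrashLoopBackOff.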

Which issue(s) this PR fixes

Fixes #2145

Special notes for your reviewer

  • The HTTP server is now started earlier in the initialization sequence so that health probes pass while waiting for the CRD
  • RBAC permissions to read CRDs are added to both Helm chart and kustomize base
  • If the ServiceAccount lacks permission to check CRDs (Forbidden error), the wait is skipped; if the CRD is in fact missing, the subsequent NRT creation fails with a descriptive error
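
The decision described in the notes above can be summarized as a small classification of the CRD lookup result. The sketch below uses stand-in sentinel errors so that it runs without Kubernetes dependencies; real code would more likely test the error with `apierrors.IsNotFound` and `apierrors.IsForbidden` from `k8s.io/apimachinery/pkg/api/errors`. The names `classifyCRDCheck` and `crdAction` are hypothetical, and treating any other error as fatal is an assumption of this sketch, not something stated in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the API server's NotFound and Forbidden responses.
var (
	errNotFound  = errors.New("customresourcedefinition not found")
	errForbidden = errors.New("forbidden: cannot get customresourcedefinitions")
)

// crdAction is what the startup code does after one CRD lookup.
type crdAction int

const (
	proceed  crdAction = iota // CRD exists: continue normal operation
	retry                     // CRD missing: back off and check again
	skipWait                  // no permission to read CRDs: stop waiting
	fail                      // anything else (assumed fatal in this sketch)
)

// classifyCRDCheck maps the error from a CRD lookup to the next action.
func classifyCRDCheck(err error) crdAction {
	switch {
	case err == nil:
		return proceed
	case errors.Is(err, errNotFound):
		return retry
	case errors.Is(err, errForbidden):
		return skipWait
	default:
		return fail
	}
}

func main() {
	// Wrapped errors are still recognized because errors.Is unwraps them.
	wrapped := fmt.Errorf("checking CRD: %w", errForbidden)
	fmt.Println(classifyCRDCheck(nil) == proceed)
	fmt.Println(classifyCRDCheck(errNotFound) == retry)
	fmt.Println(classifyCRDCheck(wrapped) == skipWait)
}
```

Skipping the wait on Forbidden keeps older RBAC setups working: the updater behaves as it did before this change, and only the later NRT creation reports the missing CRD.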

Does this PR introduce a user-facing change?

nfd-topology-updater now waits for the NodeResourceTopology CRD to be available at startup instead of crashing, handling deployment race conditions gracefully.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 16, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ArangoGutierrez

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 16, 2026
@netlify

netlify bot commented Jan 16, 2026

👷 Deploy Preview for kubernetes-sigs-nfd processing.

Name Link
🔨 Latest commit ff78d4b
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-nfd/deploys/696a6df64601b8000872923c

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 16, 2026
@ArangoGutierrez ArangoGutierrez force-pushed the fix/2145-topology-updater-crd-check branch from 5a3a51b to f3fd84a Compare January 16, 2026 16:30
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 16, 2026
@ArangoGutierrez ArangoGutierrez force-pushed the fix/2145-topology-updater-crd-check branch from f3fd84a to 378460c Compare January 16, 2026 16:37
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2026
@ArangoGutierrez ArangoGutierrez force-pushed the fix/2145-topology-updater-crd-check branch from 378460c to f18588f Compare January 16, 2026 16:44
This fixes an issue where nfd-topology-updater pods would crash repeatedly
(CrashLoopBackOff) when the NodeResourceTopology CRD was not installed,
providing an unhelpful error message.

Changes:
- Add waitForNodeResourceTopologyCRD() that waits with exponential backoff
  (5s to 60s max) for the CRD to become available
- Move HTTP server startup earlier so health probes pass during CRD wait
- Add RBAC permission for topology-updater to read CRDs
- Handle Forbidden errors gracefully (skip wait if no permission)
- Update documentation with CRD requirements section

This handles race conditions during deployment and scenarios where the
CRD is installed after the topology-updater pods start.
@ArangoGutierrez ArangoGutierrez force-pushed the fix/2145-topology-updater-crd-check branch from f18588f to ff78d4b Compare January 16, 2026 16:57
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Topology updater is failing to collect NUMA information

2 participants