
Conversation

@Xunli-Yang

Add KEP: NFD image compatibility scheduler proposal.

What this proposal does:
Building upon the first phase of the KEP-1845 proposal, which completed node compatibility validation, this proposal introduces a compatibility scheduling plugin. The plugin automatically analyzes the compatibility requirements of container images, filters suitable nodes for scheduling, and ensures that containers run on compatible nodes.

Special notes for reviewers:
Based on discussions in the node-feature-discovery Slack channel, this proposal presents three solutions and aims to reach consensus on the implementation direction.
Co-authored-by: @ChaoyiHuang

@netlify

netlify bot commented Dec 27, 2025

Deploy Preview for kubernetes-sigs-nfd ready!

🔨 Latest commit: f6ebba8
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-nfd/deploys/695487f811e4f1000850980d
😎 Deploy Preview: https://deploy-preview-2403--kubernetes-sigs-nfd.netlify.app

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2025
@k8s-ci-robot
Contributor

Hi @Xunli-Yang. Thanks for your PR.

I'm waiting for an org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 27, 2025
@ArangoGutierrez ArangoGutierrez self-assigned this Dec 27, 2025

Copilot AI left a comment


Pull request overview

This PR adds KEP-2403, which proposes a compatibility scheduler plugin for Node Feature Discovery (NFD). Building on KEP-1845 (which established node compatibility validation), this proposal introduces automated scheduling capabilities to ensure pods are scheduled on nodes compatible with their container image requirements.

Key changes:

  • Introduces three alternative solution designs for implementing image compatibility scheduling
  • Proposes an ImageCompatibilityPlugin that leverages NodeFeatureGroup CRs to filter compatible nodes
  • Presents performance tradeoffs from basic validation (Solution 1) to optimized large-scale approaches (Solutions 2 and 3)

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated 26 comments.

  • enhancements/2403-nfd-image-compatibility-scheduler/README.md: Complete KEP document proposing three solutions for image compatibility scheduling, with detailed workflows, merits/demerits analysis, and test plans
  • enhancements/2403-nfd-image-compatibility-scheduler/solution1.png: Architectural diagram illustrating the basic NodeFeatureGroup check approach
  • enhancements/2403-nfd-image-compatibility-scheduler/solution2.png: Architectural diagram showing the SQLite database caching solution
  • enhancements/2403-nfd-image-compatibility-scheduler/solution3.png: Architectural diagram depicting the node pre-grouping optimization strategy


Comment on lines 28 to 51
1. **CR Creation and Update (PreFilter Phase):** When a pod with specific image requirements enters the scheduling queue, the scheduler plugin fetches the attached OCI artifact, extracts the compatibility metadata (e.g., required kernel features), and **instantly creates a new `NodeFeatureGroup` CR**. This CR's specification defines the dynamic compatibility rules.

The `update NodeFeatureGroup` operation evaluates **all nodes in the cluster** against the CR's specification rules and updates the CR's `status` field with the list of nodes that satisfy the compatibility demands.

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: node-feature-group-example
spec:
  featureGroupRules:
    - name: "kernel version"
      matchFeatures:
        - feature: kernel.version
          matchExpressions:
            major: {op: In, value: ["6"]}
status:
  nodes:
    - name: node-1
    - name: node-2
    - name: node-3
```

2. **Node Filtering (Filter Phase):** In the scheduler's Filter phase, the plugin retrieves the dynamically created `NodeFeatureGroup` CR and filters the candidate nodes, ensuring that only nodes listed in the CR's `status` are considered compatible.
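
To make the Filter phase concrete, here is a minimal Go sketch of what such a plugin could look like against the kube-scheduler plugin framework. The plugin name, cycle-state key, and fallback behavior are illustrative assumptions, not part of the reviewed KEP; PreFilter (not shown) would fetch the OCI artifact, create or update the `NodeFeatureGroup`, and record the resulting node set in the cycle state.

```go
package imagecompat

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// compatState carries the node names read from the NodeFeatureGroup status
// during PreFilter so that Filter can consult them per candidate node.
type compatState struct {
	nodes sets.Set[string]
}

func (s *compatState) Clone() framework.StateData { return s }

// stateKey is a hypothetical key under which PreFilter stores compatState.
const stateKey = framework.StateKey("ImageCompatibility")

// ImageCompatibilityPlugin is a sketch of the proposed scheduler plugin.
type ImageCompatibilityPlugin struct{}

var _ framework.FilterPlugin = &ImageCompatibilityPlugin{}

func (p *ImageCompatibilityPlugin) Name() string { return "ImageCompatibility" }

// Filter admits only nodes listed in the NodeFeatureGroup status that
// PreFilter stored in the cycle state.
func (p *ImageCompatibilityPlugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	data, err := state.Read(stateKey)
	if err != nil {
		// PreFilter recorded no compatibility metadata for this image;
		// do not restrict scheduling (the degradation strategy discussed below).
		return nil
	}
	if !data.(*compatState).nodes.Has(nodeInfo.Node().Name) {
		return framework.NewStatus(framework.Unschedulable,
			"node is not listed in the compatible NodeFeatureGroup status")
	}
	return nil
}
```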

Copilot AI Dec 27, 2025


Missing critical information about lifecycle management. The proposal mentions creating NodeFeatureGroup CRs dynamically during scheduling but doesn't address cleanup. When and how are these ephemeral CRs deleted? Without proper cleanup, they could accumulate and cause resource exhaustion. This is particularly important for Solution 1 and potentially Solution 3, which create CRs per scheduling request.


The process involves three main phases:

1. **Initial Cluster Grouping:** In the cluster preparation stage, the administrator divides the cluster nodes into several groups using `NodeFeatureGroup`. Multiple `NodeFeatureGroup` Custom Resources (CRs) are created declaratively, each defining a grouping rule. Their status is populated with all matching nodes, completing the pre-grouping setup.
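
As an illustration of the O(G) evaluation this pre-grouping enables, the following Go sketch lists the pre-created `NodeFeatureGroup` CRs via the dynamic client and picks one representative node per group. The function name, and the assumptions that the CRs are namespaced and each group is homogeneous, are ours rather than the proposal's.

```go
package imagecompat

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var nfgGVR = schema.GroupVersionResource{
	Group:    "nfd.k8s-sigs.io",
	Version:  "v1alpha1",
	Resource: "nodefeaturegroups",
}

// representativeNodes returns one node name per pre-created NodeFeatureGroup,
// so compatibility can be evaluated O(G) times instead of O(N). It assumes
// each group's status already lists its member nodes and that the nodes in a
// group are homogeneous, so any member can stand in for the whole group.
func representativeNodes(ctx context.Context, client dynamic.Interface,
	namespace string) (map[string]string, error) {
	list, err := client.Resource(nfgGVR).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	reps := make(map[string]string, len(list.Items))
	for _, nfg := range list.Items {
		nodes, found, err := unstructured.NestedSlice(nfg.Object, "status", "nodes")
		if err != nil || !found || len(nodes) == 0 {
			continue // skip empty or malformed groups
		}
		if m, ok := nodes[0].(map[string]interface{}); ok {
			if name, ok := m["name"].(string); ok {
				reps[nfg.GetName()] = name
			}
		}
	}
	return reps, nil
}
```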

Copilot AI Dec 27, 2025


Missing important implementation detail. The proposal mentions that "administrator should divide the cluster nodes into several groups by NodeFeatureGroup" but doesn't provide guidance on how to determine appropriate grouping rules or how many groups are optimal. Additionally, it doesn't address what happens when new nodes are added to the cluster - how are they assigned to groups? These are critical considerations for the practical implementation of this solution.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Xunli-Yang
Once this PR has been reviewed and has the lgtm label, please ask for approval from arangogutierrez. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Lohon0

Lohon0 commented Jan 3, 2026

Hi


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@ArangoGutierrez ArangoGutierrez left a comment


Missing sections:

  • Risks and Mitigations
  • Graduation Criteria
  • Implementation Timeline

Contributor

@ArangoGutierrez ArangoGutierrez left a comment


Thanks @Xunli-Yang and @ChaoyiHuang for this comprehensive proposal! This is exactly the kind of Phase 2 work we need to make NFD image compatibility production-ready.

/ok-to-test

Feedback

Preferred Direction: Solution 3 (Node Pre-Grouping)

I'm leaning toward Solution 3 for the following reasons:

  1. Aligns with real-world cluster management - Large-scale operators already organize nodes into pools/groups based on hardware characteristics. This solution leverages existing practices rather than fighting against them.

  2. O(G) vs O(N) is critical at scale - Evaluating 10 representative nodes vs 10,000 individual nodes is the difference between sub-millisecond and multi-second scheduling latency.

  3. Simpler architecture - Unlike Solution 2 (SQLite), this doesn't require significant infrastructure changes to NFD master. The complexity is in the grouping strategy, not new storage backends.

  4. Progressive path - We could start with Solution 1 as an MVP for small clusters, then add Solution 3 optimizations for scale. Solution 2 with SQL isn't necessarily over-engineered, but introducing a SQL database into NFD feels like a project on its own and out of scope for this proposal.

Questions

  1. Scheduler plugin location: Have you considered building this as part of kubernetes-sigs/scheduler-plugins? That's the standard home for custom scheduler plugins and would give us the scheduling framework integration for free.

  2. NFG lifecycle management: What's the cleanup strategy for ephemeral NodeFeatureGroup CRs created during scheduling? Do they persist for caching, or are they garbage collected?

  3. Group homogeneity enforcement: For Solution 3, how do we validate/enforce that nodes within a pre-group are actually homogeneous? What happens if a node's features drift?

  4. Failure modes: What happens if:

    • The OCI artifact fetch fails during Prefilter?
    • The NFG status is stale or the controller is slow to update?
    • No groups match the compatibility requirements?

Missing KEP Sections

To align with standard KEP format, could you add:

  • Risks and Mitigations
  • Graduation Criteria (Alpha → Beta → GA)
  • Implementation Timeline / Milestones
  • Alternatives Considered (e.g., why not use node affinity directly?)

Great work on the diagrams - they really help visualize the three approaches. Looking forward to discussing this in the next community meeting!

cc @marquiz @kad

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 15, 2026
@Xunli-Yang
Author

Thanks @ArangoGutierrez, very valuable views for us. Agreed: Solution 3 (Node Pre-Grouping) is also the one we'd like to recommend, along with the progressive path. We are working on a demo of Solution 1 (as the base of Solution 3). Just like you said, starting with an MVP for small clusters can be the target of the first stage.

Q&A

1. Scheduler plugin location: Have you considered building this as part of [kubernetes-sigs/scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins)? That's the standard home for custom scheduler plugins and would give us the scheduling framework integration for free.

Yes, our goal is to integrate it into kubernetes-sigs/scheduler-plugins as a common scheduling plugin. We expect to incubate it in the NFD SIG initially.

2. NFG lifecycle management: What's the cleanup strategy for ephemeral NodeFeatureGroup CRs created during scheduling? Do they persist for caching, or are they garbage collected?

Manually created CRs are long-lived, but the temporary CRs from the scheduler will be garbage-collected with a TTL. All these details will be added to the proposal.
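
For illustration only, a TTL-based cleanup could look roughly like the Go sketch below; the marker label and function name are hypothetical, since the answer above only commits to TTL garbage collection.

```go
package imagecompat

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// cleanupEphemeralNFGs deletes scheduler-created NodeFeatureGroup CRs that
// are older than ttl. The label used to mark ephemeral CRs is an assumption.
func cleanupEphemeralNFGs(ctx context.Context, client dynamic.Interface,
	namespace string, ttl time.Duration) error {
	gvr := schema.GroupVersionResource{
		Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeaturegroups",
	}
	list, err := client.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "nfd.k8s-sigs.io/ephemeral=true", // hypothetical marker label
	})
	if err != nil {
		return err
	}
	for _, nfg := range list.Items {
		if time.Since(nfg.GetCreationTimestamp().Time) > ttl {
			if err := client.Resource(gvr).Namespace(namespace).
				Delete(ctx, nfg.GetName(), metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```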

3. Group homogeneity enforcement: For Solution 3, how do we validate/enforce that nodes within a pre-group are actually homogeneous? What happens if a node's features drift?

The idea is that administrators are responsible for ensuring homogeneity when they define the pre-groups; it is mandatory for cluster administrators and depends on the grouping strategy. If a node's features drift later, the pre-groups are updated through the `NodeFeatureGroup` update, so there is no lasting impact; the only gap is drift that happens after scheduling, where already-placed pods are not corrected until the next scheduling cycle. We'll need to add a monitoring mechanism to watch for drifted nodes, alert, and have administrators trigger rescheduling.

4. Failure modes: What happens if:

  • The OCI artifact fetch fails during Prefilter?
  • The NFG status is stale or the controller is slow to update?
  • No groups match the compatibility requirements?
  • When the OCI artifact fetch fails during PreFilter, we trigger the degradation strategy: continue scheduling but log a warning. The same applies to images with no compatibility demands.
  • If the NFG status is stale or the controller is slow to update, we can break this down into two scenarios:
    • If there are no schedulable nodes at the moment, then once the state updates, retrying the scheduling should eventually succeed.
    • If the NFG state is stale and a pod has already been scheduled onto a node whose actual state has changed: that's a really good question. We may avoid this by adding a last-second validation of the node right before the final binding step in the scheduler (see the sketch after this list).
  • If no groups match the compatibility requirements, the scheduling process will retry until it fails, then log an error.
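
As a sketch of the last-second validation mentioned above, the helper below re-reads the live NodeFeatureGroup status right before binding (e.g., from a PreBind hook), rather than trusting data cached at Filter time. The helper name and the exact hook are our assumptions.

```go
package imagecompat

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// nodeStillCompatible re-fetches the NodeFeatureGroup and checks that the
// chosen node is still listed in its status, so feature drift between the
// Filter phase and binding is caught before the pod lands on a stale node.
func nodeStillCompatible(ctx context.Context, client dynamic.Interface,
	namespace, nfgName, nodeName string) (bool, error) {
	nfg, err := client.Resource(schema.GroupVersionResource{
		Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeaturegroups",
	}).Namespace(namespace).Get(ctx, nfgName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	nodes, _, err := unstructured.NestedSlice(nfg.Object, "status", "nodes")
	if err != nil {
		return false, err
	}
	for _, n := range nodes {
		if m, ok := n.(map[string]interface{}); ok && m["name"] == nodeName {
			return true, nil
		}
	}
	return false, nil
}
```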

Good questions. We'll add the details of the solution along with the missing KEP sections.
