
Conversation

@Xunli-Yang

Add KEP: NFD image compatibility scheduler proposal.

What this proposal does:
Building upon the first phase of the KEP-1845 proposal, which completed node compatibility validation, this proposal introduces a compatibility scheduling plugin. The plugin automatically analyzes the compatibility requirements of container images, filters suitable nodes for scheduling, and ensures that containers run on compatible nodes.

Special notes for reviewers:
Based on discussions in the node-feature-discovery Slack channel, this proposal presents three solutions and aims to reach consensus on the implementation direction.
Co-authored-by: @ChaoyiHuang

@netlify

netlify bot commented Dec 27, 2025

Deploy Preview for kubernetes-sigs-nfd ready!

🔨 Latest commit: f6ebba8
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-nfd/deploys/695487f811e4f1000850980d
😎 Deploy Preview: https://deploy-preview-2403--kubernetes-sigs-nfd.netlify.app

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2025
@k8s-ci-robot
Contributor

Hi @Xunli-Yang. Thanks for your PR.

I'm waiting for an org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 27, 2025
@ArangoGutierrez ArangoGutierrez self-assigned this Dec 27, 2025

Copilot AI left a comment


Pull request overview

This PR adds KEP-2403, which proposes a compatibility scheduler plugin for Node Feature Discovery (NFD). Building on KEP-1845 (which established node compatibility validation), this proposal introduces automated scheduling capabilities to ensure pods are scheduled on nodes compatible with their container image requirements.

Key changes:

  • Introduces three alternative solution designs for implementing image compatibility scheduling
  • Proposes an ImageCompatibilityPlugin that leverages NodeFeatureGroup CRs to filter compatible nodes
  • Presents performance tradeoffs from basic validation (Solution 1) to optimized large-scale approaches (Solutions 2 and 3)

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated 26 comments.

  • enhancements/2403-nfd-image-compatibility-scheduler/README.md: Complete KEP document proposing three solutions for image compatibility scheduling, with detailed workflows, merits/demerits analysis, and test plans
  • enhancements/2403-nfd-image-compatibility-scheduler/solution1.png: Architectural diagram illustrating the basic NodeFeatureGroup check approach
  • enhancements/2403-nfd-image-compatibility-scheduler/solution2.png: Architectural diagram showing the SQLite database caching solution
  • enhancements/2403-nfd-image-compatibility-scheduler/solution3.png: Architectural diagram depicting the node pre-grouping optimization strategy


Comment on lines 28 to 51
1. **CR Creation and Update (PreFilter Phase):** When a pod with specific image requirements enters the scheduling queue, the scheduler plugin fetches the attached OCI artifact, extracts the compatibility metadata (e.g., required kernel features), and **instantly creates a new `NodeFeatureGroup` CR**. This CR's specification defines the dynamic compatibility rules.

The `update NodeFeatureGroup` operation evaluates **all nodes in the cluster** against the CR's specification rules and updates the CR's `status` field with the list of nodes that satisfy the compatibility demands.

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: node-feature-group-example
spec:
  featureGroupRules:
    - name: "kernel version"
      matchFeatures:
        - feature: kernel.version
          matchExpressions:
            major: {op: In, value: ["6"]}
status:
  nodes:
    - name: node-1
    - name: node-2
    - name: node-3
```

2. **Node Filtering (Filter Phase):** In the scheduler's Filter phase, the plugin retrieves the dynamically created `NodeFeatureGroup` CR and filters the candidate nodes, ensuring that only nodes listed in the CR's `status` are considered compatible.
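
To make the Filter phase concrete, here is a minimal Go sketch of what such a plugin could look like against the kube-scheduler plugin framework. The plugin name, cycle-state key, and fallback behavior are illustrative assumptions, not part of the reviewed KEP; PreFilter (not shown) would fetch the OCI artifact, create or update the `NodeFeatureGroup`, and record the resulting node set in the cycle state.

```go
package imagecompat

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// compatState carries the node names read from the NodeFeatureGroup status
// during PreFilter so that Filter can consult them per candidate node.
type compatState struct {
	nodes sets.Set[string]
}

func (s *compatState) Clone() framework.StateData { return s }

// stateKey is a hypothetical key under which PreFilter stores compatState.
const stateKey = framework.StateKey("ImageCompatibility")

// ImageCompatibilityPlugin is a sketch of the proposed scheduler plugin.
type ImageCompatibilityPlugin struct{}

var _ framework.FilterPlugin = &ImageCompatibilityPlugin{}

func (p *ImageCompatibilityPlugin) Name() string { return "ImageCompatibility" }

// Filter admits only nodes listed in the NodeFeatureGroup status that
// PreFilter stored in the cycle state.
func (p *ImageCompatibilityPlugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	data, err := state.Read(stateKey)
	if err != nil {
		// PreFilter recorded no compatibility metadata for this image;
		// do not restrict scheduling (the degradation strategy discussed below).
		return nil
	}
	if !data.(*compatState).nodes.Has(nodeInfo.Node().Name) {
		return framework.NewStatus(framework.Unschedulable,
			"node is not listed in the compatible NodeFeatureGroup status")
	}
	return nil
}
```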

Copilot AI Dec 27, 2025


Missing critical information about lifecycle management. The proposal mentions creating NodeFeatureGroup CRs dynamically during scheduling but doesn't address cleanup. When and how are these ephemeral CRs deleted? Without proper cleanup, they could accumulate and cause resource exhaustion. This is particularly important for Solution 1 and potentially Solution 3, which create CRs per scheduling request.


The process involves three main phases:

1. **Initial Cluster Grouping:** In the cluster preparation stage, the administrator divides the cluster nodes into several groups using `NodeFeatureGroup`. Multiple `NodeFeatureGroup` Custom Resources (CRs) are created declaratively, each defining a grouping rule. Their status is populated with all matching nodes, completing the pre-grouping setup.
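
As an illustration of the O(G) evaluation this pre-grouping enables, the following Go sketch lists the pre-created `NodeFeatureGroup` CRs via the dynamic client and picks one representative node per group. The function name, and the assumptions that the CRs are namespaced and each group is homogeneous, are ours rather than the proposal's.

```go
package imagecompat

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var nfgGVR = schema.GroupVersionResource{
	Group:    "nfd.k8s-sigs.io",
	Version:  "v1alpha1",
	Resource: "nodefeaturegroups",
}

// representativeNodes returns one node name per pre-created NodeFeatureGroup,
// so compatibility can be evaluated O(G) times instead of O(N). It assumes
// each group's status already lists its member nodes and that the nodes in a
// group are homogeneous, so any member can stand in for the whole group.
func representativeNodes(ctx context.Context, client dynamic.Interface,
	namespace string) (map[string]string, error) {
	list, err := client.Resource(nfgGVR).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	reps := make(map[string]string, len(list.Items))
	for _, nfg := range list.Items {
		nodes, found, err := unstructured.NestedSlice(nfg.Object, "status", "nodes")
		if err != nil || !found || len(nodes) == 0 {
			continue // skip empty or malformed groups
		}
		if m, ok := nodes[0].(map[string]interface{}); ok {
			if name, ok := m["name"].(string); ok {
				reps[nfg.GetName()] = name
			}
		}
	}
	return reps, nil
}
```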

Copilot AI Dec 27, 2025


Missing important implementation detail. The proposal mentions that "administrator should divide the cluster nodes into several groups by NodeFeatureGroup" but doesn't provide guidance on how to determine appropriate grouping rules or how many groups are optimal. Additionally, it doesn't address what happens when new nodes are added to the cluster - how are they assigned to groups? These are critical considerations for the practical implementation of this solution.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Xunli-Yang
Once this PR has been reviewed and has the lgtm label, please ask for approval from arangogutierrez. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Lohon0

Lohon0 commented Jan 3, 2026

Hi


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@ArangoGutierrez ArangoGutierrez left a comment


Missing sections:

  • Risks and Mitigations
  • Graduation Criteria
  • Implementation Timeline

Contributor

@ArangoGutierrez ArangoGutierrez left a comment


Thanks @Xunli-Yang and @ChaoyiHuang for this comprehensive proposal! This is exactly the kind of Phase 2 work we need to make NFD image compatibility production-ready.

/ok-to-test

Feedback

Preferred Direction: Solution 3 (Node Pre-Grouping)

I'm leaning toward Solution 3 for the following reasons:

  1. Aligns with real-world cluster management - Large-scale operators already organize nodes into pools/groups based on hardware characteristics. This solution leverages existing practices rather than fighting against them.

  2. O(G) vs O(N) is critical at scale - Evaluating 10 representative nodes vs 10,000 individual nodes is the difference between sub-millisecond and multi-second scheduling latency.

  3. Simpler architecture - Unlike Solution 2 (SQLite), this doesn't require significant infrastructure changes to NFD master. The complexity is in the grouping strategy, not new storage backends.

  4. Progressive path - We could start with Solution 1 as an MVP for small clusters, then add Solution 3 optimizations for scale. Solution 2 with SQL isn't necessarily over-engineered, but introducing a SQL database into NFD feels like a project on its own and out of scope for this proposal.

Questions

  1. Scheduler plugin location: Have you considered building this as part of kubernetes-sigs/scheduler-plugins? That's the standard home for custom scheduler plugins and would give us the scheduling framework integration for free.

  2. NFG lifecycle management: What's the cleanup strategy for ephemeral NodeFeatureGroup CRs created during scheduling? Do they persist for caching, or are they garbage collected?

  3. Group homogeneity enforcement: For Solution 3, how do we validate/enforce that nodes within a pre-group are actually homogeneous? What happens if a node's features drift?

  4. Failure modes: What happens if:

    • The OCI artifact fetch fails during Prefilter?
    • The NFG status is stale or the controller is slow to update?
    • No groups match the compatibility requirements?

Missing KEP Sections

To align with standard KEP format, could you add:

  • Risks and Mitigations
  • Graduation Criteria (Alpha → Beta → GA)
  • Implementation Timeline / Milestones
  • Alternatives Considered (e.g., why not use node affinity directly?)

Great work on the diagrams - they really help visualize the three approaches. Looking forward to discussing this in the next community meeting!

cc @marquiz @kad

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 15, 2026
@Xunli-Yang
Author

Thanks @ArangoGutierrez, very valuable views for us. Agreed: Solution 3 (Node Pre-Grouping) is also the one we'd like to recommend, along with the progressive path. We are working on a demo of Solution 1 (as the base of Solution 3). Just like you said, starting with an MVP for small clusters can be the target of the first stage.

Q&A

1. Scheduler plugin location: Have you considered building this as part of [kubernetes-sigs/scheduler-plugins](https://github.com/kubernetes-sigs/scheduler-plugins)? That's the standard home for custom scheduler plugins and would give us the scheduling framework integration for free.

Yes, our goal is to integrate it into kubernetes-sigs/scheduler-plugins as a common scheduling plugin. We expect to incubate it in the NFD SIG initially.

2. NFG lifecycle management: What's the cleanup strategy for ephemeral NodeFeatureGroup CRs created during scheduling? Do they persist for caching, or are they garbage collected?

Manually created CRs are long-lived, but the temporary CRs from the scheduler will be garbage-collected with a TTL. All these details will be added to the proposal.
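
For illustration only, a TTL-based cleanup could look roughly like the Go sketch below; the marker label and function name are hypothetical, since the answer above only commits to TTL garbage collection.

```go
package imagecompat

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// cleanupEphemeralNFGs deletes scheduler-created NodeFeatureGroup CRs that
// are older than ttl. The label used to mark ephemeral CRs is an assumption.
func cleanupEphemeralNFGs(ctx context.Context, client dynamic.Interface,
	namespace string, ttl time.Duration) error {
	gvr := schema.GroupVersionResource{
		Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeaturegroups",
	}
	list, err := client.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "nfd.k8s-sigs.io/ephemeral=true", // hypothetical marker label
	})
	if err != nil {
		return err
	}
	for _, nfg := range list.Items {
		if time.Since(nfg.GetCreationTimestamp().Time) > ttl {
			if err := client.Resource(gvr).Namespace(namespace).
				Delete(ctx, nfg.GetName(), metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```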

3. Group homogeneity enforcement: For Solution 3, how do we validate/enforce that nodes within a pre-group are actually homogeneous? What happens if a node's features drift?

The idea is that administrators are responsible for ensuring homogeneity when they define the pre-groups; it is mandatory for cluster administrators and depends on the grouping strategy. If a node's features drift later, the pre-groups are updated through the `NodeFeatureGroup` update, so there is no lasting impact; the only gap is drift that happens after scheduling, where already-placed pods are not corrected until the next scheduling cycle. We'll need to add a monitoring mechanism to watch for drifted nodes, alert, and have administrators trigger rescheduling.

4. Failure modes: What happens if:

  • The OCI artifact fetch fails during Prefilter?
  • The NFG status is stale or the controller is slow to update?
  • No groups match the compatibility requirements?
  • When the OCI artifact fetch fails during PreFilter, we trigger the degradation strategy: continue scheduling but log a warning. The same applies to images with no compatibility demands.
  • If the NFG status is stale or the controller is slow to update, we can break this down into two scenarios:
    • If there are no schedulable nodes at the moment, then once the state updates, retrying the scheduling should eventually succeed.
    • If the NFG state is stale and a pod has already been scheduled onto a node whose actual state has changed: that's a really good question. We may avoid this by adding a last-second validation of the node right before the final binding step in the scheduler (see the sketch after this list).
  • If no groups match the compatibility requirements, the scheduling process will retry until it fails, then log an error.
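
As a sketch of the last-second validation mentioned above, the helper below re-reads the live NodeFeatureGroup status right before binding (e.g., from a PreBind hook), rather than trusting data cached at Filter time. The helper name and the exact hook are our assumptions.

```go
package imagecompat

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// nodeStillCompatible re-fetches the NodeFeatureGroup and checks that the
// chosen node is still listed in its status, so feature drift between the
// Filter phase and binding is caught before the pod lands on a stale node.
func nodeStillCompatible(ctx context.Context, client dynamic.Interface,
	namespace, nfgName, nodeName string) (bool, error) {
	nfg, err := client.Resource(schema.GroupVersionResource{
		Group: "nfd.k8s-sigs.io", Version: "v1alpha1", Resource: "nodefeaturegroups",
	}).Namespace(namespace).Get(ctx, nfgName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	nodes, _, err := unstructured.NestedSlice(nfg.Object, "status", "nodes")
	if err != nil {
		return false, err
	}
	for _, n := range nodes {
		if m, ok := n.(map[string]interface{}); ok && m["name"] == nodeName {
			return true, nil
		}
	}
	return false, nil
}
```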

Good questions. We'll add the details of the solution along with the missing KEP sections.
