
Conversation

@sohankunkerkar
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #8160

Special notes for your reviewer:

When both TopologyAwareScheduling and ElasticJobsViaWorkloadSlices feature gates are enabled, Kueue now preserves topology locality across workload slice transitions during job scaling.

Does this PR introduce a user-facing change?

Enable Topology Aware Scheduling (TAS) integration with ElasticJobsViaWorkloadSlices

Copilot AI review requested due to automatic review settings January 14, 2026 06:11
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 14, 2026
@netlify

netlify bot commented Jan 14, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 97ffeea |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/696aa5936df4490007f03e67 |

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 14, 2026

Copilot AI left a comment


Pull request overview

This PR adds support for integrating Topology Aware Scheduling (TAS) with ElasticJobsViaWorkloadSlices. When both feature gates are enabled, Kueue preserves topology locality across workload slice transitions during job scaling operations.

Changes:

  • Adds JobNameAnnotation constant to provide stable pod identification across workload slice transitions
  • Implements topology assignment preservation logic in the scheduler to place scaled workloads in the same topology domain
  • Extends pod indexing with a new JobNameKey index for finding pods by parent job name
  • Updates controllers (TopologyUngater, NodeFailureReconciler) to use job-name-based pod lookups when both features are enabled
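As a rough sketch of the first two bullets, something like the following could implement the annotation and index; the annotation value, index key, and helper name here are assumptions for illustration, not necessarily the constants added by this PR:

```go
package indexer

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// JobNameAnnotation marks a pod with its parent job's name; unlike the
	// workload name, it stays stable across workload slice transitions.
	// The value below is illustrative only.
	JobNameAnnotation = "kueue.x-k8s.io/job-name"

	// JobNameKey is the field-index key used to list pods by parent job.
	JobNameKey = "pod-job-name"
)

// indexPodJobName lets a field indexer answer "all pods of job X" queries by
// extracting the job-name annotation from each pod.
func indexPodJobName(obj client.Object) []string {
	pod, ok := obj.(*corev1.Pod)
	if !ok || pod.Annotations == nil {
		return nil
	}
	if name, found := pod.Annotations[JobNameAnnotation]; found {
		return []string{name}
	}
	return nil
}
```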

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| apis/kueue/v1beta2/topology_types.go | Adds JobNameAnnotation constant for stable pod identification |
| apis/kueue/v1beta1/topology_types.go | Adds JobNameAnnotation constant (v1beta1 version) |
| pkg/scheduler/flavorassigner/tas_flavorassigner.go | Extracts previous topology assignment from replaced workload slice |
| pkg/cache/scheduler/tas_flavor_snapshot.go | Implements topology domain preservation and freed capacity calculation for replacement workloads |
| pkg/controller/tas/topology_ungater.go | Updates pod lookup to use job-name index when ElasticJobsViaWorkloadSlices is enabled |
| pkg/controller/tas/node_failure_controller.go | Refactors to track workload info with job names for pod lookups |
| pkg/controller/tas/indexer/indexer.go | Adds new job-name index for pods |
| pkg/controller/jobframework/reconciler.go | Injects JobNameAnnotation on pods when both TAS and ElasticJobsViaWorkloadSlices are enabled |
| test/integration/singlecluster/tas/tas_test.go | Adds integration test for TAS with elastic workload slices |
| test/integration/singlecluster/tas/suite_test.go | Configures pod webhook in test suite |
| keps/77-dynamically-sized-jobs/README.md | Documents TAS integration with elastic jobs |
| keps/2724-topology-aware-scheduling/README.md | Documents elastic workload support in TAS |


@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch from c82176e to af27221 Compare January 14, 2026 16:38
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 14, 2026
@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch from af27221 to c38cb65 Compare January 14, 2026 16:47
@sohankunkerkar sohankunkerkar changed the title [WIP] Add support for TAS + Elasticworkloads Add support for TAS + Elasticworkloads Jan 15, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2026
}

// getWorkloadSliceOriginName returns the original workload name in a replacement chain.
func getWorkloadSliceOriginName(ctx context.Context, c client.Client, wl *kueue.Workload) string {
Contributor


This function cannot depend on being able to fetch the first workload in the chain, as that workload may already be deleted. I think we need to carry that information on all workloads in the chain.
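A minimal sketch of that suggestion, assuming the origin name is carried as an annotation on every slice; the annotation key, the changed signature, and the fallback are illustrative, not what this PR implements:

```go
package tas

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta2"
)

// workloadSliceOriginAnnotation is an illustrative key; the real mechanism
// chosen for the PR may differ.
const workloadSliceOriginAnnotation = "kueue.x-k8s.io/workload-slice-origin"

// getWorkloadSliceOriginName reads the origin name carried on the workload
// itself, so the lookup no longer depends on earlier slices still existing.
func getWorkloadSliceOriginName(wl *kueue.Workload) string {
	if origin, ok := wl.Annotations[workloadSliceOriginAnnotation]; ok {
		return origin
	}
	// A workload without the annotation is itself the origin of the chain.
	return wl.Name
}
```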

}

// getWorkloadsOnNode gets all workloads that have the given node assigned in TAS topology assignment
func (r *nodeFailureReconciler) getWorkloadsOnNode(ctx context.Context, nodeName string) (sets.Set[types.NamespacedName], error) {
Contributor

@mimowo mimowo Jan 15, 2026


I think you could keep the previous interface (returning sets.Set[types.NamespacedName]) to avoid the diff. This would improve readability of the PR.

// new pods in the same topology domain as existing pods
var requiredDomain utiltas.TopologyDomainID
var freedUsage map[utiltas.TopologyDomainID]resources.Requests
if workers.PreviousTopologyAssignment != nil {
Contributor


Please wrap this code in a check for the ElasticJobsViaWorkloadSlices feature gate. I know that currently a non-nil workers.PreviousTopologyAssignment implies the gate is enabled, but code evolves and over time checking that will be harder. Using the feature gate also gives a clear indication of why this was introduced.
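A minimal sketch of the suggested guard, reusing the names from the snippet above (illustrative only, not the final diff; imports as in the surrounding file):

```go
// Gate the preservation branch on the feature gate explicitly, not only on
// the presence of a previous assignment carried by the replaced slice.
var requiredDomain utiltas.TopologyDomainID
var freedUsage map[utiltas.TopologyDomainID]resources.Requests
if features.Enabled(features.ElasticJobsViaWorkloadSlices) && workers.PreviousTopologyAssignment != nil {
	// ... existing preservation logic, unchanged ...
}
```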

if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
workloadSliceName = getWorkloadSliceOriginName(ctx, r.client, wl)
}
info, err := getPodSetsInfoFromStatus(ctx, r.client, wl, workloadSliceName)
Contributor


This looks like code duplication with https://github.com/kubernetes-sigs/kueue/pull/8580/changes#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR1062-R1067. Can we please wrap this in a common function, or even better fold it into the pre-existing getPodSetsInfoFromStatus?
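One possible shape for that consolidation, sketched as a small helper (the helper name is hypothetical and imports are as in the surrounding file; folding the check directly into getPodSetsInfoFromStatus would work equally well):

```go
// workloadLookupName resolves the name used for pod and pod-set lookups: the
// slice-origin name when elastic workload slices are enabled, otherwise the
// workload's own name. Callers then stop repeating the feature-gate check.
func workloadLookupName(ctx context.Context, c client.Client, wl *kueue.Workload) string {
	if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return getWorkloadSliceOriginName(ctx, c, wl)
	}
	return wl.Name
}
```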

return fmt.Errorf("setting index pod workload: %w", err)
}

if err := indexer.IndexField(ctx, &corev1.Pod{}, WorkloadSliceNameKey, indexPodWorkloadSliceName); err != nil {
Contributor


Only register the indexer if ElasticJobsViaWorkloadSlices is enabled.
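A sketch of that guard, reusing the names from the snippet above (illustrative, not the final diff):

```go
// Register the workload-slice-name pod index only when the feature gate is
// enabled, so clusters without elastic jobs do not maintain the extra index.
if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
	if err := indexer.IndexField(ctx, &corev1.Pod{}, WorkloadSliceNameKey, indexPodWorkloadSliceName); err != nil {
		return fmt.Errorf("setting index pod workload slice name: %w", err)
	}
}
```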

Comment on lines 3955 to 3964
gomega.Expect(k8sClient.Create(ctx, ns2)).To(gomega.Succeed())
defer func() {
gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns2)).To(gomega.Succeed())
}()

localQueue2 := utiltestingapi.MakeLocalQueue("local-queue", ns2.Name).ClusterQueue(clusterQueue.Name).Obj()
util.MustCreate(ctx, k8sClient, localQueue2)
defer func() {
gomega.Expect(util.DeleteObject(ctx, k8sClient, localQueue2)).To(gomega.Succeed())
}()
Contributor


Instead of this, please clean up after all tests in an AfterEach by calling DeleteNamespace.
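A minimal sketch of that pattern, assuming the suite-level variables from the snippet above (ns2, localQueue2, k8sClient, ctx); exact placement in the test file may differ:

```go
ginkgo.AfterEach(func() {
	// Deleting the namespace also removes the namespaced objects created by
	// the test, such as localQueue2, so per-test defer blocks are unnecessary.
	gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns2)).To(gomega.Succeed())
})
```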

Contributor

@mimowo mimowo left a comment


Thank you 👍 This looks very good. The comments are mostly around:

  1. fixing getWorkloadSliceOrigin so it does not rely on all workloads in the chain staying around
  2. removing code duplication
  3. removing unnecessary changes to minimize the diff for readability
  4. making sure all modified places are behind the feature gate

Other than that the changes look straightforward, and the feature is very valuable, even while still in alpha. So I'm supportive of including it in 0.16.

@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch from c38cb65 to 09baa0b Compare January 15, 2026 18:38
@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch from 09baa0b to 9e2f076 Compare January 15, 2026 18:46
@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch 2 times, most recently from e2548c8 to 17d0a6c Compare January 15, 2026 19:02
Contributor

@mimowo mimowo left a comment


I think there are a couple of potential issues here. First, I think we cannot move already running Pods, so we need to ensure the new assignment is a superset of the previous assignment. Looking at the code, this is not necessarily the case.

Let's list some of the possible cases:

  1. unconstrained annotation used: this looks like the simplest case; we basically run the LeastCapacity algorithm for the delta of Pods
  2. preferred annotation used: this looks generally harder, because ideally we would place the new Pods close to the previous ones; but since this is "best effort", we could probably use "BestFit" for the new delta of Pods
  3. required annotation used, say required-topology=block: this might be genuinely tricky to implement. It is a similar problem to the one we solve in Node Hot Swap, but for multiple Pods. We could probably reuse the single-Pod algorithm, allowing scale-ups one Pod at a time, but this is generally tricky.

We don't need to solve them all while in alpha. TBH I prefer the PR to just set the direction and stay smaller, rather than be big but breaking :). I would rather just validate against the unsupported cases; we can easily keep adding support gradually while in alpha. Plus we should test scale down, ideally both with e2e tests.

So I'm thinking that, for simplicity, maybe we just start with "unconstrained" in 0.16. This could already be powerful: combining ElasticJobs with TAS eliminates the quota fragmentation issues. WDYT?
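As a rough sketch of what "validate against the unsupported cases" could look like, assuming the PodSet topology request carries Required and Preferred fields as in the TAS API (the function name and error message are illustrative):

```go
package jobframework

import (
	"fmt"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta2"
)

// validateSliceTopologyRequests rejects scale-ups for topology modes that the
// first iteration of the TAS + ElasticJobsViaWorkloadSlices integration does
// not support; only "unconstrained" requests would be handled initially.
func validateSliceTopologyRequests(podSets []kueue.PodSet) error {
	for i := range podSets {
		tr := podSets[i].TopologyRequest
		if tr == nil {
			continue
		}
		if tr.Required != nil || tr.Preferred != nil {
			return fmt.Errorf("podset %q: required/preferred topology is not yet supported with workload slices", podSets[i].Name)
		}
	}
	return nil
}
```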

@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch from 17d0a6c to 1496a3b Compare January 16, 2026 16:41
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 1359 to 1365
TAS supports ElasticJobsViaWorkloadSlices. When both feature gates are enabled, TAS attempts to
preserve topology locality across workload slice transitions during job scaling for workloads with
**required** topology requests. Pods are indexed by a stable `kueue.x-k8s.io/workload-slice-name`
annotation (using the origin workload name in the slice chain), and the `TopologyAssignment` from
the old slice is used to constrain placement of new pods to the same topology domain. If the
previous assignment is stale (e.g., node unhealthy), TAS falls back to normal placement. Preferred
topology requests do not currently benefit from this preservation.
Contributor


Let's just refer to the ElasticWorkloads KEP here, and discuss the algorithm etc. in one place.

)
```

#### Topology Assignment Preservation
Contributor


Suggested change:
- #### Topology Assignment Preservation
+ #### Scale Up

the `TopologyAssignment` from the old slice and passes it to TAS. TAS attempts to place
new pods in the same topology domain as existing pods. If the previous assignment is
stale (e.g., nodes are unhealthy), TAS falls back to normal placement.

Contributor


Let's clarify that for the first iteration of the integration in 0.16 we will focus on supporting only "unconstrained" TAS for scale up. Supporting scale down and the other modes, "preferred" and "required", will be the subject of future iterations of the feature.

Contributor


Please also add "Full integration with Topology-Aware Scheduling, or clear validation against unsupported options" to the graduation criteria for Beta.

Contributor

@mimowo mimowo left a comment


I think for readability I would suggest actually de-coupling the KEP update and the implementation into separate PRs. These could then also be reviewed by other folks.

@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch from 1496a3b to 148eb80 Compare January 16, 2026 20:21
@sohankunkerkar
Member Author

I think for readability I would suggest actually de-coupling the KEP update and the implementation into separate PRs. These could then also be reviewed by other folks.

#8642

This change enables Topology-Aware Scheduling (TAS) to work correctly
with the ElasticJobsViaWorkloadSlices feature, which allows jobs to
dynamically scale via workload slices. When ElasticJobsViaWorkloadSlices
is enabled and a job name is present, the job annotation takes precedence
over the workload annotation for pod lookups, ensuring that all pods
across slices are found correctly.

Signed-off-by: Sohan Kunkerkar <[email protected]>
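A rough sketch of the lookup precedence the commit message describes; the helper name, signature, and index keys below are assumptions for illustration (see pkg/controller/tas/indexer for the actual index added by the PR):

```go
package tas

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/kueue/pkg/features"
)

// Hypothetical index keys used only for this sketch.
const (
	jobNameKey      = "job-name"
	workloadNameKey = "workload-name"
)

// listPodsForWorkload prefers the job-name index when elastic workload slices
// are enabled and a job name is known, so pods created under previous slices
// of the same job are still found; otherwise it falls back to the workload
// name index.
func listPodsForWorkload(ctx context.Context, c client.Client, namespace, workloadName, jobName string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	key, value := workloadNameKey, workloadName
	if features.Enabled(features.ElasticJobsViaWorkloadSlices) && jobName != "" {
		key, value = jobNameKey, jobName
	}
	err := c.List(ctx, pods, client.InNamespace(namespace), client.MatchingFields{key: value})
	return pods, err
}
```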
@sohankunkerkar sohankunkerkar force-pushed the support-elasticworkloads branch 2 times, most recently from 148eb80 to 97ffeea Compare January 16, 2026 20:54
