Add support for TAS + Elasticworkloads #8580
Conversation
Pull request overview
This PR adds support for integrating Topology Aware Scheduling (TAS) with ElasticJobsViaWorkloadSlices. When both feature gates are enabled, Kueue preserves topology locality across workload slice transitions during job scaling operations.
Changes:
- Adds a `JobNameAnnotation` constant to provide stable pod identification across workload slice transitions (sketched below)
- Implements topology assignment preservation logic in the scheduler to place scaled workloads in the same topology domain
- Extends pod indexing with a new `JobNameKey` index for finding pods by parent job name
- Updates controllers (TopologyUngater, NodeFailureReconciler) to use job-name-based pod lookups when both features are enabled
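For orientation, a minimal sketch of what the new constant could look like; the exact key string and file placement are assumptions, not taken from the diff:

```go
// apis/kueue/v1beta2/topology_types.go (sketch only)
package v1beta2

const (
	// JobNameAnnotation records the parent job name on a Pod so the Pod can still
	// be matched after its workload slice is replaced during a resize.
	// The key below is hypothetical; see the actual diff for the real value.
	JobNameAnnotation = "kueue.x-k8s.io/job-name"
)
```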
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| apis/kueue/v1beta2/topology_types.go | Adds JobNameAnnotation constant for stable pod identification |
| apis/kueue/v1beta1/topology_types.go | Adds JobNameAnnotation constant (v1beta1 version) |
| pkg/scheduler/flavorassigner/tas_flavorassigner.go | Extracts previous topology assignment from replaced workload slice |
| pkg/cache/scheduler/tas_flavor_snapshot.go | Implements topology domain preservation and freed capacity calculation for replacement workloads |
| pkg/controller/tas/topology_ungater.go | Updates pod lookup to use job-name index when ElasticJobsViaWorkloadSlices is enabled |
| pkg/controller/tas/node_failure_controller.go | Refactors to track workload info with job names for pod lookups |
| pkg/controller/tas/indexer/indexer.go | Adds new job-name index for pods |
| pkg/controller/jobframework/reconciler.go | Injects JobNameAnnotation on pods when both TAS and ElasticJobsViaWorkloadSlices are enabled |
| test/integration/singlecluster/tas/tas_test.go | Adds integration test for TAS with elastic workload slices |
| test/integration/singlecluster/tas/suite_test.go | Configures pod webhook in test suite |
| keps/77-dynamically-sized-jobs/README.md | Documents TAS integration with elastic jobs |
| keps/2724-topology-aware-scheduling/README.md | Documents elastic workload support in TAS |
```go
}

// getWorkloadSliceOriginName returns the original workload name in a replacement chain.
func getWorkloadSliceOriginName(ctx context.Context, c client.Client, wl *kueue.Workload) string {
```
This function cannot depend on getting back the first workload in the chain, as it may already be deleted. I think we need to carry that information on all workloads.
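A rough sketch of the direction this suggests, assuming a hypothetical annotation that stamps the origin name on every slice, so the lookup no longer depends on earlier slices still existing:

```go
package example

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// workloadSliceOriginAnnotation is a hypothetical key, not the one used in the PR.
const workloadSliceOriginAnnotation = "kueue.x-k8s.io/workload-slice-origin"

// originName reads the origin directly from the slice itself, so it keeps working
// even after earlier workloads in the replacement chain have been deleted.
func originName(wl *kueue.Workload) string {
	if origin := wl.Annotations[workloadSliceOriginAnnotation]; origin != "" {
		return origin
	}
	// The first slice in a chain is its own origin.
	return wl.Name
}
```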
```go
}

// getWorkloadsOnNode gets all workloads that have the given node assigned in TAS topology assignment
func (r *nodeFailureReconciler) getWorkloadsOnNode(ctx context.Context, nodeName string) (sets.Set[types.NamespacedName], error) {
```
I think you could keep the previous interface (returning sets.Set[types.NamespacedName]) to avoid the diff. This would improve readability of the PR.
```go
// new pods in the same topology domain as existing pods
var requiredDomain utiltas.TopologyDomainID
var freedUsage map[utiltas.TopologyDomainID]resources.Requests
if workers.PreviousTopologyAssignment != nil {
```
Please wrap this code in a check for the ElasticJobsViaWorkloadSlices feature gate. I know that currently `workers.PreviousTopologyAssignment != nil` implies the gate is on, but code evolves and relying on that will become harder to verify over time. Using the feature gate also gives a clear indication of why this was introduced.
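A minimal sketch of the suggested guard; `features.Enabled` and the gate name appear in the diff below, while the helper itself is hypothetical:

```go
package example

import (
	"sigs.k8s.io/kueue/pkg/features"
)

// preserveSliceTopology makes the feature-gate dependency explicit instead of
// inferring it from PreviousTopologyAssignment being non-nil.
func preserveSliceTopology(hasPreviousAssignment bool) bool {
	return features.Enabled(features.ElasticJobsViaWorkloadSlices) && hasPreviousAssignment
}
```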
```go
if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
	workloadSliceName = getWorkloadSliceOriginName(ctx, r.client, wl)
}
info, err := getPodSetsInfoFromStatus(ctx, r.client, wl, workloadSliceName)
```
This looks like code duplication with https://github.com/kubernetes-sigs/kueue/pull/8580/changes#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR1062-R1067. Can we please wrap this in a common function, or, even better, fold it into the pre-existing getPodSetsInfoFromStatus?
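A sketch of the suggested consolidation; the wrapper and the `originNameFunc` indirection are made up so the snippet stays self-contained:

```go
package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
	"sigs.k8s.io/kueue/pkg/features"
)

// originNameFunc stands in for the PR's getWorkloadSliceOriginName.
type originNameFunc func(context.Context, client.Client, *kueue.Workload) string

// sliceLookupName puts the feature-gated origin resolution in one place so the
// two call sites (or getPodSetsInfoFromStatus itself) can share it.
func sliceLookupName(ctx context.Context, c client.Client, wl *kueue.Workload, originOf originNameFunc) string {
	if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return originOf(ctx, c, wl)
	}
	return wl.Name
}
```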
```go
	return fmt.Errorf("setting index pod workload: %w", err)
}

if err := indexer.IndexField(ctx, &corev1.Pod{}, WorkloadSliceNameKey, indexPodWorkloadSliceName); err != nil {
```
Only register the indexer if ElasticJobsViaWorkloadSlices is enabled.
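A self-contained sketch of registering the index only behind the gate; the key string, annotation, and function names are assumptions modelled on the identifiers visible in the diff:

```go
package indexer

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/kueue/pkg/features"
)

// Hypothetical key and annotation; the real ones live in the PR's indexer package.
const workloadSliceNameKey = "metadata.workloadSliceName"

func indexPodWorkloadSliceName(obj client.Object) []string {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return nil
	}
	if name := pod.Annotations["kueue.x-k8s.io/workload-slice-name"]; name != "" {
		return []string{name}
	}
	return nil
}

// setupPodIndexes registers the slice-name index only when the feature is on,
// as the comment above suggests.
func setupPodIndexes(ctx context.Context, fieldIndexer client.FieldIndexer) error {
	if !features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return nil
	}
	if err := fieldIndexer.IndexField(ctx, &corev1.Pod{}, workloadSliceNameKey, indexPodWorkloadSliceName); err != nil {
		return fmt.Errorf("setting index pod workload slice name: %w", err)
	}
	return nil
}
```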
```go
gomega.Expect(k8sClient.Create(ctx, ns2)).To(gomega.Succeed())
defer func() {
	gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns2)).To(gomega.Succeed())
}()

localQueue2 := utiltestingapi.MakeLocalQueue("local-queue", ns2.Name).ClusterQueue(clusterQueue.Name).Obj()
util.MustCreate(ctx, k8sClient, localQueue2)
defer func() {
	gomega.Expect(util.DeleteObject(ctx, k8sClient, localQueue2)).To(gomega.Succeed())
}()
```
Instead of that, please clean up in AfterEach across all tests by calling DeleteNamespace.
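A minimal Ginkgo sketch of the requested pattern, assuming the suite's existing `ctx`, `k8sClient`, and `util` helpers and the `ns2` variable from the test above:

```go
ginkgo.AfterEach(func() {
	// Deleting the namespace cleans up everything created in it,
	// replacing the per-object defers.
	gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns2)).To(gomega.Succeed())
})
```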
mimowo left a comment:
Thank you 👍 This looks very good. The comments are mostly around:
- fixing getWorkloadSliceOrigin to avoid relying on all workloads staying around
- removing code duplication
- removing unnecessary changes to minimize the diff for readability
- making sure all modified places are behind the FG

Other than that the changes look straightforward, and the feature is super valuable while still in alpha, so I'm supportive of including it in 0.16.
I think there are a couple of potential issues here. First, I think we cannot move already running Pods, so we need to ensure the new assignment is a superset of the previous assignment. Looking at the code this is not necessarily the case.

Let's list some of the possible cases:
- `unconstrained` annotation used: looks like the simplest case, we basically run the LeastCapacity algo for the delta of Pods
- `preferred` annotation used: this looks generally harder, because ideally we would place the Pods close to the previous ones, but since this is "best effort" we could probably use "BestFit" for the new delta of Pods
- `required` annotation used, say `required-topology=block`: this might be tricky to implement, really. This is a similar problem to the one we solve in Node Hot Swap, but for multiple Pods. We could probably readily use the algo for one Pod, allowing scale-ups per Pod, but this is generally tricky.

We don't need to solve them all while in alpha. TBH I prefer the PR to just set the direction and be smaller, rather than big but breaking :). I would rather just validate against the unsupported cases. We can easily keep adding support gradually while in alpha. Plus we should test the scale down, ideally covering both with e2e tests.

So I'm thinking, for simplicity, maybe we just start with "unconstrained" in 0.16. This could already be powerful, combining ElasticJobs with TAS to eliminate the quota fragmentation issues. WDYT?
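If the scope is narrowed that way, a hypothetical validation helper could look roughly like this; the `Unconstrained` field name on `PodSetTopologyRequest` is my assumption about the v1beta1 API, not taken from the PR:

```go
package example

import (
	"errors"

	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// checkSliceScaleUpSupported rejects scale-ups whose topology request is anything
// other than "unconstrained", matching the proposed 0.16 scope.
func checkSliceScaleUpSupported(req *kueue.PodSetTopologyRequest) error {
	if req == nil || ptr.Deref(req.Unconstrained, false) {
		return nil
	}
	return errors.New("TAS with ElasticJobsViaWorkloadSlices currently supports only unconstrained topology requests for scale up")
}
```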
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
> TAS supports ElasticJobsViaWorkloadSlices. When both feature gates are enabled, TAS attempts to preserve topology locality across workload slice transitions during job scaling for workloads with **required** topology requests. Pods are indexed by a stable `kueue.x-k8s.io/workload-slice-name` annotation (using the origin workload name in the slice chain), and the `TopologyAssignment` from the old slice is used to constrain placement of new pods to the same topology domain. If the previous assignment is stale (e.g., node unhealthy), TAS falls back to normal placement. Preferred topology requests do not currently benefit from this preservation.
Let's just refer to the ElasticWorkloads KEP, and discuss the algo etc. in one place.
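For illustration of the preservation mechanism quoted above, a rough sketch of turning the previous slice's `TopologyAssignment` into a required domain; the helper is hypothetical and `utiltas.DomainID` is assumed to match the existing util package:

```go
package example

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
	utiltas "sigs.k8s.io/kueue/pkg/util/tas"
)

// requiredDomainFromPrevious picks the domain of the previous slice's assignment
// to constrain placement of the new pods; an empty result means "no constraint"
// (fall back to normal placement).
func requiredDomainFromPrevious(prev *kueue.TopologyAssignment) utiltas.TopologyDomainID {
	if prev == nil || len(prev.Domains) == 0 {
		return ""
	}
	// Note: this only handles a single-domain previous assignment; multi-domain
	// assignments would need the freed-capacity accounting mentioned in the diff.
	return utiltas.DomainID(prev.Domains[0].Values)
}
```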
> #### Topology Assignment Preservation
Suggested change: rename the heading `#### Topology Assignment Preservation` to `#### Scale Up`.
> …the `TopologyAssignment` from the old slice and passes it to TAS. TAS attempts to place new pods in the same topology domain as existing pods. If the previous assignment is stale (e.g., nodes are unhealthy), TAS falls back to normal placement.
Let's clarify that for the first iteration of the integration in 0.16 we will focus on supporting only "unconstrained" TAS for scale up. Supporting scale down and the other modes, "preferred" and "required", will be the subject of future iterations of the feature.
Please also add "Full integration with Topology-Aware Scheduling, or clear validation against unsupported options" to the graduation criteria for Beta.
mimowo left a comment:
I think for readability I would suggest actually decoupling the KEP update and the implementation into separate PRs. These can then also be reviewed by other folks.
This change enables Topology-Aware Scheduling (TAS) to work correctly with the ElasticJobsViaWorkloadSlices feature, which allows jobs to dynamically scale via workload slices. When ElasticJobsViaWorkloadSlices is enabled and a job name is present, the job annotation takes precedence over the workload annotation for pod lookups, ensuring all pods across slices are found correctly.

Signed-off-by: Sohan Kunkerkar <[email protected]>
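A hypothetical sketch of the lookup precedence the commit message describes; the index keys and helper are made up for illustration:

```go
package example

import (
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/kueue/pkg/features"
)

// Hypothetical index keys; the PR defines its own constants in pkg/controller/tas/indexer.
const (
	jobNameKey      = "metadata.jobName"
	workloadNameKey = "metadata.workloadName"
)

// podLookupField chooses which pod index to query: the job-name index when the
// feature gate is on and a job name is available, otherwise the workload index.
func podLookupField(jobName, workloadName string) client.MatchingFields {
	if features.Enabled(features.ElasticJobsViaWorkloadSlices) && jobName != "" {
		return client.MatchingFields{jobNameKey: jobName}
	}
	return client.MatchingFields{workloadNameKey: workloadName}
}
```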
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #8160
Special notes for your reviewer:
When both TopologyAwareScheduling and ElasticJobsViaWorkloadSlices feature gates are enabled, Kueue now preserves topology locality across workload slice transitions during job scaling.
Does this PR introduce a user-facing change?