Add support for TAS + Elasticworkloads #8580
Conversation
Pull request overview
This PR adds support for integrating Topology Aware Scheduling (TAS) with ElasticJobsViaWorkloadSlices. When both feature gates are enabled, Kueue preserves topology locality across workload slice transitions during job scaling operations.
Changes:
- Adds a `JobNameAnnotation` constant to provide stable pod identification across workload slice transitions (sketched below)
- Implements topology assignment preservation logic in the scheduler to place scaled workloads in the same topology domain
- Extends pod indexing with a new `JobNameKey` index for finding pods by parent job name
- Updates controllers (TopologyUngater, NodeFailureReconciler) to use job-name-based pod lookups when both features are enabled
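For orientation, a minimal sketch of what the new constant could look like; the exact key string and file placement are assumptions, not taken from the diff:

```go
// apis/kueue/v1beta2/topology_types.go (sketch only)
package v1beta2

const (
	// JobNameAnnotation records the parent job name on a Pod so the Pod can still
	// be matched after its workload slice is replaced during a resize.
	// The key below is hypothetical; see the actual diff for the real value.
	JobNameAnnotation = "kueue.x-k8s.io/job-name"
)
```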
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| apis/kueue/v1beta2/topology_types.go | Adds JobNameAnnotation constant for stable pod identification |
| apis/kueue/v1beta1/topology_types.go | Adds JobNameAnnotation constant (v1beta1 version) |
| pkg/scheduler/flavorassigner/tas_flavorassigner.go | Extracts previous topology assignment from replaced workload slice |
| pkg/cache/scheduler/tas_flavor_snapshot.go | Implements topology domain preservation and freed capacity calculation for replacement workloads |
| pkg/controller/tas/topology_ungater.go | Updates pod lookup to use job-name index when ElasticJobsViaWorkloadSlices is enabled |
| pkg/controller/tas/node_failure_controller.go | Refactors to track workload info with job names for pod lookups |
| pkg/controller/tas/indexer/indexer.go | Adds new job-name index for pods |
| pkg/controller/jobframework/reconciler.go | Injects JobNameAnnotation on pods when both TAS and ElasticJobsViaWorkloadSlices are enabled |
| test/integration/singlecluster/tas/tas_test.go | Adds integration test for TAS with elastic workload slices |
| test/integration/singlecluster/tas/suite_test.go | Configures pod webhook in test suite |
| keps/77-dynamically-sized-jobs/README.md | Documents TAS integration with elastic jobs |
| keps/2724-topology-aware-scheduling/README.md | Documents elastic workload support in TAS |
```go
}

// getWorkloadSliceOriginName returns the original workload name in a replacement chain.
func getWorkloadSliceOriginName(ctx context.Context, c client.Client, wl *kueue.Workload) string {
```
This function cannot depend on getting back the first workload in the chain, as it may already be deleted. I think we need to carry that information on all workloads.
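A rough sketch of the direction this suggests, assuming a hypothetical annotation that stamps the origin name on every slice, so the lookup no longer depends on earlier slices still existing:

```go
package example

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// workloadSliceOriginAnnotation is a hypothetical key, not the one used in the PR.
const workloadSliceOriginAnnotation = "kueue.x-k8s.io/workload-slice-origin"

// originName reads the origin directly from the slice itself, so it keeps working
// even after earlier workloads in the replacement chain have been deleted.
func originName(wl *kueue.Workload) string {
	if origin := wl.Annotations[workloadSliceOriginAnnotation]; origin != "" {
		return origin
	}
	// The first slice in a chain is its own origin.
	return wl.Name
}
```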
```go
}

// getWorkloadsOnNode gets all workloads that have the given node assigned in TAS topology assignment
func (r *nodeFailureReconciler) getWorkloadsOnNode(ctx context.Context, nodeName string) (sets.Set[types.NamespacedName], error) {
```
I think you could keep the previous interface (returning sets.Set[types.NamespacedName]) to avoid the diff. This would improve readability of the PR.
```go
// new pods in the same topology domain as existing pods
var requiredDomain utiltas.TopologyDomainID
var freedUsage map[utiltas.TopologyDomainID]resources.Requests
if workers.PreviousTopologyAssignment != nil {
```
Please wrap this code in a check for the ElasticJobsViaWorkloadSlices feature gate. I know that currently `workers.PreviousTopologyAssignment != nil` implies the gate is on, but code evolves and relying on that will become harder to verify over time. Using the feature gate also gives a clear indication of why this was introduced.
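A minimal sketch of the suggested guard; `features.Enabled` and the gate name appear in the diff below, while the helper itself is hypothetical:

```go
package example

import (
	"sigs.k8s.io/kueue/pkg/features"
)

// preserveSliceTopology makes the feature-gate dependency explicit instead of
// inferring it from PreviousTopologyAssignment being non-nil.
func preserveSliceTopology(hasPreviousAssignment bool) bool {
	return features.Enabled(features.ElasticJobsViaWorkloadSlices) && hasPreviousAssignment
}
```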
```go
if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
	workloadSliceName = getWorkloadSliceOriginName(ctx, r.client, wl)
}
info, err := getPodSetsInfoFromStatus(ctx, r.client, wl, workloadSliceName)
```
This looks like code duplication with https://github.com/kubernetes-sigs/kueue/pull/8580/changes#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR1062-R1067. Can we please wrap this in a common function, or, even better, fold it into the pre-existing getPodSetsInfoFromStatus?
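A sketch of the suggested consolidation; the wrapper and the `originNameFunc` indirection are made up so the snippet stays self-contained:

```go
package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
	"sigs.k8s.io/kueue/pkg/features"
)

// originNameFunc stands in for the PR's getWorkloadSliceOriginName.
type originNameFunc func(context.Context, client.Client, *kueue.Workload) string

// sliceLookupName puts the feature-gated origin resolution in one place so the
// two call sites (or getPodSetsInfoFromStatus itself) can share it.
func sliceLookupName(ctx context.Context, c client.Client, wl *kueue.Workload, originOf originNameFunc) string {
	if features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return originOf(ctx, c, wl)
	}
	return wl.Name
}
```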
```go
	return fmt.Errorf("setting index pod workload: %w", err)
}

if err := indexer.IndexField(ctx, &corev1.Pod{}, WorkloadSliceNameKey, indexPodWorkloadSliceName); err != nil {
```
Only register the indexer if ElasticJobsViaWorkloadSlices is enabled.
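A self-contained sketch of registering the index only behind the gate; the key string, annotation, and function names are assumptions modelled on the identifiers visible in the diff:

```go
package indexer

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/kueue/pkg/features"
)

// Hypothetical key and annotation; the real ones live in the PR's indexer package.
const workloadSliceNameKey = "metadata.workloadSliceName"

func indexPodWorkloadSliceName(obj client.Object) []string {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return nil
	}
	if name := pod.Annotations["kueue.x-k8s.io/workload-slice-name"]; name != "" {
		return []string{name}
	}
	return nil
}

// setupPodIndexes registers the slice-name index only when the feature is on,
// as the comment above suggests.
func setupPodIndexes(ctx context.Context, fieldIndexer client.FieldIndexer) error {
	if !features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return nil
	}
	if err := fieldIndexer.IndexField(ctx, &corev1.Pod{}, workloadSliceNameKey, indexPodWorkloadSliceName); err != nil {
		return fmt.Errorf("setting index pod workload slice name: %w", err)
	}
	return nil
}
```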
```go
gomega.Expect(k8sClient.Create(ctx, ns2)).To(gomega.Succeed())
defer func() {
	gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns2)).To(gomega.Succeed())
}()

localQueue2 := utiltestingapi.MakeLocalQueue("local-queue", ns2.Name).ClusterQueue(clusterQueue.Name).Obj()
util.MustCreate(ctx, k8sClient, localQueue2)
defer func() {
	gomega.Expect(util.DeleteObject(ctx, k8sClient, localQueue2)).To(gomega.Succeed())
}()
```
Instead of that, please clean up in AfterEach across all tests by calling DeleteNamespace.
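A minimal Ginkgo sketch of the requested pattern, assuming the suite's existing `ctx`, `k8sClient`, and `util` helpers and the `ns2` variable from the test above:

```go
ginkgo.AfterEach(func() {
	// Deleting the namespace cleans up everything created in it,
	// replacing the per-object defers.
	gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns2)).To(gomega.Succeed())
})
```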
mimowo left a comment:
Thank you 👍 This looks very good. The comments are mostly around:
- fixing getWorkloadSliceOrigin to avoid relying on all workloads staying around
- removing code duplication
- removing unnecessary changes to minimize the diff for readability
- making sure all modified places are behind the FG

Other than that the changes look straightforward, and the feature is super valuable while still in alpha, so I'm supportive of including it in 0.16.
I think there are a couple of potential issues here. First, I think we cannot move already running Pods, so we need to ensure the new assignment is a superset of the previous assignment. Looking at the code this is not necessarily the case.

Let's list some of the possible cases:
- `unconstrained` annotation used: looks like the simplest case, we basically run the LeastCapacity algo for the delta of Pods
- `preferred` annotation used: this looks generally harder, because ideally we would place the Pods close to the previous ones, but since this is "best effort" we could probably use "BestFit" for the new delta of Pods
- `required` annotation used, say `required-topology=block`: this might be tricky to implement, really. This is a similar problem to the one we solve in Node Hot Swap, but for multiple Pods. We could probably readily use the algo for one Pod, allowing scale-ups per Pod, but this is generally tricky.

We don't need to solve them all while in alpha. TBH I prefer the PR to just set the direction and be smaller, rather than big but breaking :). I would rather just validate against the unsupported cases. We can easily keep adding support gradually while in alpha. Plus we should test the scale down, ideally covering both with e2e tests.

So I'm thinking, for simplicity, maybe we just start with "unconstrained" in 0.16. This could already be powerful, combining ElasticJobs with TAS to eliminate the quota fragmentation issues. WDYT?
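If the scope is narrowed that way, a hypothetical validation helper could look roughly like this; the `Unconstrained` field name on `PodSetTopologyRequest` is my assumption about the v1beta1 API, not taken from the PR:

```go
package example

import (
	"errors"

	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// checkSliceScaleUpSupported rejects scale-ups whose topology request is anything
// other than "unconstrained", matching the proposed 0.16 scope.
func checkSliceScaleUpSupported(req *kueue.PodSetTopologyRequest) error {
	if req == nil || ptr.Deref(req.Unconstrained, false) {
		return nil
	}
	return errors.New("TAS with ElasticJobsViaWorkloadSlices currently supports only unconstrained topology requests for scale up")
}
```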
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sohankunkerkar

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
> TAS supports ElasticJobsViaWorkloadSlices. When both feature gates are enabled, TAS attempts to preserve topology locality across workload slice transitions during job scaling for workloads with **required** topology requests. Pods are indexed by a stable `kueue.x-k8s.io/workload-slice-name` annotation (using the origin workload name in the slice chain), and the `TopologyAssignment` from the old slice is used to constrain placement of new pods to the same topology domain. If the previous assignment is stale (e.g., node unhealthy), TAS falls back to normal placement. Preferred topology requests do not currently benefit from this preservation.
Let's just refer to the ElasticWorkloads KEP, and discuss the algo etc. in one place.
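For illustration of the preservation mechanism quoted above, a rough sketch of turning the previous slice's `TopologyAssignment` into a required domain; the helper is hypothetical and `utiltas.DomainID` is assumed to match the existing util package:

```go
package example

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
	utiltas "sigs.k8s.io/kueue/pkg/util/tas"
)

// requiredDomainFromPrevious picks the domain of the previous slice's assignment
// to constrain placement of the new pods; an empty result means "no constraint"
// (fall back to normal placement).
func requiredDomainFromPrevious(prev *kueue.TopologyAssignment) utiltas.TopologyDomainID {
	if prev == nil || len(prev.Domains) == 0 {
		return ""
	}
	// Note: this only handles a single-domain previous assignment; multi-domain
	// assignments would need the freed-capacity accounting mentioned in the diff.
	return utiltas.DomainID(prev.Domains[0].Values)
}
```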
> #### Topology Assignment Preservation
Suggested change: rename the heading `#### Topology Assignment Preservation` to `#### Scale Up`.
> …the `TopologyAssignment` from the old slice and passes it to TAS. TAS attempts to place new pods in the same topology domain as existing pods. If the previous assignment is stale (e.g., nodes are unhealthy), TAS falls back to normal placement.
Let's clarify that for the first iteration of the integration in 0.16 we will focus on supporting only "unconstrained" TAS for scale up. Supporting scale down and the other modes, "preferred" and "required", will be the subject of future iterations of the feature.
Please also add "Full integration with Topology-Aware Scheduling, or clear validation against unsupported options" to the graduation criteria for Beta.
mimowo left a comment:
I think for readability I would suggest actually decoupling the KEP update and the implementation into separate PRs. These can then also be reviewed by other folks.
This change enables Topology-Aware Scheduling (TAS) to work correctly with the ElasticJobsViaWorkloadSlices feature, which allows jobs to dynamically scale via workload slices. When ElasticJobsViaWorkloadSlices is enabled and a job name is present, the job annotation takes precedence over the workload annotation for pod lookups, ensuring all pods across slices are found correctly.

Signed-off-by: Sohan Kunkerkar <[email protected]>
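A hypothetical sketch of the lookup precedence the commit message describes; the index keys and helper are made up for illustration:

```go
package example

import (
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/kueue/pkg/features"
)

// Hypothetical index keys; the PR defines its own constants in pkg/controller/tas/indexer.
const (
	jobNameKey      = "metadata.jobName"
	workloadNameKey = "metadata.workloadName"
)

// podLookupField chooses which pod index to query: the job-name index when the
// feature gate is on and a job name is available, otherwise the workload index.
func podLookupField(jobName, workloadName string) client.MatchingFields {
	if features.Enabled(features.ElasticJobsViaWorkloadSlices) && jobName != "" {
		return client.MatchingFields{jobNameKey: jobName}
	}
	return client.MatchingFields{workloadNameKey: workloadName}
}
```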
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #8160
Special notes for your reviewer:
When both TopologyAwareScheduling and ElasticJobsViaWorkloadSlices feature gates are enabled, Kueue now preserves topology locality across workload slice transitions during job scaling.
Does this PR introduce a user-facing change?