Kueue does not remove the scheduling gate from Ray’s redis-cleanup jobs #8443

@ns-sundar

What happened:
In a K8s cluster with Kueue and KubeRay 1.4.0, I deployed a Ray Serve workload through a Kueue-managed queue, with the kueue.x-k8s.io/elastic-job: "true" annotation and spec.enableInTreeAutoscaling: true. When the Ray Serve workload is terminated, KubeRay launches a redis-cleanup batch job, which runs a pod to clean up the Redis config used by the Ray Cluster (RC) that was launched for Ray Serve.
With Kueue, the redis-cleanup pod stays scheduling-gated and in the Pending state. Some logic in KubeRay eventually kills it, but the cleanup never happens.
This is because Kueue does not remove the scheduling gate from the redis-cleanup pod. Analysis below.

What you expected to happen:
I expected the redis-cleanup pod to run to completion and the RC to exit gracefully.

How to reproduce it (as minimally and precisely as possible):

  • Deploy a Ray Serve CR via a Kueue-managed queue, with the annotation kueue.x-k8s.io/elastic-job: "true" and spec.enableInTreeAutoscaling: true.
  • Terminate the Ray Serve deployment.

Anything else we need to know?:
Here's my analysis so far:

  • The user submits a Ray Service CR with the above specs.
  • KubeRay's Ray Service Controller creates a Ray Cluster (RC) from it.
  • Kueue's Ray Cluster webhook intercepts it and sets the pod scheduling gate in the RC's pod spec:
    schedulingGates:
    - name: kueue.x-k8s.io/elastic-job
  • When the RC is terminated, KubeRay creates a batch/v1/Job named redis-cleanup, apparently from the head group spec of the RC. This job inherits the scheduling gate.
  • The batch/v1/Job creates a redis-cleanup pod, which also inherits the scheduling gate. This pod's owner reference is the batch/v1/Job, not the Ray Cluster.
  • Kueue reacts but does not remove the scheduling gate from the pod, because the pod is owned by the Job, not the RC. Kueue only looks for pods owned by the RC.

Candidate solution: Modify the batch job webhook in Kueue to remove the scheduling gate from the job if (a) it is owned by a ray.io/v1/RayCluster, AND (b) it has the label "ray.io/node-type = redis-cleanup".

Rationale for the solution: Kueue does not need to handle autoscaling for the redis-cleanup pod, so it need not be treated as an elastic job.

Environment:

  • Kubernetes version (use kubectl version):
    • Client Version: v1.32.2
    • Kustomize Version: v5.5.0
    • Server Version: v1.32.9-eks-3025e55
  • Kueue version (use git describe --tags --dirty --always): 0.14.3
  • Cloud provider or hardware configuration: AWS EKS v1.32.9
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Labels: kind/bug, priority/important-soon
