
Scheduling uses only the first term of PersistentVolume nodeAffinity #2742

@tallaxes

Description


Karpenter's volume topology scheduling logic currently considers only the first term of PersistentVolume nodeAffinity. When a volume defines multiple topology terms (which are ORed together), Karpenter fails to schedule pods across all available options. This is particularly problematic with CSI drivers that generate PersistentVolumes with multiple allowed topologies - for example, the Azure CSI driver creates PVs for ZRS (Zone-Redundant Storage) disks with a separate term for each available zone. The same problem applies to StorageClass allowedTopologies, though multiple terms there, while allowed, do not seem to be used much in practice.

// Terms are ORed, only use the first term
requirements = pv.Spec.NodeAffinity.Required.NodeSelectorTerms[0].MatchExpressions

// Terms are ORed, only use the first term
for _, requirement := range storageClass.AllowedTopologies[0].MatchLabelExpressions {
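For illustration, here is a minimal sketch of what handling all terms could look like, using only upstream corev1 types (this is not Karpenter's actual code, and the helper name is hypothetical): when every term is a single In expression on the same key, as in the ZRS PV below, the ORed terms can be merged into one equivalent requirement; anything more general still needs real OR handling in the scheduler.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// mergeSingleKeyTerms merges ORed nodeSelectorTerms into one requirement,
// but only when every term is exactly one "key In [...]" expression on the
// same key (the shape external-provisioner emits for the ZRS example below).
// It returns false when the terms are more general and cannot be merged.
func mergeSingleKeyTerms(terms []v1.NodeSelectorTerm) (v1.NodeSelectorRequirement, bool) {
	var merged v1.NodeSelectorRequirement
	for i, term := range terms {
		if len(term.MatchExpressions) != 1 || term.MatchExpressions[0].Operator != v1.NodeSelectorOpIn {
			return v1.NodeSelectorRequirement{}, false
		}
		expr := term.MatchExpressions[0]
		if i == 0 {
			merged = v1.NodeSelectorRequirement{Key: expr.Key, Operator: v1.NodeSelectorOpIn}
		} else if expr.Key != merged.Key {
			return v1.NodeSelectorRequirement{}, false
		}
		merged.Values = append(merged.Values, expr.Values...)
	}
	return merged, true
}

func main() {
	zoneKey := "topology.disk.csi.azure.com/zone"
	terms := []v1.NodeSelectorTerm{
		{MatchExpressions: []v1.NodeSelectorRequirement{{Key: zoneKey, Operator: v1.NodeSelectorOpIn, Values: []string{"westus3-1"}}}},
		{MatchExpressions: []v1.NodeSelectorRequirement{{Key: zoneKey, Operator: v1.NodeSelectorOpIn, Values: []string{"westus3-2"}}}},
	}
	if merged, ok := mergeSingleKeyTerms(terms); ok {
		// Prints: topology.disk.csi.azure.com/zone In [westus3-1 westus3-2]
		fmt.Printf("%s %s %v\n", merged.Key, merged.Operator, merged.Values)
	}
}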

Observed Behavior:

When a PersistentVolume (or a StorageClass) specifies multiple topology terms (e.g. multiple zones, regions, or other topology keys), only the first term's constraints are applied to node affinity. For example, with the following affinity in the PV, Karpenter will only consider westus3-1 as allowed:

nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-1]}]
    - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-2]}]
    - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-3]}]
    - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-4]}]

Note that an equivalent alternative representation would have worked:

nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-1, westus3-2, westus3-3, westus3-4]}]

... but this is not what the CSI driver generates. It does not appear to have a choice: the driver produces "accessible topology" segment maps, and external-provisioner translates each segment map into a nodeSelectorTerm. This OR-of-ANDs with single-valued keys exists for a good reason and most likely cannot be converted into a shorter form in the general case.
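For context, a rough sketch of that translation (this mirrors the observed output, not the external-provisioner source, and the function name is made up): each CSI accessible-topology segment map becomes one nodeSelectorTerm, with a single-valued In expression per key, and the terms are ORed.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// termsFromSegments shows the shape of the translation: one ORed term per
// topology segment, with the segment's key/value pairs ANDed inside the term.
func termsFromSegments(segments []map[string]string) []v1.NodeSelectorTerm {
	var terms []v1.NodeSelectorTerm
	for _, segment := range segments {
		var exprs []v1.NodeSelectorRequirement
		for key, value := range segment {
			exprs = append(exprs, v1.NodeSelectorRequirement{
				Key:      key,
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{value},
			})
		}
		terms = append(terms, v1.NodeSelectorTerm{MatchExpressions: exprs})
	}
	return terms
}

func main() {
	// Two single-key segments, as the Azure driver reports for a ZRS disk.
	segments := []map[string]string{
		{"topology.disk.csi.azure.com/zone": "westus3-1"},
		{"topology.disk.csi.azure.com/zone": "westus3-2"},
	}
	for _, term := range termsFromSegments(segments) {
		fmt.Println(term.MatchExpressions)
	}
}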

As a result:

  • The OR semantics of multiple topology terms are lost, and the cross-product of requirements is not computed
  • Pod anti-affinity rules combined with multi-term volumes cannot properly distribute pods as intended
  • Workloads using CSI-provisioned volumes with multiple allowed topologies (e.g. Azure ZRS) fail to schedule correctly

Expected Behavior:

  • All topology terms specified in StorageClass AllowedTopologies and PersistentVolume NodeAffinity.Required should be respected
  • When volumes specify multiple topology terms (whether for zones, regions, or other topology keys), the scheduler should treat these as ORed options (see the sketch after this list)
  • Scheduling simulation should properly compute the cartesian product of node affinity terms to maintain correct AND/OR semantics
  • Pods should be able to schedule across all available topology options when combined with pod affinity/anti-affinity constraints
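As a minimal illustration of the OR semantics expected above (a hypothetical helper, not a proposal for Karpenter's requirements model): a pod's zone requirement should be accepted if any one of the PV's ORed terms permits that zone.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// zoneAllowedByAnyTerm reports whether at least one ORed nodeSelectorTerm
// permits the given zone value for the given topology key. A term with no
// expression on that key does not constrain it and therefore matches.
// Non-In operators are ignored here to keep the sketch short.
func zoneAllowedByAnyTerm(terms []v1.NodeSelectorTerm, key, zone string) bool {
	for _, term := range terms {
		matches := true
		for _, expr := range term.MatchExpressions {
			if expr.Key != key || expr.Operator != v1.NodeSelectorOpIn {
				continue
			}
			allowed := false
			for _, value := range expr.Values {
				if value == zone {
					allowed = true
				}
			}
			if !allowed {
				matches = false
			}
		}
		if matches {
			return true
		}
	}
	return false
}

func main() {
	zoneKey := "topology.disk.csi.azure.com/zone"
	terms := []v1.NodeSelectorTerm{
		{MatchExpressions: []v1.NodeSelectorRequirement{{Key: zoneKey, Operator: v1.NodeSelectorOpIn, Values: []string{"westus3-1"}}}},
		{MatchExpressions: []v1.NodeSelectorRequirement{{Key: zoneKey, Operator: v1.NodeSelectorOpIn, Values: []string{"westus3-2"}}}},
	}
	// Only the first term is checked today; with OR semantics this prints true.
	fmt.Println(zoneAllowedByAnyTerm(terms, zoneKey, "westus3-2"))
}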

Reproduction Steps (Please include YAML):

For this to manifest with a PV, the following needs to be true:

  • The PV has more than one term (e.g. more than one allowed zone)
  • The PV exists when the Pod is being scheduled (either use Immediate binding mode, or delete and re-schedule the pod after the PV is provisioned)
  • The Pod has constraints that go beyond the first PV term (e.g. it references the second zone)

Example for Azure, using a StorageClass with ZRS and immediate provisioning:

Create StorageClass for ZRS and immediate binding:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
volumeBindingMode: Immediate

Create a PersistentVolumeClaim referencing that StorageClass:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zrs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi-zrs
  resources:
    requests:
      storage: 10Gi

At this point the CSI driver will provision a PV that looks something like this (note the multiple nodeAffinity terms for zones):

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2025-12-24T03:33:21Z"
  finalizers:
  - external-provisioner.volume.kubernetes.io/finalizer
  - kubernetes.io/pv-protection
  name: pvc-4e13e628-ef1b-4d32-98f5-5735b5147886
  resourceVersion: "1242403"
  uid: 02ba67e0-2b40-4f78-8d72-294e4e61191e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: zrs-pvc
    namespace: default
    resourceVersion: "1242387"
    uid: 4e13e628-ef1b-4d32-98f5-5735b5147886
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      csi.storage.k8s.io/pv/name: pvc-4e13e628-ef1b-4d32-98f5-5735b5147886
      csi.storage.k8s.io/pvc/name: zrs-pvc
      csi.storage.k8s.io/pvc/namespace: default
      requestedsizegib: "10"
      skuName: Premium_ZRS
      storage.kubernetes.io/csiProvisionerIdentity: <snip>
    volumeHandle: <snip>
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - westus3-1
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - westus3-2
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - westus3-3
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - westus3-4
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - ""
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-csi-zrs
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-12-24T03:33:21Z"
  phase: Bound

Create a pod referencing that PVC, but requesting the second zone:

apiVersion: v1
kind: Pod
metadata:
  name: pod-in-zone-2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - westus3-2  # requesting second zone
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: zrs-pvc

Expected: The pod should schedule successfully in westus3-2 (the second allowed zone).
Actual: The pod fails to schedule because Karpenter only considers the first zone (westus3-1). The event on the pod will include the requirement "topology.kubernetes.io/zone DoesNotExist".

Versions:

  • Chart Version: 1.6.x (but should be the same in later versions)
  • Kubernetes Version (kubectl version): tested on 1.33.5 (but should be the same on others)
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
