Description
Karpenter's volume topology scheduling logic currently only considers the first term from PersistentVolume nodeAffinity. When volumes define multiple topology terms (which are ORed together), Karpenter fails to properly schedule pods across all available options. This is particularly problematic with CSI drivers that generate PersistentVolumes with multiple allowed topologies - for example, the Azure CSI driver for ZRS (Zone-Redundant Storage) disks creates PVs with separate terms for each available zone. The same problem applies to StorageClass allowedTopologies - though, while allowed, I don't think multiple terms are used much in practice.
karpenter/pkg/controllers/provisioning/scheduling/volumetopology.go (lines 146 to 147 at 85a3446):

```go
// Terms are ORed, only use the first term
requirements = pv.Spec.NodeAffinity.Required.NodeSelectorTerms[0].MatchExpressions
```

karpenter/pkg/controllers/provisioning/scheduling/volumetopology.go (lines 125 to 126 at 85a3446):

```go
// Terms are ORed, only use the first term
for _, requirement := range storageClass.AllowedTopologies[0].MatchLabelExpressions {
```
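Both call sites drop every term after the first. For the common CSI pattern described in this issue, where every ORed term constrains the same single key with operator `In`, the terms could in principle be collapsed into one requirement whose values are the union. A minimal sketch of that special case, using hypothetical stand-in types rather than the real `k8s.io/api/core/v1` ones:

```go
package main

import "fmt"

// Stand-ins for the Kubernetes NodeSelector types (illustrative only).
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

type Term struct {
	MatchExpressions []Requirement
}

// unionSingleKeyTerms handles the common CSI pattern where every ORed term
// constrains the same key with operator In: the terms collapse into one
// requirement whose values are the union. It returns false when the terms
// do not fit that pattern and a full cross-product would be needed instead.
func unionSingleKeyTerms(terms []Term) (Requirement, bool) {
	if len(terms) == 0 {
		return Requirement{}, false
	}
	merged := Requirement{Operator: "In"}
	seen := map[string]bool{}
	for _, t := range terms {
		if len(t.MatchExpressions) != 1 {
			return Requirement{}, false
		}
		e := t.MatchExpressions[0]
		if e.Operator != "In" || (merged.Key != "" && e.Key != merged.Key) {
			return Requirement{}, false
		}
		merged.Key = e.Key
		for _, v := range e.Values {
			if !seen[v] {
				seen[v] = true
				merged.Values = append(merged.Values, v)
			}
		}
	}
	return merged, true
}

func main() {
	terms := []Term{
		{MatchExpressions: []Requirement{{Key: "topology.disk.csi.azure.com/zone", Operator: "In", Values: []string{"westus3-1"}}}},
		{MatchExpressions: []Requirement{{Key: "topology.disk.csi.azure.com/zone", Operator: "In", Values: []string{"westus3-2"}}}},
	}
	merged, ok := unionSingleKeyTerms(terms)
	fmt.Println(ok, merged.Values) // true [westus3-1 westus3-2]
}
```

Note this only covers the single-key case; as discussed below, mixed-key OR-of-ANDs cannot be flattened this way.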
Observed Behavior:
When a PersistentVolume (or a StorageClass) specifies multiple topology terms (e.g., multiple zones, regions, or other topology keys), only the first term's constraints are applied to node affinity. For example, with the following affinity in a PV, Karpenter will only consider westus3-1 as allowed:

```yaml
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-1]}]
      - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-2]}]
      - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-3]}]
      - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-4]}]
```

Note that an equivalent alternative representation would have worked:
```yaml
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions: [{key: topology.disk.csi.azure.com/zone, operator: In, values: [westus3-1, westus3-2, westus3-3, westus3-4]}]
```

... but this is not what the CSI driver generates. The driver appears to have no choice: it produces "accessible topology" segment maps, and external-provisioner translates each segment map into a nodeSelectorTerm. This OR-of-ANDs with single-valued keys exists for a good reason and most likely cannot be converted into a shorter form in the general case.
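The segment-map-to-term translation described above can be sketched as follows (stand-in types, not the actual external-provisioner code, which lives in kubernetes-csi/external-provisioner):

```go
package main

import (
	"fmt"
	"sort"
)

// Stand-ins for the Kubernetes NodeSelector types (illustrative only).
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

type Term struct {
	MatchExpressions []Requirement
}

// segmentsToTerms mirrors the shape of the translation: each CSI "accessible
// topology" segment map becomes one nodeSelectorTerm, and each key/value pair
// in the map becomes one single-valued In requirement. The resulting terms
// are ORed, which is why a ZRS disk ends up with one term per zone.
func segmentsToTerms(segments []map[string]string) []Term {
	terms := make([]Term, 0, len(segments))
	for _, seg := range segments {
		keys := make([]string, 0, len(seg))
		for k := range seg {
			keys = append(keys, k)
		}
		sort.Strings(keys) // deterministic order for the example
		var t Term
		for _, k := range keys {
			t.MatchExpressions = append(t.MatchExpressions, Requirement{Key: k, Operator: "In", Values: []string{seg[k]}})
		}
		terms = append(terms, t)
	}
	return terms
}

func main() {
	terms := segmentsToTerms([]map[string]string{
		{"topology.disk.csi.azure.com/zone": "westus3-1"},
		{"topology.disk.csi.azure.com/zone": "westus3-2"},
	})
	fmt.Println(len(terms)) // 2
}
```

Because a segment may carry several keys (zone plus region, for instance), the general output is an OR of ANDs, which has no single-term equivalent.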
As a result:
- The OR semantics of multiple topology terms are lost, and the cross-product of requirements is not computed
- Pod anti-affinity rules combined with multi-term volumes cannot properly distribute pods as intended
- Workloads using CSI-provisioned volumes with multiple allowed topologies (e.g. Azure ZRS) fail to schedule correctly
Expected Behavior:
- All topology terms specified in StorageClass `AllowedTopologies` and PersistentVolume `NodeAffinity.Required` should be respected
- When volumes specify multiple topology terms (whether for zones, regions, or other topology keys), the scheduler should treat these as ORed options
- Scheduling simulation should properly compute the cartesian product of node affinity terms to maintain correct AND/OR semantics
- Pods should be able to schedule across all available topology options when combined with pod affinity/anti-affinity constraints
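The cartesian-product behavior described above could look roughly like this: AND every pod term with every volume term, so that the OR-of-ANDs semantics of both inputs survive. A sketch with stand-in types, not Karpenter's actual implementation:

```go
package main

import "fmt"

// Stand-ins for the Kubernetes NodeSelector types (illustrative only).
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

type Term struct {
	MatchExpressions []Requirement
}

// crossProduct combines a pod's ORed node selector terms with a volume's
// ORed topology terms. Each output term is the AND of one pod term and one
// volume term, so the result preserves the combined OR-of-ANDs semantics
// instead of discarding all but the first volume term.
func crossProduct(podTerms, volumeTerms []Term) []Term {
	out := make([]Term, 0, len(podTerms)*len(volumeTerms))
	for _, p := range podTerms {
		for _, v := range volumeTerms {
			var combined Term
			combined.MatchExpressions = append(combined.MatchExpressions, p.MatchExpressions...)
			combined.MatchExpressions = append(combined.MatchExpressions, v.MatchExpressions...)
			out = append(out, combined)
		}
	}
	return out
}

func main() {
	pod := []Term{
		{MatchExpressions: []Requirement{{Key: "topology.kubernetes.io/zone", Operator: "In", Values: []string{"westus3-2"}}}},
	}
	vol := []Term{
		{MatchExpressions: []Requirement{{Key: "topology.disk.csi.azure.com/zone", Operator: "In", Values: []string{"westus3-1"}}}},
		{MatchExpressions: []Requirement{{Key: "topology.disk.csi.azure.com/zone", Operator: "In", Values: []string{"westus3-2"}}}},
	}
	fmt.Println(len(crossProduct(pod, vol))) // 2
}
```

The product can grow multiplicatively with the number of terms, so a real fix would likely want to deduplicate or prune contradictory combinations.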
Reproduction Steps (Please include YAML):
For this to manifest on PV, the following needs to be true:
- PV has more than one term (e.g. more than one allowed zone)
- PV exists when Pod is being scheduled (can either use Immediate binding mode, or delete/re-schedule pod after PV is provisioned)
- Pod has constraints beyond the first PV term (e.g. it references the second zone)
Example for Azure, using StorageClass with ZRS and immediate provisioning:
Create StorageClass for ZRS and immediate binding:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
volumeBindingMode: Immediate
```

Create a PersistentVolumeClaim referencing that StorageClass:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zrs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi-zrs
  resources:
    requests:
      storage: 10Gi
```

At this point the CSI driver will provision a PV that looks something like this (note the multiple nodeAffinity terms for zones):
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2025-12-24T03:33:21Z"
  finalizers:
    - external-provisioner.volume.kubernetes.io/finalizer
    - kubernetes.io/pv-protection
  name: pvc-4e13e628-ef1b-4d32-98f5-5735b5147886
  resourceVersion: "1242403"
  uid: 02ba67e0-2b40-4f78-8d72-294e4e61191e
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: zrs-pvc
    namespace: default
    resourceVersion: "1242387"
    uid: 4e13e628-ef1b-4d32-98f5-5735b5147886
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      csi.storage.k8s.io/pv/name: pvc-4e13e628-ef1b-4d32-98f5-5735b5147886
      csi.storage.k8s.io/pvc/name: zrs-pvc
      csi.storage.k8s.io/pvc/namespace: default
      requestedsizegib: "10"
      skuName: Premium_ZRS
      storage.kubernetes.io/csiProvisionerIdentity: <snip>
    volumeHandle: <snip>
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone
              operator: In
              values:
                - westus3-1
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone
              operator: In
              values:
                - westus3-2
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone
              operator: In
              values:
                - westus3-3
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone
              operator: In
              values:
                - westus3-4
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone
              operator: In
              values:
                - ""
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-csi-zrs
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2025-12-24T03:33:21Z"
  phase: Bound
```

Create a pod referencing that PVC, but requesting the second zone:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-in-zone-2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - westus3-2 # requesting second zone
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: zrs-pvc
```

Expected: Pod should schedule successfully in westus3-2 (the second allowed zone).
Actual: Pod fails to schedule because Karpenter only considers the first zone (westus3-1). The event on the pod will include requirement "topology.kubernetes.io/zone DoesNotExist".
Versions:
- Chart Version: 1.6.x (but should be the same in later versions)
- Kubernetes Version (`kubectl version`): tested on 1.33.5 (but should be the same on others)