What would you like to be added:
Support for engine-aware metric collection in the EPP, allowing a single InferencePool to include Pods running different inference engines (e.g., vLLM, SGLang) while still collecting the correct metrics per Pod.
This would enable use cases such as:
- A/B testing between inference engines
- Gradual engine migration within the same pool
- Mixed-engine pools with a unified routing layer
One possible design direction could include:
- A Mapping Registry that defines metric mappings per engine type
- Runtime selection of metric mappings based on Pod labels (e.g., `inference.k8s.io/engine-type`)
- Optional configuration flags to enable multi-engine support
- Backward compatibility, where existing single-engine setups continue to work by falling back to a default mapping
(This is just one possible approach and is open to discussion; a rough sketch follows below.)
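For illustration only, a minimal Go sketch of such a registry might look like the following. The `MetricMapping` type, the `registry` variable, and the label constant are hypothetical names for discussion, not existing EPP APIs; the metric names are the vLLM and SGLang queue metrics referenced under "Why is this needed" below:

```go
package metrics

// MetricMapping translates the EPP's internal metric identifiers to the
// engine-specific names scraped from a Pod's /metrics endpoint.
// (Type and field names here are hypothetical placeholders.)
type MetricMapping struct {
	// QueueDepth is the engine's metric for the number of queued requests.
	QueueDepth string
}

// EngineTypeLabel is the suggested Pod label used to select a mapping.
const EngineTypeLabel = "inference.k8s.io/engine-type"

// registry holds one mapping per known engine type; in a real
// implementation these entries could be loaded from configuration
// rather than hard-coded.
var registry = map[string]MetricMapping{
	"vllm":   {QueueDepth: "vllm:num_requests_waiting"},
	"sglang": {QueueDepth: "sglang:num_queue_reqs"},
}
```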
Why is this needed:
Currently, the EPP assumes that all Pods in an InferencePool expose identical metric names. This assumption breaks down in real-world scenarios where:
- Different inference engines expose different metric schemas
- Users want to compare or migrate engines without splitting pools
For example, the queued-request count goes by a different metric name per engine:
- vLLM: `vllm:num_requests_waiting`
- SGLang: `sglang:num_queue_reqs`
When such Pods are mixed, metric collection fails or becomes unreliable, preventing effective routing decisions.
Supporting engine-aware metric collection would make InferencePool more flexible and better aligned with real deployment patterns.
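To make the backward-compatibility point concrete, a scrape-time lookup could resolve the mapping per Pod roughly as follows. This is again a hypothetical sketch (reusing the `registry`, `MetricMapping`, and `EngineTypeLabel` names from the snippet above); `mappingFor` is not an existing EPP function:

```go
package metrics

import corev1 "k8s.io/api/core/v1"

// mappingFor picks the metric mapping for a Pod. Pods labeled with a
// known engine type get that engine's mapping; unlabeled Pods (or Pods
// with an unrecognized engine) fall back to the default mapping, so
// existing single-engine pools keep working unchanged.
func mappingFor(pod *corev1.Pod, fallback MetricMapping) MetricMapping {
	if engine, ok := pod.Labels[EngineTypeLabel]; ok {
		if m, found := registry[engine]; found {
			return m
		}
	}
	return fallback
}
```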
If this direction aligns with the project goals, I’m happy to follow up with a PR or contribute an initial implementation for discussion.