Description
What happened:
Under bursty traffic conditions with Flow Control enabled, requests may be enqueued and subsequently orphaned. These requests are never successfully dispatched and eventually time out with a TTL expiry or context cancellation error.
Investigation shows that the managedQueue instances for these flows can be removed from the Shard's priorityBands map while requests are still pending inside them. This occurs because the FlowRegistry's "garbage collector" relies on a leaseCount that tracks only the duration of the distribution phase (the WithConnection call on the enqueue path), not the full request lifecycle. Once the distribution phase finishes, the lease count drops to zero. If a request then stays queued for longer than FlowGCTimeout and no new enqueues arrive for that flow (traffic is idle), the GC deletes the flow state, effectively "orphaning" the queue and making it invisible to the processor's dispatch loop.
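The sequence above can be modeled with a small single-threaded sketch. This is not the project's code: the `flow`, `registry`, `withConnection`, `gc`, and `simulateOrphan` names are hypothetical stand-ins, the 5-minute timeout is an arbitrary value, and locking is omitted so that the ordering of events is easy to follow.

```go
package main

import (
	"fmt"
	"time"
)

// flow is a hypothetical, simplified stand-in for per-flow registry state.
type flow struct {
	leaseCount int       // held only while the distribution closure runs
	lastLease  time.Time // when the most recent lease was released
	queued     int       // requests still waiting in the flow's queue
}

// registry models the flow map that the garbage collector prunes.
type registry struct {
	flows     map[string]*flow
	gcTimeout time.Duration
}

// withConnection mimics the lease scoping described in this issue:
// the flow is leased only for the duration of fn.
func (r *registry) withConnection(key string, now time.Time, fn func(f *flow)) {
	f, ok := r.flows[key]
	if !ok {
		f = &flow{}
		r.flows[key] = f
	}
	f.leaseCount++
	fn(f)
	f.leaseCount--
	f.lastLease = now
}

// gc deletes flows that have no leases and have been idle longer than
// gcTimeout. It never looks at queued, which is the bug being modeled.
// It returns the number of requests that were still queued in deleted flows.
func (r *registry) gc(now time.Time) (orphaned int) {
	for key, f := range r.flows {
		if f.leaseCount == 0 && now.Sub(f.lastLease) > r.gcTimeout {
			orphaned += f.queued
			delete(r.flows, key)
		}
	}
	return orphaned
}

// simulateOrphan enqueues one request, lets the flow go idle past the
// timeout, and runs a GC cycle.
func simulateOrphan() int {
	start := time.Unix(0, 0)
	r := &registry{flows: map[string]*flow{}, gcTimeout: 5 * time.Minute}

	// Distribution phase: the request is placed in the queue under a lease.
	r.withConnection("tenant-a", start, func(f *flow) { f.queued++ })

	// No new enqueues arrive; the request is still queued when GC runs.
	return r.gc(start.Add(6 * time.Minute))
}

func main() {
	fmt.Println("orphaned requests:", simulateOrphan()) // orphaned requests: 1
}
```

In this model the request is still counted in `queued` when the flow is deleted, which corresponds to the orphaned state reported here.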
What you expected to happen:
A Flow (and its associated queues on the shards) must not be garbage collected as long as at least one active request is associated with it, regardless of whether that request is in the distribution phase or waiting in a queue.
Anything else we need to know?:
The current FlowRegistry GC logic uses a sync.RWMutex (gcLock) and an atomic leaseCount. The leaseCount is incremented only while the closure passed to WithConnection is executing. Currently, the controller wraps only the tryDistribution logic in this closure.
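One possible direction, sketched below under stated assumptions, is to hold the lease until the request reaches a terminal state (dispatched, TTL expired, or cancelled) rather than only during distribution. The `flowRegistry`, `flowState`, `acquire`, `collect`, and `demoLifecycleLease` names are hypothetical and do not reflect the real FlowRegistry API. The idle-timeout check is left out for brevity, so `collect` removes any flow whose lease count is zero.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flowState is a hypothetical stand-in for the registry's per-flow state.
type flowState struct {
	leaseCount atomic.Int64 // number of in-flight requests pinning the flow
}

// flowRegistry is a hypothetical, reduced registry with a GC lock.
type flowRegistry struct {
	gcLock sync.RWMutex
	flows  map[string]*flowState
}

// acquire pins the flow for the whole request lifecycle. The returned
// release function is meant to be called when the request reaches any
// terminal state; sync.Once makes repeated calls harmless.
func (r *flowRegistry) acquire(key string) (release func()) {
	r.gcLock.Lock() // write lock, because the flow may need to be created
	f, ok := r.flows[key]
	if !ok {
		f = &flowState{}
		r.flows[key] = f
	}
	f.leaseCount.Add(1)
	r.gcLock.Unlock()

	var once sync.Once
	return func() { once.Do(func() { f.leaseCount.Add(-1) }) }
}

// collect removes every flow with no outstanding leases and returns how
// many flows were removed.
func (r *flowRegistry) collect() int {
	r.gcLock.Lock()
	defer r.gcLock.Unlock()
	removed := 0
	for key, f := range r.flows {
		if f.leaseCount.Load() == 0 {
			delete(r.flows, key)
			removed++
		}
	}
	return removed
}

// demoLifecycleLease runs GC once while a request is queued and once
// after it has been released.
func demoLifecycleLease() (whileQueued, afterRelease int) {
	r := &flowRegistry{flows: map[string]*flowState{}}
	release := r.acquire("tenant-a") // request enqueued
	whileQueued = r.collect()        // flow is pinned, nothing removed
	release()                        // request dispatched, expired, or cancelled
	afterRelease = r.collect()       // flow is now eligible for removal
	return whileQueued, afterRelease
}

func main() {
	a, b := demoLifecycleLease()
	fmt.Println(a, b) // 0 1
}
```

Returning a release function keeps the pairing of acquire and release explicit at the call site, at the cost of requiring every terminal path (dispatch, expiry, cancellation) to call it.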
Environment:
- Inference extension version: main (Experimental Flow Control layer enabled)