feat: Implement LLM call caching #243
base: main
Conversation
…rrency control for message classification.
📝 Walkthrough

Adds a new asynchronous LLM caching layer (TTL LRU + in-flight dedupe), message normalization and regex fast-path classification, and integrates these into ClassificationRouter with concurrency limiting, message length guards, metrics, and fallback triage behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Message Request
    participant Router as ClassificationRouter
    participant Helper as cache_helpers
    participant Cache as TTL Cache
    participant Inflight as In-flight Registry
    participant LLM as LLM Service
    Client->>Router: should_process_message(message, context)
    Router->>Helper: normalize_message(message)
    Helper-->>Router: normalized
    Router->>Helper: is_simple_message(normalized)
    Helper-->>Router: fast-path result or None
    alt fast-path
        Router-->>Client: return classification
    else
        Router->>Helper: key_for_normalized(normalized, ...)
        Router->>Helper: get_cached_by_normalized(...)
        Helper->>Cache: lookup
        Cache-->>Helper: cached payload or miss
        Helper-->>Router: payload or None
        alt cache hit
            Router-->>Client: return cached classification
        else cache miss
            Router->>Inflight: check/register future for key
            alt another inflight exists
                Inflight-->>Router: await shared future -> result
            else owner
                Router->>LLM: ainvoke(triage prompt), guarded by semaphore
                LLM-->>Router: response
                Router->>Helper: set_cached_by_normalized(..., payload)
                Router->>Inflight: resolve future + cleanup
            end
            Router-->>Client: return classification
        end
    end
```
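To make the flow concrete, here is a minimal, self-contained sketch of the lookup order the diagram shows: regex fast path, then TTL cache, then in-flight dedupe, then a semaphore-guarded LLM call. The greeting pattern, the key function, and the injected `classify_llm` callable are illustrative assumptions, not the PR's actual code.

```python
import asyncio
import hashlib
import re

from cachetools import TTLCache

_cache: TTLCache = TTLCache(maxsize=4096, ttl=3600)  # bounded, expiring result cache
_inflight: dict[str, asyncio.Future] = {}            # key -> future shared by concurrent callers
_llm_semaphore = asyncio.Semaphore(10)               # cap concurrent LLM calls

_GREETING = re.compile(r"^(hi|hello|hey)[\s!.]*$")   # illustrative fast-path pattern


def _cache_key(normalized: str) -> str:
    # Deterministic key over the normalized message (hashlib fallback path).
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


async def classify(message: str, classify_llm) -> dict:
    # 1) Normalize: truncate, collapse whitespace, lowercase.
    normalized = " ".join(message[:10_000].split()).lower()
    # 2) Regex fast path: trivial messages never reach the LLM.
    if _GREETING.match(normalized):
        return {"intent": "greeting", "original_message": normalized}
    key = _cache_key(normalized)
    # 3) TTL cache hit: reuse a recent classification.
    if key in _cache:
        return _cache[key]
    # 4) In-flight dedupe: piggyback on an identical request already running.
    if key in _inflight:
        return await _inflight[key]
    future: asyncio.Future = asyncio.get_running_loop().create_future()
    _inflight[key] = future
    try:
        # 5) This caller owns the fetch; the semaphore bounds concurrency.
        async with _llm_semaphore:
            result = await classify_llm(normalized)
        _cache[key] = result
        future.set_result(result)
        return result
    except Exception as exc:
        future.set_exception(exc)  # wake followers with the same error
        raise
    finally:
        _inflight.pop(key, None)
```

The review threads below flag the harder parts of this pattern: owner-task tracking and cancellation propagation.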
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧹 Recent nitpick comments
📜 Recent review details

Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
backend/app/classification/cache_helpers.py
🧰 Additional context used

🪛 Ruff (0.14.11)
backend/app/classification/cache_helpers.py
17-17: Do not catch blind exception: Exception (BLE001)
103-103: Do not catch blind exception: Exception (BLE001)
188-188: Consider moving this statement to an else block (TRY300)

🔇 Additional comments (5)
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
@smokeyScraper @chandansgowda Can you review this? I think I have implemented it the best I could; any insights from you guys?
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@backend/app/classification/cache_helpers.py`:
- Around line 62-80: The _PATTERN_CLASSIFICATION mapping contains orphan keys
"integration" and "architecture" that have no corresponding regex entries in
_PATTERNS, so is_simple_message() can never match them; fix by either removing
those two keys from _PATTERN_CLASSIFICATION or adding corresponding regex
patterns to _PATTERNS (e.g., patterns that match
oauth/discord/github/webhook/api/integration for "integration" and
queue/database/redis/llm/infra/architecture for "architecture") so the keys and
patterns are consistent with the existing classification logic.
- Around line 9-19: The xxhash availability check is broken because import
xxhash is unconditional; move the import into the try/except and set the flag
based on whether the import succeeded: wrap "import xxhash" in a try block, on
success set _HAS_XXHASH = True, on ImportError set xxhash = None and _HAS_XXHASH
= False (and keep existing hashlib/json imports outside). Update any existing
references to _HAS_XXHASH and xxhash to rely on this flag so code falls back to
hashlib when xxhash is not installed.
- Around line 177-199: The _owner_fetch coroutine computes an unused variable
elapsed, and the created background task is neither tracked nor
cancellation-propagated. Modify cached_llm_call to: remove or use elapsed
(e.g., include it in logs); assign the created task to a variable (owner_task =
loop.create_task(_owner_fetch())); add a done callback on owner_task that calls
set_exception on the future if the task failed and the future is not done; and
wrap the await future in a try/except asyncio.CancelledError that cancels
owner_task (owner_task.cancel()), awaits owner_task to avoid orphaned
execution, then re-raises the cancellation. Keep the existing finally that pops
_inflight, and continue setting _cache and the future as currently implemented.
A sketch of this pattern follows below.
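A minimal sketch of what that fix could look like, assuming a simplified `cached_llm_call(key, fetch, _cache, _inflight)` signature; `fetch` stands in for the guarded LLM call, and the real code differs:

```python
import asyncio


async def cached_llm_call(key, fetch, _cache, _inflight):
    # Owner runs fetch(); followers await a shared future; failures and
    # cancellation propagate instead of being silently dropped.
    if key in _cache:
        return _cache[key]
    if key in _inflight:
        return await _inflight[key]  # follower path: share the owner's result

    loop = asyncio.get_running_loop()
    future: asyncio.Future = loop.create_future()
    _inflight[key] = future

    async def _owner_fetch():
        try:
            result = await fetch()
            _cache[key] = result
            if not future.done():
                future.set_result(result)
        finally:
            _inflight.pop(key, None)  # keep the existing cleanup behavior

    owner_task = loop.create_task(_owner_fetch())  # store the reference (RUF006)

    def _propagate_failure(task: asyncio.Task) -> None:
        # Surface owner failure/cancellation to waiters instead of dropping it.
        if future.done():
            return
        if task.cancelled():
            future.cancel()
        elif task.exception() is not None:
            future.set_exception(task.exception())

    owner_task.add_done_callback(_propagate_failure)

    try:
        return await future
    except asyncio.CancelledError:
        owner_task.cancel()  # don't leave an orphaned fetch running
        try:
            await owner_task
        except BaseException:
            pass  # the owner's own CancelledError is expected here
        raise  # re-raise the caller's cancellation
```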
🧹 Nitpick comments (3)
backend/app/classification/cache_helpers.py (1)
100-104: Consider logging the fallback for non-serializable params.

The broad exception catch on line 103 silently falls back to `str(params)`. While this defensive approach is appropriate, adding a debug log would help identify cases where params contain unexpected non-serializable types.

♻️ Suggested improvement

```diff
 try:
     params_blob = json.dumps(params or {}, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")
-except Exception:
+except (TypeError, ValueError) as e:
+    logger.debug(f"Params serialization fallback: {type(e).__name__}")
     params_blob = str(params).encode("utf-8")
```

backend/app/classification/classification_router.py (2)
53-57: Simplify context ID extraction.

The current logic is a bit convoluted: `or ""` followed by `if not ctx_id: ctx_id = None` is redundant, since an empty string is already falsy.

♻️ Simplified version

```diff
 ctx_id = None
 if context:
-    ctx_id = context.get("channel_id") or context.get("thread_id") or ""
-    if not ctx_id:
-        ctx_id = None
+    ctx_id = context.get("channel_id") or context.get("thread_id") or None
```
59-59: Consider extracting the duplicated params dict.

The params dict `{"temperature": 0.1}` is hardcoded both here and on line 88. Consider extracting it to a constant to ensure consistency.

♻️ Extract to constant

```diff
+_LLM_PARAMS = {"temperature": 0.1}
+
 class ClassificationRouter:
     ...
-    cached = get_cached_by_normalized(normalized, ctx_id, settings.classification_agent_model, {"temperature": 0.1})
+    cached = get_cached_by_normalized(normalized, ctx_id, settings.classification_agent_model, _LLM_PARAMS)
     ...
-    set_cached_by_normalized(normalized, ctx_id, settings.classification_agent_model, {"temperature": 0.1}, payload)
+    set_cached_by_normalized(normalized, ctx_id, settings.classification_agent_model, _LLM_PARAMS, payload)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/app/classification/cache_helpers.py
backend/app/classification/classification_router.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/classification/classification_router.py (1)
backend/app/classification/cache_helpers.py (4)
- normalize_message (201-205)
- is_simple_message (207-211)
- get_cached_by_normalized (129-132)
- set_cached_by_normalized (135-138)
backend/app/classification/cache_helpers.py (2)
backend/app/services/embedding_service/service.py (2)
- model (40-51)
- llm (54-67)
backend/app/core/orchestration/queue_manager.py (1)
- start (47-56)
🪛 Ruff (0.14.11)
backend/app/classification/cache_helpers.py
17-17: Do not catch blind exception: Exception
(BLE001)
103-103: Do not catch blind exception: Exception
(BLE001)
181-181: Local variable elapsed is assigned to but never used
Remove assignment to unused variable elapsed
(F841)
186-186: Consider moving this statement to an else block
(TRY300)
198-198: Store a reference to the return value of loop.create_task
(RUF006)
🔇 Additional comments (9)
backend/app/classification/cache_helpers.py (4)
21-24: LGTM! The cache configuration constants are reasonable: 4096 entries, 1-hour TTL, and a 10KB message limit for DoS protection.
82-87: LGTM! Good use of bounded `TTLCache` for both the result cache and in-flight deduplication. The 2-minute TTL on `_inflight` prevents memory leaks from orphaned entries.
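For reference, bounded caches like the ones this comment describes can be declared with `cachetools` along these lines (sizes taken from the constants noted in this review; the variable names are assumptions):

```python
from cachetools import TTLCache

# Result cache: bounded with per-entry expiry (4096 entries, 1-hour TTL).
_cache = TTLCache(maxsize=4096, ttl=3600)

# In-flight dedupe registry: short 2-minute TTL so orphaned futures cannot leak.
_inflight = TTLCache(maxsize=1000, ttl=120)
```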
201-211: LGTM! The normalization logic is solid: truncation for DoS protection, lowercasing and whitespace collapsing for cache hit improvement. The `is_simple_message` function correctly iterates patterns and returns a copy of the classification dict with the original message attached.

213-222: LGTM! Clean cache access wrappers with proper exception handling for missing keys.
backend/app/classification/classification_router.py (5)
1-16: LGTM! Appropriate imports added for the caching integration. The selective imports from `cache_helpers` keep the namespace clean.

20-21: LGTM! The semaphore with a limit of 10 concurrent LLM calls is a sensible default for rate limiting and cost control.

37-40: LGTM! Good early return for oversized messages. This prevents DoS via large payloads and aligns with the `MAX_MESSAGE_LENGTH` constant from `cache_helpers`.

71-74: LGTM! The semaphore correctly guards the LLM invocation to limit concurrent calls.

82-89: LGTM! The payload construction and caching logic is correct. The structure aligns with the `_fallback_triage` output format.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/app/classification/cache_helpers.py`:
- Around line 207-211: is_simple_message currently returns the first match from
_PATTERNS, which lets low-priority patterns like "greeting" shadow
higher-priority intents such as "action_request". Fix by prioritizing
actionable intents: either reorder _PATTERNS so "action_request" (and other
actionable patterns) appear before "greeting", or implement priority-based
matching by reading a priority field from _PATTERN_CLASSIFICATION (or a
separate priority map), iterating _PATTERNS to collect all matches, and
returning the match with the highest priority. Update is_simple_message to use
the chosen approach and ensure it still returns
dict(_PATTERN_CLASSIFICATION[name], original_message=normalized) for the
selected top-priority name. A sketch of the priority-based option follows
below.
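One way to implement the priority-based option, as a hedged sketch; the pattern regexes, priority values, and classification payloads below are illustrative, not the module's actual tables:

```python
import re

# Illustrative tables; the real module defines richer patterns.
_PATTERNS = {
    "greeting": re.compile(r"^(hi|hello|hey)\b"),
    "action_request": re.compile(r"\b(please|can you|create|fix|deploy)\b"),
}
_PRIORITY = {"action_request": 10, "greeting": 1}  # higher value wins

_PATTERN_CLASSIFICATION = {
    "greeting": {"intent": "greeting", "needs_llm": False},
    "action_request": {"intent": "action_request", "needs_llm": False},
}


def is_simple_message(normalized: str) -> dict | None:
    # Collect every matching pattern, then keep the highest-priority intent,
    # so "hi, can you fix the deploy?" classifies as action_request, not greeting.
    matches = [name for name, pattern in _PATTERNS.items() if pattern.search(normalized)]
    if not matches:
        return None
    top = max(matches, key=lambda name: _PRIORITY.get(name, 0))
    return dict(_PATTERN_CLASSIFICATION[top], original_message=normalized)
```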
♻️ Duplicate comments (2)
backend/app/classification/cache_helpers.py (2)
62-80: Orphan classification entries remain unaddressed. The `_PATTERN_CLASSIFICATION` dict still contains "integration" (line 68) and "architecture" (line 69) entries with no corresponding patterns in `_PATTERNS`. These classifications will never be matched by `is_simple_message()`.

177-199: Task management issues remain unaddressed. The previously flagged concerns still apply:
- `elapsed` (line 181) is computed but never used
- The task created on line 198 is not stored, risking unhandled exceptions being silently dropped
- Cancellation of the caller is not propagated to the owner task
🧹 Nitpick comments (1)
backend/app/classification/cache_helpers.py (1)
14-19: Narrow the exception type to `ImportError`.

The import structure is now correct, but catching bare `Exception` is overly broad and can mask unexpected errors. Use `ImportError` for import failures.

```diff
 try:
     import xxhash
     _HAS_XXHASH = True
-except Exception:
+except ImportError:
     xxhash = None
     _HAS_XXHASH = False
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/classification/cache_helpers.py
🧰 Additional context used
🪛 Ruff (0.14.11)
backend/app/classification/cache_helpers.py
17-17: Do not catch blind exception: Exception
(BLE001)
103-103: Do not catch blind exception: Exception
(BLE001)
181-181: Local variable elapsed is assigned to but never used
Remove assignment to unused variable elapsed
(F841)
186-186: Consider moving this statement to an else block
(TRY300)
198-198: Store a reference to the return value of loop.create_task
(RUF006)
🔇 Additional comments (7)
backend/app/classification/cache_helpers.py (7)
21-24: LGTM! Configuration constants are well-chosen: reasonable cache size, appropriate TTL, and a sensible message length limit for DoS protection.

26-58: LGTM! The pre-compiled regex patterns are well-structured for fast-path classification. The combination of anchored patterns for greetings/acknowledgments and flexible patterns for content detection is appropriate.

82-87: LGTM! Good use of `TTLCache` for both the main cache and in-flight deduplication. The TTL on `_inflight` prevents memory leaks from orphaned requests. The metrics counter is appropriate for observability.

91-112: LGTM! The cache key generation is well-designed with proper normalization, deterministic serialization, and appropriate fallbacks. The broad exception catch for the JSON serialization fallback is acceptable here since we want to handle any edge case gracefully.

115-138: LGTM! The key composition and cache access helper functions are well-structured with clear separation of concerns.

201-205: LGTM! Normalization is well-implemented with proper truncation, whitespace handling, and case normalization for consistent cache key generation.

213-222: LGTM! Cache access functions are simple and correct. Minor note: `cache_get` could use `_cache.get(key)` for the same behavior, but the current implementation is clear.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
… background tasks
Closes #241
📝 Description
This PR eliminates unnecessary LLM API calls for user messages in the Discord and GitHub channels by implementing a multi-layer caching and optimization system for message classification.
🔧 Changes Made
cache_helpers.py
- `MAX_MESSAGE_LENGTH` (10KB): truncates oversized messages before processing to prevent DoS attacks and excessive memory usage (patches security issues)
- Replaced the `_inflight` dict with `TTLCache(maxsize=1000, ttl=120)` to prevent memory leaks from orphaned requests
- Added `asyncio.CancelledError` handling to prevent request hangs when tasks are cancelled
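A rough sketch of the length guard and normalization described above (the constant matches this PR; the function body approximates the behavior the reviewers confirmed, not the exact diff):

```python
MAX_MESSAGE_LENGTH = 10_000  # ~10KB guard against oversized payloads


def normalize_message(message: str) -> str:
    # Truncate first (DoS protection), then collapse whitespace and lowercase
    # so near-identical messages map to the same cache key.
    truncated = message[:MAX_MESSAGE_LENGTH]
    return " ".join(truncated.split()).lower()
```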
classification_router.py
- Added `asyncio.Semaphore(10)` to limit concurrent LLM API calls, preventing rate limit errors and cost explosions during traffic spikes
- Early return for messages exceeding `MAX_MESSAGE_LENGTH` with fallback classification

Impact
Optimization Summary
✅ Checklist
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.