Your first cached LLM¶
The killer use case: an LLM call wrapped in mneme, so repeated and paraphrased queries are answered from a cache in microseconds to milliseconds instead of waiting seconds on the model.
The pattern¶
def classify(query: str) -> str:
    """Customer-support intent classifier with caching."""
    hit = cache.get(query, namespace="support")
    if hit is not None:
        return hit.response  # cache hit (exact or semantic) - milliseconds, not seconds
    intent = call_llm(query)  # ~1-3 s - the expensive part
    cache.put(query, intent, namespace="support")
    return intent
That's it. Three lines wrap any deterministic LLM call. The only thing you change is what call_llm does - classify, summarize, route, score, etc.
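If several functions need the same wrapping, the get / call / put steps can be factored into a decorator. The sketch below is illustrative only: `cached`, `DictCache`, and `Hit` are hypothetical names, not part of mneme, and `DictCache` is an exact-match stand-in that mimics only the `get` / `put` calls used above so the example runs on its own.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional


@dataclass
class Hit:
    """Hypothetical stand-in for a cache hit; only `.response` is used."""
    response: str


class DictCache:
    """Hypothetical exact-match cache exposing the get/put calls from the pattern."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}

    def get(self, query: str, namespace: str) -> Optional[Hit]:
        value = self._store.get((namespace, query))
        return Hit(value) if value is not None else None

    def put(self, query: str, response: str, namespace: str) -> None:
        self._store[(namespace, query)] = response


def cached(cache, namespace: str) -> Callable:
    """Wrap a str -> str function with the get / call / put pattern."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        @wraps(fn)
        def wrapper(query: str) -> str:
            hit = cache.get(query, namespace=namespace)
            if hit is not None:
                return hit.response
            result = fn(query)  # the expensive call runs only on a miss
            cache.put(query, result, namespace=namespace)
            return result
        return wrapper
    return decorator


calls: list[str] = []


@cached(DictCache(), namespace="support")
def fake_llm(query: str) -> str:
    calls.append(query)  # record every "expensive" call
    return "account"


fake_llm("How do I reset my password?")
fake_llm("How do I reset my password?")  # served from the cache
```

After the two calls, `calls` holds a single entry. Swapping `DictCache()` for a real cache object with the same `get` / `put` shape should leave the decorator unchanged.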
A concrete example¶
A real intent classifier against a local LLM (Ollama):
import json

import numpy as np
import requests

from mneme import SemanticCache


class SentenceTransformersEmbedder:
    """See the 'Bring your own embedder' page for the full version."""
    # ... (omitted for brevity)


def call_llm(query: str) -> str:
    """Classify an inbound customer-support message into one of seven intents."""
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "nemotron-3-nano:latest",
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "You are an intent classifier. Reply with JSON "
                        '{"intent": "<one of: billing, technical, refund, account, '
                        'how_to, complaint, other>"}. No other text.'
                    ),
                },
                {"role": "user", "content": query},
            ],
            "stream": False,
            "think": False,
            "format": "json",
            "options": {"temperature": 0.0, "num_predict": 40},
        },
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    return json.loads(body["message"]["content"])["intent"]


with SemanticCache(
    path="cache.db",
    embedder=SentenceTransformersEmbedder(),
    similarity_threshold=0.65,  # tune for your embedder + corpus
) as cache:

    def classify(query: str) -> str:
        hit = cache.get(query, namespace="support")
        if hit is not None:
            return hit.response
        intent = call_llm(query)
        cache.put(query, intent, namespace="support")
        return intent

    print(classify("How do I reset my password?"))  # 1st call - ~500 ms (LLM)
    print(classify("How do I reset my password?"))  # 2nd call - <1 ms (exact)
    print(classify("I forgot my password"))         # paraphrase - ~25 ms (semantic)
For a fully runnable version with a 73-message corpus, a Flask UI, and live metrics, see the showcase.
What you get¶
| Hit type | Typical latency | What's running |
|---|---|---|
| Exact (Layer 1) | ~50 µs to 1 ms (depends on store) | One normalized SHA-256 + one store lookup |
| Semantic (Layer 2) | 1–10 ms at 100k entries | One embedding + one matvec across the in-memory matrix |
| Miss | Whatever your LLM costs (1–3 s) | LLM call + insert |
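
To make the "What's running" column concrete, here is a rough sketch of the two lookups in pure Python. It is not mneme's implementation: the normalization rules (lowercasing, whitespace collapsing) are assumptions, the function names are hypothetical, and a real Layer 2 would run one vectorized matvec over the in-memory matrix rather than a Python loop.

```python
import hashlib
import math
import re


def exact_key(query: str) -> str:
    """Layer 1 key: normalize, then SHA-256. The normalization here is assumed."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_semantic_match(query_vec, stored, threshold):
    """Layer 2: best (response, score) at or above `threshold`, else None.

    `stored` is a list of (vector, response) pairs.
    """
    best = None
    for vec, response in stored:
        score = cosine(query_vec, vec)
        if score >= threshold and (best is None or score > best[1]):
            best = (response, score)
    return best


# Two spellings of the same question map to the same Layer 1 key.
same = exact_key("  How do I  RESET my password? ") == exact_key("how do i reset my password?")

# Toy 2-d "embeddings": the query is close to the first entry only.
entries = [([0.9, 0.1], "account"), ([0.0, 1.0], "billing")]
match = best_semantic_match([1.0, 0.0], entries, threshold=0.65)
```

The threshold check is what separates a semantic hit from a miss, which is why the value passed as similarity_threshold matters so much.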
Consider a workload that warms the cache well - say a chatbot that classifies 1000 customer-support messages with high paraphrase overlap. After the first ~50 requests, it hits Layer 2 on 60–80% of requests. That's on the order of 600–800 LLM calls avoided, saving both latency and tokens.
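
The savings arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch under stated assumptions: every warm-up request is counted as a miss, the LLM costs a flat 2 s per call, and a cache hit costs 10 ms. `estimate_savings` is a hypothetical helper, not a mneme function.

```python
def estimate_savings(total_requests, warmup_requests, hit_rate,
                     llm_seconds, hit_seconds=0.01):
    """Return (calls_avoided, seconds_saved) for a given post-warm-up hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    warm_requests = max(total_requests - warmup_requests, 0)
    calls_avoided = round(warm_requests * hit_rate)
    # Each avoided call saves the LLM latency minus the cost of the cache hit.
    seconds_saved = calls_avoided * (llm_seconds - hit_seconds)
    return calls_avoided, seconds_saved


low = estimate_savings(1000, 50, 0.60, llm_seconds=2.0)
high = estimate_savings(1000, 50, 0.80, llm_seconds=2.0)
```

Under these assumptions `low` and `high` avoid 570 and 760 calls respectively, slightly below the round 600–800 figure because the 50 warm-up requests are treated as misses.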
Failure modes¶
- Embedder fails on get. Treated as a miss + WARNING. No exception bubbles up; the call falls through to call_llm. This is the right default - observability cares, but the user doesn't.
- Embedder fails on put. The exception propagates. You wanted to cache but couldn't; that's a real failure.
- LLM fails. Your problem, not mneme's. The cache is uninvolved.
- call_llm returns garbage (e.g. "[ERROR]", empty string). The default Validator rejects empty / [ERROR] / [LLM Error] strings on put. Plug in your own validator= for stricter rules. See Confidence and validators.
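
For a classifier with a closed label set, one way to stop garbage before it ever reaches the cache is to validate the model output on the caller's side. The sketch below does not use mneme's validator= hook, whose signature is not shown on this page; `parse_intent` and `ALLOWED_INTENTS` are hypothetical names, and the intent list is copied from the system prompt in the example above.

```python
import json
from typing import Optional

# The seven intents named in the system prompt of the concrete example.
ALLOWED_INTENTS = {
    "billing", "technical", "refund", "account",
    "how_to", "complaint", "other",
}


def parse_intent(raw_content: str) -> Optional[str]:
    """Return a known intent, or None if the model output is unusable."""
    try:
        payload = json.loads(raw_content)
    except (json.JSONDecodeError, TypeError):
        return None  # not JSON at all, e.g. "[ERROR]" or an empty string
    if not isinstance(payload, dict):
        return None
    intent = payload.get("intent")
    if isinstance(intent, str) and intent in ALLOWED_INTENTS:
        return intent
    return None  # valid JSON, but not one of the expected labels


good = parse_intent('{"intent": "refund"}')
bad = parse_intent("[ERROR]")
```

A caller could then skip cache.put whenever `parse_intent` returns None, so only well-formed labels are stored.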
Calibrate before you ship¶
The default similarity_threshold=0.85 is a starting point, not a recommendation. Calibrate against your own embedder + paraphrase corpus before going to production:
from mneme.tools.calibrate import find_threshold

result = find_threshold(
    paraphrase_pairs=labeled_paraphrases,  # list[(query_a, query_b)]
    distractor_pairs=labeled_distractors,  # list[(query_a, query_b)] - should NOT match
    embedder=embedder,
    target_metric="f1",
    min_precision=0.95,      # never accept below 95% precision
    vector_dtype="float16",  # calibrate against the same dtype prod uses
)
print(result.threshold, result.precision, result.recall, result.f1)
Wire the result into your SemanticCache(similarity_threshold=...) call. See Calibration for the full walkthrough.
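
For intuition about what such a calibration does, here is a minimal threshold sweep over precomputed similarity scores. It is a sketch of the general idea, not the implementation of find_threshold: `sweep_threshold` is a hypothetical helper, and the scores are made-up toy values.

```python
def sweep_threshold(paraphrase_scores, distractor_scores, min_precision=0.95):
    """Pick the threshold with the best F1 among those meeting min_precision.

    Returns (threshold, precision, recall, f1), or None if no threshold qualifies.
    """
    best = None
    # Every observed score is a candidate cut-off.
    for t in sorted(set(paraphrase_scores) | set(distractor_scores)):
        tp = sum(s >= t for s in paraphrase_scores)  # paraphrases accepted
        fp = sum(s >= t for s in distractor_scores)  # distractors wrongly accepted
        fn = len(paraphrase_scores) - tp             # paraphrases rejected
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if precision >= min_precision and (best is None or f1 > best[3]):
            best = (t, precision, recall, f1)
    return best


# Toy similarity scores: paraphrase pairs should match, distractor pairs should not.
paraphrases = [0.9, 0.8, 0.7, 0.6]
distractors = [0.65, 0.4, 0.3]
result = sweep_threshold(paraphrases, distractors)
```

With these toy scores the sweep settles on 0.7: lowering the threshold to 0.6 would recover the last paraphrase but also admit the 0.65 distractor, dropping precision below the floor.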
Where to go next¶
- Layered cache - the two-layer story in depth.
- Calibration - find the right threshold.
- Showcase - the same pattern, packaged with a Flask UI and 73-message corpus.
- Multi-tenant - namespaces and per-tenant LRU quotas.
- Metrics - track hit rate and LLM seconds saved over time.