Layered cache¶
mneme is two caches stacked. Every get() first tries an O(1) exact-match lookup; only if that misses does it embed the query and run a similarity search.
flowchart TD
Q([get query]) --> N[normalize<br/>strip + collapse + casefold]
N --> H[SHA-256 hash]
H --> L1{exact match<br/>in store?}
L1 -- yes --> H1[Layer 1 hit<br/>~50 µs - 1 ms]
L1 -- no --> E[embed via Embedder]
E --> S[matvec across<br/>in-memory index]
S --> L2{best similarity<br/>≥ threshold?}
L2 -- yes --> H2[Layer 2 hit<br/>1 - 10 ms]
L2 -- no --> M[miss<br/>caller calls LLM]
style H1 fill:#0a3,color:#fff
style H2 fill:#06b,color:#fff
style M fill:#a40,color:#fff
Layer 1: exact match¶
The query is normalized:
- Strip leading/trailing whitespace.
- Collapse internal whitespace runs to single spaces.
- Casefold (Unicode-aware lowercasing).
The normalized string is then SHA-256 hashed. The store does an O(1) lookup on (namespace, query_hash) - usually a primary-key or unique-index hit.
This catches all the trivial duplication: the same user phrasing the same question twice, the same canned form being submitted repeatedly, etc. It costs almost nothing in latency and nothing in accuracy: there are no false positives, because only literal duplicates (after normalization) hit Layer 1.
mneme returns Hit(layer="exact", similarity=1.0, ...) for these.
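The normalization and hashing steps above can be sketched with the standard library alone. This is a minimal illustration, not mneme's actual code: the function names `normalize` and `query_hash` are hypothetical, and UTF-8 encoding and a hex digest are assumptions.

```python
import hashlib


def normalize(query: str) -> str:
    # str.split() with no arguments drops leading/trailing whitespace and
    # splits on any whitespace run, so joining with single spaces both
    # strips and collapses. casefold() is Unicode-aware lowercasing.
    return " ".join(query.split()).casefold()


def query_hash(query: str) -> str:
    # Assumption: UTF-8 bytes and a hex digest; the real key format may differ.
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
```

Under this sketch, `"  How do I   CANCEL? "` and `"how do i cancel?"` produce the same hash, so the second would be a Layer 1 hit for an entry stored under the first.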
Layer 2: semantic match¶
If Layer 1 misses, the cache embeds the query and asks the in-memory index for the top-k closest cached vectors. Cosine similarity is used; vectors are L2-normalized so cosine is just a dot product.
If the best match scores at or above similarity_threshold (default 0.85), it's a Layer 2 hit. Otherwise: miss.
hit = cache.get("How do I cancel?")
if hit and hit.layer == "semantic":
    print(f"matched cached entry at similarity {hit.similarity:.3f}")
The threshold is the central knob. Too high → many useful paraphrases are misses. Too low → unrelated queries collide. Calibrate it against your own corpus. See Calibration.
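The search-and-threshold step in this section can be sketched in pure Python. This is an illustration under stated assumptions, not mneme's implementation: `l2_normalize` and `best_match` are hypothetical names, the index is a plain list of already-normalized vectors, and a real index would more likely use a vectorized matrix-vector product.

```python
import math


def l2_normalize(v):
    # Assumes a non-zero vector; a zero vector would divide by zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def best_match(query_vec, index, threshold=0.85):
    """Return (row_index, similarity) for the best cached vector, or None on a miss.

    `index` is a list of L2-normalized vectors, so each dot product
    below is the cosine similarity.
    """
    if not index:
        return None
    q = l2_normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, row)) for row in index]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] >= threshold:
        return best, scores[best]
    return None
```

With an index holding the two unit vectors `[1, 0]` and `[0, 1]`, a query of `[0.9, 0.1]` clears the default threshold against the first row, while `[1, 1]` scores about 0.707 against both and is a miss.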
What happens on a miss¶
A miss returns None. The caller is expected to compute the answer (typically an LLM call) and cache.put() the result. A later paraphrase of the same intent can then be a Layer 2 hit, provided it scores at or above the threshold.
hit = cache.get(query)
if hit is None:
    response = call_llm(query)
    cache.put(query, response)
else:
    response = hit.response
This is the canonical pattern. See Your first cached LLM.
Why two layers¶
Layer 2 alone would work - every query would be embedded and matched. But:
- Embedding has cost. Even a fast local model is 1–5 ms per embed; an OpenAI call is 100+ ms. Layer 1 short-circuits the whole pipeline for trivial duplicates.
- Hash matches are trustworthy. SHA-256 over a normalized query is deterministic and collisions are practically impossible; if two queries produce the same hash, they normalized to the same string and really are the same query.
- Most production traffic is paraphrase-poor. Real chatbot logs show 30–60% of queries are exact duplicates after normalization, before any semantic logic kicks in. Skipping the embed for those queries is a big throughput win.
Layer 1 is conservative; Layer 2 is permissive. Together they cover the realistic distribution of duplicate-ish queries with the right cost trade-off at each level.
Hit object¶
Every cache hit returns a Hit:
@dataclass(frozen=True, slots=True)
class Hit:
    response: str
    similarity: float         # 1.0 for exact; the cosine score for semantic
    confidence: float         # weighted by age, validator, custom fn
    age_seconds: int          # how long ago the entry was inserted
    layer: HitLayer           # "exact" or "semantic"
    namespace: str
    metadata: dict[str, Any]  # whatever you put on insert
The confidence lets you decide whether to trust a hit. Default is a 24-hour half-life - confidence drops by half every day since insert. Pass your own confidence_fn= to SemanticCache(...) if you want different decay or your own scoring. See Confidence and validators.
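The 24-hour half-life described above corresponds to a simple exponential decay. The sketch below shows only the age term; the function name `half_life_confidence` is hypothetical, and the signature mneme expects for `confidence_fn=` is not shown in this page, so treat this as arithmetic rather than a drop-in callback.

```python
HALF_LIFE_SECONDS = 24 * 60 * 60  # 24-hour half-life, as described above


def half_life_confidence(age_seconds: float,
                         half_life: float = HALF_LIFE_SECONDS) -> float:
    # Confidence halves once per elapsed half-life:
    # 1.0 at insert, 0.5 after one day, 0.25 after two days.
    return 0.5 ** (age_seconds / half_life)
```

A caller could compare a value like this against its own cutoff and treat low-confidence hits as misses.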
Updating an entry¶
Calling cache.put() with a query that already exists in the cache (same namespace + same normalized hash) replaces the existing entry. The new vector, response, and metadata overwrite the old ones; the row id stays the same.
cache.put("how do I cancel", "Settings → Subscription → Cancel")
# ... time passes, you change the answer ...
cache.put("how do I cancel", "Account → Manage Plan → End Subscription")
hit = cache.get("how do I cancel")
assert hit.response == "Account → Manage Plan → End Subscription"
This is the desired behavior for keeping cached LLM answers fresh: re-put with the same query and the new response is what subsequent paraphrases get back.
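The replace-on-put behavior can be pictured as an upsert keyed on (namespace, query_hash). The toy store below is a hypothetical stand-in, not mneme's storage backend: `ToyStore` and `Row` are invented for illustration, and vectors and metadata are left out to keep it short.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    id: int
    response: str


class ToyStore:
    """Hypothetical in-memory stand-in; not mneme's implementation."""

    def __init__(self) -> None:
        self._rows = {}      # (namespace, query_hash) -> Row
        self._next_id = 1

    @staticmethod
    def _key(namespace: str, query: str):
        normalized = " ".join(query.split()).casefold()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        return namespace, digest

    def put(self, namespace: str, query: str, response: str) -> int:
        key = self._key(namespace, query)
        row = self._rows.get(key)
        if row is None:
            # First insert for this key: allocate a new row id.
            row = Row(id=self._next_id, response=response)
            self._rows[key] = row
            self._next_id += 1
        else:
            # Existing key: overwrite in place, keep the row id.
            row.response = response
        return row.id

    def get(self, namespace: str, query: str) -> Optional[str]:
        row = self._rows.get(self._key(namespace, query))
        return row.response if row else None
```

Putting the same normalized query twice in one namespace returns the same id with the newer response, while the same query in a different namespace gets its own row.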
Bypassing the cache¶
For testing or debugging:
The showcase's "Same query, no cache" button uses this to time a cached call and a non-cached call side by side.
Where to go next¶
- Embedders - Layer 2 only works as well as your embedder.
- Multi-tenant - Layer 2 search is namespace-scoped.
- Calibration - pick the right similarity_threshold.
- Performance tuning - Layer 2 latency at scale.