Performance tuning¶
A practical guide to making mneme faster, or to knowing when you've hit the ceiling for a given backend / dtype combo. The Performance baseline page has the measured numbers; this page covers what to do about them.
The latency budget¶
For a typical get:
For a typical put:
Each step has its own bottleneck. Tune the ones that dominate your workload.
Layer 1 (exact match)¶
Dominated by the store's read path - usually a primary-key lookup.
| Backend | p99 cost @ 100k entries | Bottleneck |
|---|---|---|
| MemoryStore | <50 µs | Python dict |
| SQLiteStore | ~2 ms | The UPDATE last_accessed_at per get (LRU bookkeeping) |
| RedisStore | ~0.5–1 ms | Network RTT + HGETALL + ZADD |
| PostgresStore | ~1–3 ms | Network RTT + UPDATE last_accessed_at |
| DynamoDBStore | ~2–5 ms | Network RTT + UpdateItem |
Tunes:
- Drop `update_access` if you don't need cross-process LRU. Subclass the store and make it a no-op; `get` drops to <100 µs on SQLite. You lose accurate LRU eviction in stale-tolerant mode.
- Use a connection pool for network stores. Per-call connection setup is the bulk of the latency on cold paths.
- Run the cache process near the store. Same AZ for Redis/Postgres/DynamoDB; same host (Unix socket) for Postgres/Redis when possible.
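A minimal sketch of the no-op override from the first tune. `update_access` is the hook named above, but its real signature isn't shown on this page, so the override accepts any arguments. `StandInStore` is a hypothetical placeholder that exists only so the example runs; in practice you would subclass the store you actually use.

```python
class StandInStore:
    """Hypothetical stand-in for a real store; only here so the sketch runs."""

    def __init__(self):
        self.access_updates = 0
        self._rows = {}

    def update_access(self, key):
        # In a real store this is the per-get write (e.g. UPDATE last_accessed_at).
        self.access_updates += 1

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        value = self._rows.get(key)
        if value is not None:
            self.update_access(key)
        return value


class NoAccessUpdateStore(StandInStore):
    """Same store, minus the LRU bookkeeping write on every get."""

    def update_access(self, *args, **kwargs):
        # Assumed signature: accept anything, do nothing.
        return None


store = NoAccessUpdateStore()
store.put("k", "v")
hit = store.get("k")  # served without touching the access timestamp
```

The trade-off is the one stated in the list: with the write gone, eviction order no longer reflects real access recency.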
Layer 2 (semantic match)¶
Dominated by the matvec across the in-memory index.
For 100k × 1536:
- float32: 614 MB read → ~6 ms at 100 GB/s memory bandwidth
- float16: 307 MB read + cast → ~10 ms (cast adds work)
- int8: 154 MB read + 614 MB cast write → ~50 ms (the cast dominates, not the matmul)
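The read term in those figures is plain arithmetic, sketched below. The helper name `scan_ms` and the 100 GB/s default are illustrative assumptions. The function only models streaming the index from RAM once, so the cast costs for float16 and int8 come on top of what it returns.

```python
def scan_ms(n_entries, dim, bytes_per_value, bandwidth_gb_s=100.0):
    """Lower-bound time in ms to stream the whole index from RAM once."""
    total_bytes = n_entries * dim * bytes_per_value
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3


fp32_ms = scan_ms(100_000, 1536, 4)  # 614.4 MB of float32
fp16_ms = scan_ms(100_000, 1536, 2)  # 307.2 MB, before the cast
int8_ms = scan_ms(100_000, 1536, 1)  # 153.6 MB, before the cast
```

For float32 the read alone accounts for the ~6 ms above. For the narrower dtypes the read gets cheaper while the total gets slower, which is why the cast is the thing to watch.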
Tunes:
- Pick the right `vector_dtype`. float32 is fastest at matvec time; int8 is smallest in memory but slowest on pure NumPy. See Quantization.
- Know where NumPy's comfort zone ends. Search latency is `n × d × bytes / memory_bandwidth`. The `_AUTO_HNSW_THRESHOLD = 500_000` in `_cache.py` is a single-number heuristic that targets ~10–15 ms p99 at d=768 on baseline desktop hardware. Your actual ceiling depends on dim and RAM speed:
| Dim | NumPy fp32 stays under ~10 ms p99 to about |
|---|---|
| 384 | 1.5–2 M entries |
| 768 | 500 k entries |
| 1024 | 350 k entries |
| 1536 | 200 k entries |
| 3072 | 100 k entries |
Apple Silicon and other wide-bandwidth hardware push every row up by 2–3×; older laptops with DDR4 push them down. Measure your own.
- Switch to hnsw when NumPy's latency stops fitting. `index_backend="hnsw"` does approximate-NN in O(log n) instead of a full scan. Sub-millisecond search at 1M+ entries regardless of dim:
SemanticCache(..., index_backend="hnsw", index_options={
    "M": 16,                 # graph degree; 16 is a good default
    "ef_construction": 200,  # build-time accuracy
    "ef": 64,                # query-time accuracy (higher = slower + more accurate)
})
hnsw builds the graph incrementally on each put; put latency rises slightly. Recall is ~99% at defaults - exact matches on the corpus's true nearest neighbor stay well above the 0.85 default threshold.
`index_backend="auto"` picks NumPy below 500 k entries, hnsw above. Falls back to NumPy + WARNING if the `[hnsw]` extra isn't installed. The 500 k cutoff is the same single-number heuristic - for dim outside the 768 sweet spot, force the backend explicitly:
SemanticCache(..., index_backend="numpy") # at d=384 with 800 k entries - fine
SemanticCache(..., index_backend="hnsw") # at d=1536 with 250 k entries - better than auto
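To find where the full scan stops fitting your budget on your own hardware, a rough benchmark like the one below can help. It is a sketch, not mneme's search path: it times a bare float32 matvec plus `argmax` on random data, and the function name and defaults are assumptions. The real index may do extra work (normalization, top-k, filtering), so treat the result as a floor.

```python
import time

import numpy as np


def matvec_p99_ms(n_entries, dim, repeats=50, seed=0):
    """p99 wall-clock time in ms for one brute-force scan of an n x d float32 index."""
    rng = np.random.default_rng(seed)
    index = rng.standard_normal((n_entries, dim), dtype=np.float32)
    query = rng.standard_normal(dim, dtype=np.float32)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        scores = index @ query  # full scan: one dot product per entry
        scores.argmax()         # pick the best match
        samples.append((time.perf_counter() - start) * 1e3)
    return float(np.percentile(samples, 99))


# Small size so the sketch runs quickly; raise n_entries toward your real corpus size.
p99 = matvec_p99_ms(n_entries=10_000, dim=768)
```

Run it at a few sizes for your embedding dim and note where the p99 crosses your latency budget; that is the point at which forcing `index_backend="hnsw"` is worth considering.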
Embedder¶
For each Layer-2 get and every put:
| Embedder | Latency per call |
|---|---|
| sentence-transformers MiniLM (CPU) | 3–10 ms |
| sentence-transformers (GPU) | <1 ms |
| OpenAI text-embedding-3-small | 50–150 ms (network) |
| Local Ollama nomic-embed-text | 5–20 ms (localhost) |
| AWS Bedrock Titan | 50–150 ms (cross-AZ network) |
Tunes:
- Pre-embed in batches when you know the queries. Pass `embedding=v` to `get`/`put` to skip the embedder. Useful when you have a corpus and want to warm the cache offline.
- Run the embedder co-located. Local model > network call. If you must use a hosted embedder, pin to the same region as your app.
- Skip the embedder on Layer-1 hits - the cache already does this. Make sure your monitoring distinguishes Layer 1 from Layer 2 latency so you don't optimize the wrong path.
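The offline warm-up from the first tune might look like the sketch below. Only the `embedding=` keyword comes from this page. The positional `put(query, response, ...)` shape, the `embed_batch` callable, and the `warm_cache` helper are assumptions, and `RecordingCache` and `fake_embed_batch` are stand-ins so the example runs without a real cache or model.

```python
def warm_cache(cache, pairs, embed_batch, batch_size=64):
    """Embed queries in batches, then put each pair with its precomputed vector."""
    stored = 0
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        # One embedder call per batch instead of one per entry.
        vectors = embed_batch([query for query, _ in chunk])
        for (query, response), vector in zip(chunk, vectors):
            # Assumed call shape: put(query, response, embedding=...)
            cache.put(query, response, embedding=vector)
            stored += 1
    return stored


class RecordingCache:
    """Stand-in cache that records what it was asked to store."""

    def __init__(self):
        self.calls = []

    def put(self, query, response, embedding=None):
        self.calls.append((query, response, embedding))


embed_calls = []


def fake_embed_batch(texts):
    """Stand-in embedder: one fake 1-d vector per text."""
    embed_calls.append(len(texts))
    return [[float(len(text))] for text in texts]


cache = RecordingCache()
pairs = [(f"q{i}", f"a{i}") for i in range(5)]
stored = warm_cache(cache, pairs, fake_embed_batch, batch_size=2)
```

With a hosted embedder the saving is mostly round trips: five entries cost three calls here instead of five.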
Open time¶
Cold-start cost for the cache: read every entry from the store, push them into the in-memory index.
| Backend | Open time @ 100k/d=768 |
|---|---|
| MemoryStore | n/a (state lives in process) |
| SQLiteStore | ~300 ms |
| RedisStore | ~500 ms (network) |
| PostgresStore | ~400 ms |
| DynamoDBStore | ~3–10 s (Scan is slow) |
Tunes:
- Don't reopen often. The cache is meant to live for the lifetime of the process.
- For SQLite: the `iter_index_rows()` fast path skips JSON metadata parsing during rebuild - already enabled. Pushing under 100 ms needs a binary blob format or memory-mapped store; not in v1.
- For DynamoDB at scale: open time can be unbounded. Consider a per-tenant cache file with a smaller working set.
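One common way to follow the "don't reopen often" advice is a process-wide accessor that opens the cache on first use and hands back the same object afterwards. The sketch below uses `functools.lru_cache` for that. `open_cache` is a hypothetical factory standing in for your real `SemanticCache(...)` construction.

```python
from functools import lru_cache

opens = []  # records each time the expensive open actually runs


def open_cache():
    """Hypothetical factory; replace the body with your real cache construction."""
    opens.append(1)
    return object()


@lru_cache(maxsize=None)
def get_cache():
    # First call pays the open cost; later calls return the same instance.
    return open_cache()


first = get_cache()
second = get_cache()
```

Request handlers then call `get_cache()` instead of constructing a cache themselves, so the cold-start cost in the table is paid once per process.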
Eviction¶
Triggered when a put would breach `max_entries` or a `namespace_quotas` cap. Eviction removes entries in batches of 10% of the cap (min 1) at a time.
For cap=10k, a triggering put deletes 1000 entries. On SQLite that costs ~40 ms at p99.
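The batch-size rule can be written down directly. This is an illustration of the stated rule, not the library's code: the function name is made up, and rounding down for caps that aren't multiples of 10 is an assumption.

```python
def eviction_batch_size(cap):
    """10% of the cap, never less than one entry (floor rounding assumed)."""
    return max(1, cap // 10)


batch_10k = eviction_batch_size(10_000)  # the cap=10k case from the text
batch_tiny = eviction_batch_size(5)      # small caps still evict one entry
```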
Tunes:
- Raise `max_entries` if you have RAM + disk for it. Eviction overhead drops; hit rate rises.
- Set per-namespace `namespace_quotas` instead of one global cap. Per-tenant eviction keeps small tenants from being squeezed by big ones.
- Run `cache.vacuum()` periodically to clear TTL'd entries before they hit eviction. Cheaper than reactive eviction at scale.
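A background thread is one simple way to run `cache.vacuum()` on a schedule. The sketch below assumes `vacuum()` takes no arguments and is safe to call from a second thread, neither of which this page confirms. `CountingCache` is a stand-in so the example runs on its own.

```python
import threading
import time


def start_vacuum_loop(cache, interval_s):
    """Call cache.vacuum() every interval_s seconds until the returned event is set."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            cache.vacuum()

    thread = threading.Thread(target=loop, name="cache-vacuum", daemon=True)
    thread.start()
    return stop, thread


class CountingCache:
    """Stand-in cache that only counts vacuum calls."""

    def __init__(self):
        self.vacuums = 0

    def vacuum(self):
        self.vacuums += 1


demo_cache = CountingCache()
stop, thread = start_vacuum_loop(demo_cache, interval_s=0.01)
time.sleep(0.2)  # let a few intervals elapse
stop.set()
thread.join(timeout=1.0)
```

In a real service the interval would be minutes rather than milliseconds, and the `stop` event gives you a clean shutdown hook.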
Stale-tolerant polling¶
When `multi_process_mode="stale-tolerant"`, the cache polls `version_counter` every `stale_check_interval` seconds.
Tunes:
- Smaller interval = fresher reads, more counter checks. 0.1 s for tight consistency, 1 s for relaxed.
- Bigger interval = more pending changes per check. Above a threshold, the cache full-rebuilds instead of applying deltas. Tune the interval to keep deltas small relative to your insert rate.
- For mostly-readonly fleets: 5–10 s is fine. The cache lags reality but recovers eventually.
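The interval trade-off is easy to put numbers on. The helpers below are a back-of-the-envelope sketch. Their names are made up, and `max_pending` is a hypothetical stand-in for the rebuild threshold, whose actual value this page doesn't state.

```python
def pending_changes_per_poll(inserts_per_second, stale_check_interval):
    """Average number of changes that accumulate between two counter checks."""
    return inserts_per_second * stale_check_interval


def longest_interval(inserts_per_second, max_pending):
    """Largest interval (seconds) that keeps the average delta at or under max_pending."""
    return max_pending / inserts_per_second


delta = pending_changes_per_poll(inserts_per_second=200, stale_check_interval=0.5)
interval = longest_interval(inserts_per_second=200, max_pending=1000)
```

At 200 inserts per second, a 0.5 s interval means about 100 changes per check, and staying under 1000 pending changes allows an interval of up to 5 s. Bursty writers need more headroom than these averages suggest.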
hnsw tuning¶
If you've moved to `index_backend="hnsw"`:
- `M` (graph degree): higher = better recall, more memory. 16 default; 32 for large indices.
- `ef_construction`: build-time accuracy. 200 default; 400 for higher-recall indices.
- `ef` (query-time): exposed via `index_options={"ef": 64}` and adjustable per-query through the index. Higher = more accurate, slower. Sweep this against your calibration corpus.
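An `ef` sweep boils down to comparing approximate results against known true neighbors at each setting. The sketch below is deliberately generic because the per-query `ef` API isn't shown on this page: `search(query, ef)` is a hypothetical callable you would wire to your index, and `fake_search` with its lookup tables is a toy so the example runs.

```python
def sweep_ef(search, queries, true_ids, ef_values):
    """Return {ef: recall@1}, comparing search results to known true neighbors."""
    results = {}
    for ef in ef_values:
        hits = sum(
            1 for query, truth in zip(queries, true_ids) if search(query, ef) == truth
        )
        results[ef] = hits / len(queries)
    return results


# Toy setup: each query is only answered correctly once ef reaches its "needed" value.
needed = {"a": 8, "b": 32, "c": 64, "d": 128}
truth = {"a": 0, "b": 1, "c": 2, "d": 3}


def fake_search(query, ef):
    return truth[query] if ef >= needed[query] else -1


queries = list(needed)
recall = sweep_ef(fake_search, queries, [truth[q] for q in queries], [16, 64, 128])
```

On a real calibration corpus, compute the true neighbors once with the exact NumPy backend, then pick the smallest `ef` whose recall meets your target.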
Profiling¶
For workload-specific tuning:
import cProfile, pstats

with cProfile.Profile() as pr:  # context-manager form needs Python 3.8+
    for _ in range(1000):
        cache.get(query)
pstats.Stats(pr).sort_stats("cumulative").print_stats(20)  # top 20 frames by cumulative time
Look for embed, matvec, store calls in the top frames. The hot path tells you where to optimize.
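cProfile shows where the time goes but adds overhead of its own, so it is worth pairing it with plain wall-clock percentiles that can be compared against the p99 figures on this page. The helper below is a sketch with an assumed name. It times any zero-argument callable, and the example measures a dummy workload rather than a real cache.

```python
import statistics
import time


def latency_percentiles(fn, repeats=1000):
    """Time fn() repeatedly and return p50/p99 wall-clock latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points: cuts[i] is percentile i+1
    return {"p50_ms": cuts[49], "p99_ms": cuts[98]}


# Dummy workload; in practice pass something like `lambda: cache.get(query)`.
stats = latency_percentiles(lambda: sum(range(1000)), repeats=200)
```

Measuring a query that is known to hit Layer 1 and one that is known to fall through to Layer 2 separately keeps the two paths from blurring into one number.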
Where to go next¶
- Performance baseline - measured numbers across stores and dtypes.
- Quantization - picking the right dtype.
- Multi-process - when stale-tolerant polling cost matters.