Capacity planning & RAM management¶
This page is the honest answer to two questions:
- How big can a mneme cache get before something breaks?
- How does RAM behave over time, and how do I keep it from drifting?
The short version:
- The store scales independently of the index (SQLite handles billions of rows; Redis/Postgres/DynamoDB are effectively unbounded).
- The in-memory index is the real ceiling on a single host. Past ~500k entries at d=768 fp32, switch to index_backend="hnsw". Past 1M, you usually also want vector_dtype="int8".
- The index is a NumPy matrix (or hnswlib graph) that grows as you put and does not shrink on delete, TTL expiry, or eviction until compact() runs, either directly or through vacuum(), which compacts by default. Long-running caches with churn must compact periodically.
The rest of the page makes those statements concrete.
Per-store scaling limits¶
| Store | Storage limit | Latency at 1M entries (p99) | Where it stops scaling |
|---|---|---|---|
| MemoryStore | RAM | ~µs | Process restart loses everything; bound by host RAM |
| SQLiteStore (default) | Disk; SQLite handles billions of rows | ~3 ms | Index rebuild on open() (seconds at 1M) |
| RedisStore | Redis server RAM | ~0.5–2 ms (same DC) | Redis instance memory |
| PostgresStore | Disk; effectively unbounded | ~2–10 ms | Network round-trip cost dominates |
| DynamoDBStore | AWS-managed; effectively unbounded | ~10–30 ms | Per-item cost; eventual-consistency knobs |
Latency numbers are rough — your network and hardware vary, but the relative ordering holds. See Performance for measured baselines.
The hidden ceiling: the in-memory index¶
Regardless of which store you pick, every embedding is also held in an in-memory matrix (or hnswlib graph) so Layer-2 cosine similarity is fast. That is the true single-host scaling wall:
| Records | NumpyIndex RAM (d=768, fp32) | NumpyIndex p99 search | Comment |
|---|---|---|---|
| 100k | ~300 MB | ~3 ms | Sweet spot; default config |
| 500k | ~1.5 GB | ~12 ms | index_backend="auto" switches to hnsw at this point |
| 1M | ~3 GB | ~25 ms (NumPy) / ~0.5 ms (hnsw) | hnsw becomes mandatory |
| 10M | ~30 GB | NumPy infeasible / ~1 ms (hnsw) | hnsw + tuned M/ef |
Two knobs change the math:
- vector_dtype="int8": cuts index RAM by 4× (1M × 768 × 1 byte ≈ 768 MB). Some accuracy loss (~1–3% similarity drift); calibrate against your corpus.
- index_backend="hnsw" (or "auto" past 500k): makes search roughly O(log N), keeping it at about a millisecond or less even at 10M.
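The RAM column in the table above is plain arithmetic: records × dimensions × bytes per element. A minimal sketch of that calculation follows; the helper name `index_ram_bytes` and the dtype labels are illustrative, not part of mneme, and the figure covers only the raw vectors (an hnswlib graph adds link overhead on top).

```python
# Bytes per vector element for the two dtypes discussed on this page.
BYTES_PER_ELEMENT = {"fp32": 4, "int8": 1}


def index_ram_bytes(records: int, dim: int = 768, dtype: str = "fp32") -> int:
    """Raw size of a dense records x dim matrix, ignoring per-index overhead."""
    return records * dim * BYTES_PER_ELEMENT[dtype]


for n in (100_000, 500_000, 1_000_000, 10_000_000):
    fp32_gb = index_ram_bytes(n, dtype="fp32") / 1e9
    int8_gb = index_ram_bytes(n, dtype="int8") / 1e9
    print(f"{n:>10,} records: fp32 {fp32_gb:6.2f} GB, int8 {int8_gb:6.2f} GB")
```

Plug in your own embedding dimension before sizing a host; d=768 is only the example used throughout this page.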
TTL — time-based expiry¶
Set a max age per entry. Two ways:
```python
from mneme import SemanticCache

cache = SemanticCache(
    path="cache.db",
    embedder=my_embedder,
    default_ttl=86400,  # every entry expires 24h after creation
)

# or per-put override:
cache.put(query, response, ttl=3600)  # this entry expires in 1h
```
How TTL actually expires:
- Lazy on read (always on): every cache.get() checks the candidate's created_at + ttl; if stale, the entry is deleted on the spot, the expired metric increments, and the get returns None. Entries you query naturally fall off when stale.
- Sweep on demand: cache.vacuum() scans everything and deletes any expired entries up front. Not automatic; you call it from a scheduler.
```python
# good place: a daily cron, or once on app boot
n_purged = cache.vacuum()
log.info(f"vacuumed {n_purged} expired entries")
```
There is no background thread doing this automatically — by design. mneme is a library, not a daemon. Schedule cache.vacuum() from your app's existing scheduler (APScheduler, Celery beat, cron, systemd timer).
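To make the two expiry paths concrete, here is a toy model of lazy-on-read expiry plus an on-demand sweep. It is a sketch under stated assumptions, not mneme's implementation: `ToyTTLCache` is a hypothetical class, keys are exact strings rather than embeddings, and the `now` parameter exists only so the example is deterministic.

```python
import time


class ToyTTLCache:
    """Illustrative stand-in for the two expiry paths; not mneme's code."""

    def __init__(self, default_ttl=None):
        self.default_ttl = default_ttl
        self._entries = {}  # key -> (value, created_at, ttl)
        self.expired = 0    # counts entries dropped lazily on read

    def put(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        effective_ttl = self.default_ttl if ttl is None else ttl
        self._entries[key] = (value, now, effective_ttl)

    @staticmethod
    def _is_stale(created_at, ttl, now):
        return ttl is not None and now >= created_at + ttl

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, created_at, ttl = entry
        if self._is_stale(created_at, ttl, now):
            del self._entries[key]  # lazy expiry: delete on the spot
            self.expired += 1
            return None
        return value

    def vacuum(self, now=None):
        """Sweep every entry and delete the stale ones; returns how many."""
        now = time.time() if now is None else now
        stale = [
            key
            for key, (_, created_at, ttl) in self._entries.items()
            if self._is_stale(created_at, ttl, now)
        ]
        for key in stale:
            del self._entries[key]
        return len(stale)


toy = ToyTTLCache(default_ttl=60)
toy.put("never read again", "a", now=0)
toy.put("read often", "b", ttl=10, now=0)
stale_read = toy.get("read often", now=11)  # stale, so it is deleted lazily
n_purged = toy.vacuum(now=61)               # sweeps the entry nobody queried
```

The entry that is never read again only disappears when the sweep runs, which is why an unscheduled vacuum() lets unqueried entries pile up.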
Re-put resets the TTL
A second put() for the same query replaces the existing entry with a fresh created_at and the new TTL (or default_ttl if not specified). So calling cache.put(q, fresh_response) to refresh stale data extends the entry's life by default_ttl from now — not by remaining TTL. If you want a "refresh-doesn't-extend" pattern, pass ttl=remaining_seconds explicitly.
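The "refresh-doesn't-extend" pattern needs the number of seconds the original entry still had to live. A minimal sketch of that arithmetic, assuming you record the original creation time and TTL yourself (the helper `remaining_ttl` is hypothetical, and this page does not say how mneme exposes created_at):

```python
import time


def remaining_ttl(created_at: float, ttl: float, now=None) -> float:
    """Seconds left before an entry created at `created_at` with `ttl` expires."""
    now = time.time() if now is None else now
    return max(0.0, created_at + ttl - now)


# Entry created at t=1000 with a 1h TTL, refreshed 45 minutes later:
seconds_left = remaining_ttl(created_at=1000.0, ttl=3600.0, now=1000.0 + 2700.0)
```

You would then pass the result as ttl= on the refreshing put. If the result is 0 the entry has already expired, so check for that case rather than passing a zero TTL, whose meaning is not defined on this page.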
LRU eviction — size-based fall-off¶
Set a max number of entries; the cache automatically drops the least-recently-accessed entries when you go over.
How LRU actually evicts:
- Runs automatically on every put(). No cron, no manual call.
- "Oldest" = lowest last_accessed_at. Every get() updates that timestamp, so frequently-queried entries stay; quiet ones fall off first.
- Eviction is batched: when you go over, it drops max(excess, 10% of cap, 1) entries at once, so it amortizes the cost.
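The eviction rules above can be sketched with a small toy cache. This is an illustration under assumptions, not mneme's code: `ToyLRU` is hypothetical, it uses a logical counter instead of wall-clock timestamps, it assumes put() also refreshes last_accessed_at, and it rounds "10% of cap" down to an integer.

```python
import itertools


class ToyLRU:
    """Illustrative batched LRU eviction; not mneme's implementation."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._values = {}
        self._last_accessed_at = {}
        self._clock = itertools.count()  # logical clock keeps the demo deterministic
        self.evictions = 0

    def _touch(self, key) -> None:
        self._last_accessed_at[key] = next(self._clock)

    def get(self, key):
        if key not in self._values:
            return None
        self._touch(key)  # reads keep an entry "young"
        return self._values[key]

    def put(self, key, value) -> None:
        self._values[key] = value
        self._touch(key)
        excess = len(self._values) - self.max_entries
        if excess > 0:
            # Drop a whole batch so the sort is not repeated on every put.
            batch = max(excess, self.max_entries // 10, 1)
            oldest_first = sorted(self._last_accessed_at, key=self._last_accessed_at.get)
            for victim in oldest_first[:batch]:
                del self._values[victim]
                del self._last_accessed_at[victim]
            self.evictions += batch


lru = ToyLRU(max_entries=20)
for i in range(20):
    lru.put(f"k{i}", i)
lru.get("k0")        # k0 is now the most recently accessed of the first 20
lru.put("k20", 20)   # one over the cap: the batch is max(1, 2, 1) = 2 entries
```

After the last put, k1 and k2 are gone while k0 survives because it was read, and the cache sits one below its cap until the next put.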
Per-namespace quotas provide multi-tenant fairness: they take precedence over the global cap, so one chatty namespace can't push out another's data. See Multi-tenant for the full picture.
How RAM actually behaves — and the tombstone problem¶
This is the part most users miss until they hit it in production.
What happens on put¶
The in-memory NumPy matrix has a starting capacity (256 rows). When you exceed it, the matrix is reallocated at 2× capacity and the old rows are copied. Capacity therefore grows geometrically (256, 512, 1,024, and so on).
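The doubling rule described here is easy to trace by hand. The sketch below only models the capacity arithmetic (starting at 256 rows, doubling whenever the row count exceeds capacity); the function name `capacity_after` is illustrative and no real matrix is allocated.

```python
def capacity_after(n_rows: int, initial: int = 256, factor: int = 2) -> tuple[int, int]:
    """Return (capacity, reallocations) once n_rows have been appended."""
    capacity, reallocations = initial, 0
    while capacity < n_rows:
        capacity *= factor
        reallocations += 1
    return capacity, reallocations


capacity, reallocations = capacity_after(1_000_000)
print(capacity, reallocations)  # 1048576 12
```

Under this rule one million rows occupy a 1,048,576-row matrix after 12 reallocations, so allocated capacity can sit up to 2× above the live row count even before any deletes.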
What happens on delete, TTL expiry, or LRU eviction¶
The corresponding row in the matrix is marked as a tombstone (_tombstones: set[int]) but its bytes are not reclaimed. The matrix capacity stays at whatever it was.
Concrete consequence:
1. put 1,000,000 entries → matrix is 1M × 768 × 4B ≈ 3 GB
2. TTL expires 990,000 entries → matrix is still ~3 GB; 10k rows are "live"
3. cache.stats().entries → 10,000 (correct; queried from store)
4. cache.stats().memory_bytes_estimate → ~30 MB (live × dim × dtype) — does NOT include tombstones
5. cache.stats().index_memory_bytes → ~3 GB (the actual matrix nbytes)
The two new Stats fields (index_memory_bytes and index_tombstone_count) make this drift visible. See Types reference for full field definitions. If they're out of proportion to entries, your in-memory index needs to be compacted.
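The gap between the two memory figures can be reproduced with a toy model that only tracks counts. This is a sketch, not mneme's index: `ToyIndex` is hypothetical, it reuses the Stats field names from this page as property names, and it allocates no real vectors.

```python
class ToyIndex:
    """Illustrative only: deleted rows are marked, never freed, until compact()."""

    def __init__(self, dim: int = 768, bytes_per_element: int = 4) -> None:
        self.row_bytes = dim * bytes_per_element
        self.rows_allocated = 0
        self._tombstones: set[int] = set()

    def add(self, n_rows: int) -> None:
        self.rows_allocated += n_rows

    def delete(self, row_ids) -> None:
        self._tombstones.update(row_ids)  # mark only; no bytes are reclaimed

    @property
    def index_tombstone_count(self) -> int:
        return len(self._tombstones)

    @property
    def memory_bytes_estimate(self) -> int:
        live_rows = self.rows_allocated - len(self._tombstones)
        return live_rows * self.row_bytes

    @property
    def index_memory_bytes(self) -> int:
        return self.rows_allocated * self.row_bytes

    def compact(self) -> int:
        """Drop tombstoned rows; returns how many were reclaimed."""
        reclaimed = len(self._tombstones)
        self.rows_allocated -= reclaimed
        self._tombstones.clear()
        return reclaimed


index = ToyIndex()
index.add(1_000_000)
index.delete(range(990_000))
before = (index.memory_bytes_estimate, index.index_memory_bytes)
n_reclaimed = index.compact()
after = (index.memory_bytes_estimate, index.index_memory_bytes)
```

Before compacting, the estimate reports about 30 MB while the index still holds about 3 GB; afterwards both figures agree.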
How to actually reclaim the memory¶
Call cache.compact():
```python
# Reclaims tombstone memory in the in-memory index. The store is not touched:
# entries already deleted from the store remain deleted.
n_reclaimed = cache.compact()
```
Or let vacuum() do it for you (the default since v1.0):
```python
# vacuum() purges TTL-expired entries AND auto-compacts the index.
n_purged = cache.vacuum()               # compact=True (default)
n_purged = cache.vacuum(compact=False)  # split if you want compact on a different cadence
```
compact() is cheap when there are no tombstones: it returns early without rebuilding anything.
For HnswIndex, compact rebuilds each per-namespace index from the live set — same effect as for NumpyIndex but the implementation is a graph rebuild rather than an in-place matrix shrink.
Compact cost at scale
Compact is O(N) over the live entry count: NumpyIndex allocates a fresh matrix and copies the live rows; HnswIndex re-adds each live row (via add_items) into a fresh hnswlib index. At 1M entries that's ~tens to hundreds of milliseconds, and the cache lock is held for the duration. For big caches, schedule it from a context that is not latency-critical (overnight cron, low-traffic window). The async cache routes compact through asyncio.to_thread, so the event loop keeps servicing other I/O, but the cache lock is held the whole time, so concurrent gets/puts wait.
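The interaction between asyncio.to_thread and a held lock can be seen in a small standalone demo. Everything here is a stand-in: `cache_lock`, `blocking_compact`, and `blocking_get` are hypothetical names, the sleep simulates the rebuild, and none of it is mneme's code.

```python
import asyncio
import threading
import time

cache_lock = threading.Lock()  # stand-in for the cache's internal lock


def blocking_compact(duration_s: float) -> None:
    """Pretend to rebuild the index while holding the lock."""
    with cache_lock:
        time.sleep(duration_s)


def blocking_get() -> str:
    """A read that needs the same lock, so it waits for compact to finish."""
    with cache_lock:
        return "hit"


async def demo(compact_seconds: float = 0.2) -> tuple[int, float]:
    loop_ticks = 0

    async def ticker() -> None:
        nonlocal loop_ticks
        while True:
            await asyncio.sleep(0.01)
            loop_ticks += 1  # proves the event loop is still running

    tick_task = asyncio.create_task(ticker())
    compact_task = asyncio.create_task(asyncio.to_thread(blocking_compact, compact_seconds))
    await asyncio.sleep(0.05)  # give the worker thread time to take the lock

    started = time.perf_counter()
    await asyncio.to_thread(blocking_get)  # blocked until compact releases the lock
    get_wait = time.perf_counter() - started

    await compact_task
    tick_task.cancel()
    return loop_ticks, get_wait


ticks, wait_seconds = asyncio.run(demo())
```

The ticker keeps advancing while the compact runs, but the get still waits for most of the compact duration. That is the reason to schedule compaction in a low-traffic window even when using the async cache.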
Operational recipe — production deployment¶
For a 1M-entry single-host cache that doesn't drift:
```python
import logging

from mneme import SemanticCache

log = logging.getLogger(__name__)

cache = SemanticCache(
    path="cache.db",
    embedder=my_embedder,        # your embedder, defined elsewhere
    index_backend="hnsw",        # mandatory at this scale
    vector_dtype="int8",         # 4× memory cut
    max_entries=1_000_000,       # LRU cap so RAM is bounded
    default_ttl=30 * 24 * 3600,  # 30-day expiry
)


# Once a day, somewhere (cron, APScheduler, your existing scheduler):
def daily_maintenance() -> None:
    n = cache.vacuum()  # purge TTL-expired + auto-compact
    log.info(f"daily vacuum: removed {n} expired")
    s = cache.stats()
    log.info(
        f"entries={s.entries} "
        f"index_memory_bytes={s.index_memory_bytes} "
        f"tombstones={s.index_tombstone_count}"
    )
```
Multi-host (5M entries, shared cache)¶
```python
from mneme import SemanticCache, RedisStore

cache = SemanticCache(
    store=RedisStore(url="redis://cache.internal:6379/0"),
    embedder=my_embedder,
    index_backend="hnsw",
    vector_dtype="int8",
    max_entries=5_000_000,
    default_ttl=14 * 24 * 3600,
)
```
Each replica rebuilds its in-memory HnswIndex from Redis on open() (one-time cost; ~tens of seconds at 5M). Within a process, the same RAM-management rules apply: schedule cache.vacuum() to keep the index compact.
Monitoring — what to watch¶
Track these from cache.stats() and alert when they drift:
| Metric | Healthy | Warning | What to do |
|---|---|---|---|
| entries | Below max_entries | At cap, sustained | LRU is doing its job; fine. Concern only if cap is wrong for your workload. |
| index_tombstone_count | < 10% of entries | > 25% of entries | Call cache.compact() (or just cache.vacuum()). |
| index_memory_bytes | Within 2× of memory_bytes_estimate | > 4× | Same fix: compact. |
| expirations | Steady non-zero (TTL purges) | Zero with default_ttl set | TTL not firing. Are you calling vacuum()? |
| evictions | Non-zero with max_entries set | Many evictions per second | Cap is too low for your put rate; raise max_entries or shorten default_ttl. |
The Prometheus and OpenTelemetry adapters surface all of these via Metrics.
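If you alert from application code rather than from a metrics backend, the two index-drift thresholds in the table translate into a few comparisons. A minimal sketch: `StatsSnapshot` is a hypothetical stand-in that carries only the cache.stats() fields named on this page, and `index_warnings` is an illustrative helper, not a mneme API.

```python
from dataclasses import dataclass


@dataclass
class StatsSnapshot:
    """Stand-in carrying the cache.stats() fields used by the checks below."""

    entries: int
    memory_bytes_estimate: int
    index_memory_bytes: int
    index_tombstone_count: int


def index_warnings(s: StatsSnapshot) -> list[str]:
    """Apply the 'Warning' column thresholds for the two index-drift metrics."""
    warnings = []
    if s.index_tombstone_count > 0.25 * s.entries:
        warnings.append("tombstones exceed 25% of entries: compact")
    if s.entries > 0 and s.index_memory_bytes > 4 * s.memory_bytes_estimate:
        warnings.append("index memory exceeds 4x the live estimate: compact")
    return warnings


drifted = StatsSnapshot(
    entries=10_000,
    memory_bytes_estimate=30_720_000,
    index_memory_bytes=3_072_000_000,
    index_tombstone_count=990_000,
)
print(index_warnings(drifted))
```

Feed it the real object returned by cache.stats() if the attribute names match in your version; the thresholds are the ones from the table and are worth tuning to your churn rate.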
What mneme deliberately doesn't do¶
- No background expiration thread. TTL purges happen lazily on read or when you call vacuum(). No daemon, no extra threads.
- No automatic cap. max_entries=None is the default, so the cache grows until you fill the disk. Always set max_entries in production.
- No "drop random N if too big" panic mode. If you blow past max_entries between puts (e.g., bulk insert), nothing kicks in until the next put(). Use cache.vacuum() or call Store.delete_expired() directly to force a sweep.
- No automatic compact threshold. A future version may add an opt-in auto-compact when tombstone_count / entries > 0.25, but v1 keeps it explicit so you control when the (~O(N)) cost lands.
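Until such an option exists, a threshold-triggered compact is a few lines of application code. The sketch below assumes only what this page documents: cache.stats() exposes entries and index_tombstone_count, and cache.compact() returns the number of reclaimed rows. `compact_if_needed` and `FakeCache` are hypothetical; the fake exists so the example runs without a real cache.

```python
from types import SimpleNamespace


def compact_if_needed(cache, threshold: float = 0.25) -> int:
    """Call cache.compact() only when tombstones exceed threshold x live entries."""
    s = cache.stats()
    # Multiplying instead of dividing avoids a ZeroDivisionError on an empty cache.
    if s.index_tombstone_count > threshold * s.entries:
        return cache.compact()
    return 0


class FakeCache:
    """Minimal stand-in exposing only stats() and compact() for the demo."""

    def __init__(self, entries: int, tombstones: int) -> None:
        self._entries = entries
        self._tombstones = tombstones

    def stats(self):
        return SimpleNamespace(
            entries=self._entries,
            index_tombstone_count=self._tombstones,
        )

    def compact(self) -> int:
        reclaimed, self._tombstones = self._tombstones, 0
        return reclaimed


skipped = compact_if_needed(FakeCache(entries=1000, tombstones=100))
reclaimed = compact_if_needed(FakeCache(entries=1000, tombstones=300))
```

Call it from the same scheduler that runs vacuum(), so the O(N) cost still lands at a time you chose.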
Quick "what to use when" rules of thumb¶
- "I'm caching LLM responses for one Flask app on one server." → SQLiteStore with max_entries + default_ttl + scheduled vacuum().
- "I'm running 5 worker processes on one box." → SQLiteStore + multi_process_mode="stale-tolerant".
- "I have 3 app servers behind a load balancer." → RedisStore.
- "I'm caching at 1M+ entries." → index_backend="hnsw" + vector_dtype="int8".
- "I'm writing tests." → MemoryStore.
- "I'm on AWS Lambda." → DynamoDBStore.
- "I have Postgres but no Redis." → PostgresStore.