Quantization
mneme always persists vectors as float32. The in-memory index can use a smaller dtype (float16 or int8) to save memory at the cost of some extra search-time work. Set it with the vector_dtype= argument when you instantiate the cache, for example vector_dtype="int8".
What changes between dtypes
| dtype | bytes/element | 100k × d=1536 matrix | Cosine accuracy | Search latency |
|---|---|---|---|---|
| float32 (default) | 4 | 614 MB | exact | fastest |
| float16 | 2 | 307 MB | ~0.1% drift on typical embeddings | slightly slower (cast each search) |
| int8 | 1 | 154 MB | 1–3% drift on typical embeddings | slower at high d (no fused int8 GEMM in NumPy) |
The store on disk is always float32 regardless. Quantization is purely an in-memory optimization, applied once when the index is built. You can switch dtypes between runs without rebuilding the store.
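To get a feel for the float16 row, the drift can be measured directly in NumPy. This is a minimal sketch on random unit vectors, not mneme's code: the array names and the small 2,000 × 256 shape are assumptions chosen so it runs quickly, and real embeddings may drift differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2_000, 256  # small stand-in for 100k x 1536

# L2-normalized float32 "store" vectors and one normalized query.
store = rng.standard_normal((n, d)).astype(np.float32)
store /= np.linalg.norm(store, axis=1, keepdims=True)
query = rng.standard_normal(d).astype(np.float32)
query /= np.linalg.norm(query)

# Quantize once, as if at index-build time.
index_f16 = store.astype(np.float16)

# Exact scores vs. scores from the float16 index (cast back up per search).
exact = store @ query
approx = index_f16.astype(np.float32) @ query

max_abs_drift = float(np.max(np.abs(exact - approx)))
print(f"float32: {store.nbytes} bytes, float16: {index_f16.nbytes} bytes")
print(f"max absolute cosine drift: {max_abs_drift:.2e}")
```

Running the same comparison on a sample of your own embeddings is a cheap way to check whether the table's drift figures hold for your model.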
Picking one
flowchart TD
Start([Need to choose a dtype]) --> A{RAM tight?}
A -- no --> F32[float32<br/>default]
A -- yes --> B{Latency critical?}
B -- no --> I8[int8]
B -- yes --> C{High dim<br/>>=1024 ?}
C -- no --> F16[float16]
C -- yes --> D[float16<br/>or hnsw + int8]
style F32 fill:#0a3,color:#fff
style F16 fill:#06b,color:#fff
style I8 fill:#90c,color:#fff
style D fill:#a40,color:#fff
- float32: the default. Use when memory is plentiful. Cleanest; no cast on every search; smallest accuracy drift (zero).
- float16: 2× memory cut for negligible accuracy loss. The cast on every search adds a few milliseconds at 100k × 1536, but the matvec is BLAS-fast on the result. Best general-purpose memory optimization.
- int8: 4× memory cut. Symmetric quantization with implicit scale 127; assumes L2-normalized input. Search latency on pure NumPy at high dim is bandwidth-bound (~50 ms at 100k × 1536) because the int8 → float32 cast dominates. The win is memory footprint, not speed. Calibrate the similarity threshold against int8 vectors if that's your prod dtype - cosine scores shift slightly.
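The int8 scheme described above (symmetric, implicit scale 127, L2-normalized input) can be sketched in a few lines of NumPy. The function names here are hypothetical and this is not mneme's implementation. It also assumes the query stays in float32, which the text does not specify.

```python
import numpy as np

SCALE = 127.0  # implicit scale: components of unit-norm vectors lie in [-1, 1]


def quantize_int8(vectors: np.ndarray) -> np.ndarray:
    """Symmetric quantization of L2-normalized float32 rows to int8."""
    return np.clip(np.rint(vectors * SCALE), -127, 127).astype(np.int8)


def search_int8(index: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Approximate cosine scores; the int8 -> float32 cast touches the whole matrix."""
    return (index.astype(np.float32) @ query) / SCALE


rng = np.random.default_rng(0)
store = rng.standard_normal((1_000, 128)).astype(np.float32)
store /= np.linalg.norm(store, axis=1, keepdims=True)

index = quantize_int8(store)
query = store[42]  # query with a vector that is already stored
scores = search_int8(index, query)
best = int(np.argmax(scores))
print(best, float(scores[best]))
```

The cast inside search_int8 allocates a full float32 copy of the matrix on every call, which is why this layout saves resident memory rather than time.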
If you need both small footprint and low latency at scale, pair int8 with index_backend="hnsw". hnswlib's approximate-NN search avoids the full matrix scan; the int8 cast cost mostly disappears.
Calibration matters more with int8
int8 quantization shifts cosine scores by 1–3% on most embeddings. A threshold of 0.78 on fp32 might land at 0.76 on int8 for the same paraphrase pair. Calibrate against the dtype you'll run in production:
from mneme.tools.calibrate import find_threshold

# This explicitly evaluates with int8-quantized vectors:
result = find_threshold(
    paraphrase_pairs=...,
    distractor_pairs=...,
    embedder=...,
    vector_dtype="int8",  # ← match prod
    target_metric="f1",
)
If you skip this step and copy a threshold from an fp32 calibration, you'll see slightly more misses (or false positives, depending on the drift direction) when you switch to int8. See Calibration.
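To illustrate why a copied threshold misfires, here is a small F1 sweep over made-up similarity scores. It is a sketch of the general idea only: best_f1_threshold is a hypothetical helper, not how find_threshold works internally, and the uniform 0.02 downward shift is an assumption chosen to mirror the 0.78 → 0.76 example above.

```python
import numpy as np


def best_f1_threshold(pos_scores, neg_scores):
    """Return (threshold, f1) maximizing F1, trying every observed score."""
    pos = np.asarray(pos_scores, dtype=np.float64)
    neg = np.asarray(neg_scores, dtype=np.float64)
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(np.concatenate([pos, neg])):
        tp = int(np.sum(pos >= t))  # paraphrases accepted as hits
        fp = int(np.sum(neg >= t))  # distractors wrongly accepted
        fn = pos.size - tp          # paraphrases rejected
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1


# Made-up scores: the int8 set is the fp32 set shifted down by 0.02.
fp32_pos = [0.91, 0.84, 0.80, 0.79]
fp32_neg = [0.77, 0.70, 0.62, 0.55]
int8_pos = [s - 0.02 for s in fp32_pos]
int8_neg = [s - 0.02 for s in fp32_neg]

t_fp32, _ = best_f1_threshold(fp32_pos, fp32_neg)
t_int8, _ = best_f1_threshold(int8_pos, int8_neg)

# Reusing the fp32 threshold on int8 scores rejects true paraphrases.
missed = sum(s < t_fp32 for s in int8_pos)
print(t_fp32, t_int8, missed)
```

Real drift is not a uniform shift, so the direction and size of the error depend on your embeddings. That is the reason to calibrate on the production dtype rather than adjust by a fixed offset.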
Switching dtypes at runtime
cache.requantize(dtype) rebuilds the in-memory matrix at a new dtype without touching the store:
cache = SemanticCache(path="cache.db", embedder=embedder, vector_dtype="float32")
# ... cache fills up ...
cache.requantize("int8") # 4x memory cut, instant
Going down is lossy for the in-memory matrix: fp32 → int8 throws away precision. Going back up is not, because the cache does not upcast the int8 matrix. It re-reads the float32 source of truth from the store, so requantizing back to fp32 gives you the original vectors verbatim.
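The difference between upcasting the int8 matrix and rebuilding from the float32 store can be shown with a toy model. ToyIndex below is a hypothetical stand-in written for this illustration. It only mimics the behavior described above and makes no claim about how SemanticCache is implemented.

```python
import numpy as np


class ToyIndex:
    """Hypothetical stand-in: a float32 'store' plus a rebuildable in-memory matrix."""

    def __init__(self, store: np.ndarray, dtype: str = "float32"):
        self._store = np.asarray(store, dtype=np.float32)  # source of truth
        self.requantize(dtype)

    def requantize(self, dtype: str) -> None:
        # Always rebuild from the float32 store, never from the current matrix.
        if dtype == "int8":
            scaled = np.rint(self._store * 127.0)
            self.matrix = np.clip(scaled, -127, 127).astype(np.int8)
        elif dtype in ("float16", "float32"):
            self.matrix = self._store.astype(dtype)
        else:
            raise ValueError(f"unsupported dtype: {dtype!r}")


rng = np.random.default_rng(0)
vecs = rng.standard_normal((100, 64)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

index = ToyIndex(vecs, "float32")
index.requantize("int8")
naive_upcast = index.matrix.astype(np.float32) / 127.0  # precision already lost
index.requantize("float32")                             # rebuilt from the store

exact_roundtrip = np.array_equal(index.matrix, vecs)
naive_is_lossy = not np.array_equal(naive_upcast, vecs)
print(exact_roundtrip, naive_is_lossy)
```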
Memory math
Approximate in-memory matrix size: n × d × bytes-per-element.
For 100k entries:
| d | float32 | float16 | int8 |
|---|---|---|---|
| 384 | 154 MB | 77 MB | 38 MB |
| 768 | 307 MB | 154 MB | 77 MB |
| 1024 | 410 MB | 205 MB | 102 MB |
| 1536 | 614 MB | 307 MB | 154 MB |
| 3072 | 1.2 GB | 614 MB | 307 MB |
Multiply by 1.05–1.1 for index bookkeeping (offset arrays, namespace maps, tombstones). For 1M entries scale 10×.
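The same arithmetic as a small helper, in case you want numbers for a different n or d. matrix_megabytes is a hypothetical function name, and it uses decimal megabytes (1 MB = 10^6 bytes) to match the table.

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}


def matrix_megabytes(n: int, d: int, dtype: str, overhead: float = 1.0) -> float:
    """Approximate in-memory matrix size in MB: n * d * bytes-per-element * overhead."""
    return n * d * BYTES_PER_ELEMENT[dtype] * overhead / 1e6


for d in (384, 768, 1024, 1536, 3072):
    sizes = [matrix_megabytes(100_000, d, dt) for dt in BYTES_PER_ELEMENT]
    print(d, [f"{mb:.0f} MB" for mb in sizes])

# 1M entries at d=1536 in int8, with the upper 1.1 bookkeeping factor.
print(f"{matrix_megabytes(1_000_000, 1536, 'int8', overhead=1.1):.0f} MB")
```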
SemanticCache.stats().memory_bytes_estimate reports the current matrix size live; the showcase dashboard surfaces it in the entry counter.
Where to go next
- Calibration - calibrate against your chosen dtype.
- Performance tuning - how dtype interacts with hnsw and chunked matvec.
- Performance baseline - measured latencies by dtype.