Bring your own embedder
mneme is bring-your-own-embedder. The library never imports an embedder package; you wire one up against the Embedder Protocol.
The contract
A sync embedder satisfies the Embedder Protocol if it provides three things:
```python
import numpy


class MyEmbedder:
    @property
    def dim(self) -> int:
        """Length of the vector returned by embed()."""

    @property
    def fingerprint(self) -> str:
        """A stable string identifying the (model, dim, normalization) tuple.

        Reused across runs; the cache validates it on open()."""

    def embed(self, text: str) -> numpy.ndarray:
        """Return a 1-D float32 vector of shape (self.dim,)."""
```
An async embedder has the same shape, except that `embed` is a coroutine: `async def embed(text) -> numpy.ndarray`.
The full Protocol - both sync and async - is documented in the API reference.
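For tests it can help to have an embedder that satisfies the contract without loading a model. The sketch below is one way to do that; `ToyHashEmbedder` is a hypothetical name, not part of mneme, and its vectors are deterministic noise with no semantic meaning.

```python
import hashlib

import numpy as np


class ToyHashEmbedder:
    """Deterministic stand-in embedder for tests (hypothetical, not shipped with mneme)."""

    def __init__(self, dim: int = 8):
        self._dim = dim

    @property
    def dim(self) -> int:
        return self._dim

    @property
    def fingerprint(self) -> str:
        # Names the "model", the dimension, and the normalization.
        return f"toy-hash:dim{self._dim}:l2"

    def embed(self, text: str) -> np.ndarray:
        # Seed a generator from a stable hash so the same text always maps to the same vector.
        seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
        v = np.random.default_rng(seed).standard_normal(self._dim).astype(np.float32)
        # L2-normalize before returning, as the cache expects.
        return v / np.linalg.norm(v)
```

Because the vectors are unrelated to meaning, this is only useful for exercising plumbing (shape, dtype, fingerprint checks), not for measuring hit rates.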
Conventions that pay off
Three properties that make the cache happy:
- L2-normalize the vector before returning. Cosine similarity is the dot product of unit vectors; the cache assumes you've normalized. If you skip this, hits land further from `1.0` than they should and your threshold needs to drift.
- Make `fingerprint` deterministic and specific. Include the model name, the dimension, and any normalization or instruction-prefix flags. The cache refuses to mix incompatible vectors via `EmbedderMismatchError` - that's only useful if your fingerprint actually changes when the model changes.
- Treat `dim` as a property of the model, not a parameter. It should match `embed()`'s output exactly. If you can configure the model to return a smaller vector (e.g. OpenAI's `dimensions=` parameter), bake the chosen dim into the fingerprint so the cache notices.
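The first two conventions can be sketched as a pair of small helpers. Both `l2_normalize` and `make_fingerprint` are hypothetical names introduced for illustration, and the fingerprint layout is just one reasonable choice, not a format mneme requires.

```python
import numpy as np


def l2_normalize(v) -> np.ndarray:
    """Return v as a unit-length float32 vector (a zero vector is returned unchanged)."""
    v = np.asarray(v, dtype=np.float32)
    n = float(np.linalg.norm(v))
    return v / n if n > 0 else v


def make_fingerprint(provider: str, model: str, dim: int, *,
                     normalized: bool = True, prefix: str = "") -> str:
    """Fold every setting that changes the vector space into one stable string."""
    parts = [provider, model, f"dim{dim}", f"n{int(normalized)}"]
    if prefix:
        # Instruction prefixes change the embedding, so they belong in the fingerprint too.
        parts.append(f"prefix={prefix}")
    return ":".join(parts)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
cosine = float(a @ b)  # for unit vectors, the dot product is the cosine similarity
```

Here `cosine` comes out at about 0.96, whereas the dot product of the raw inputs would be 24, which is why an unnormalized embedder throws off any threshold tuned for the 0 to 1 range.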
Thread- and task-safety
SemanticCache serializes every public method through a single RLock, but it intentionally drops that lock around your embedder call so concurrent gets can fan out. That means the cache assumes:
- Sync `Embedder.embed()` is thread-safe. Multiple threads may call it concurrently.
- Async `AsyncEmbedder.embed()` is task-safe. Multiple `asyncio` tasks may await it concurrently.
Most network-backed embedders (OpenAI, Bedrock, Ollama HTTP) are fine — each call opens its own client connection. Local model embedders are where this bites:
- `sentence-transformers` on CPU: generally thread-safe.
- `sentence-transformers` on a single GPU: concurrent calls can race the GPU stream and produce garbage vectors. Wrap in a single-thread executor or an `asyncio.Semaphore(1)`.
- Batch embedders that mutate internal state: also serialize.
If your embedder isn't safe under concurrency, the simplest fix is a serializing wrapper:
```python
import asyncio


class SerializedAsyncEmbedder:
    """Wraps an AsyncEmbedder so concurrent embed() calls run one at a time."""

    def __init__(self, inner):
        self._inner = inner
        self._sem = asyncio.Semaphore(1)

    @property
    def dim(self):
        return self._inner.dim

    @property
    def fingerprint(self):
        return self._inner.fingerprint

    async def embed(self, text):
        # Only one task at a time gets past the semaphore.
        async with self._sem:
            return await self._inner.embed(text)
```
The cost is no concurrency on the embed step itself — but the rest of the cache (Layer-1 hash lookup, Layer-2 matvec, store writes) still runs concurrently across requests, so this only matters if your embedder is the bottleneck.
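For a sync embedder the same idea works with a thread lock. This is a minimal sketch; `SerializedEmbedder` is a hypothetical name mirroring the async wrapper and is not something mneme provides.

```python
import threading


class SerializedEmbedder:
    """Wraps a sync Embedder so concurrent embed() calls run one at a time."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    @property
    def dim(self):
        return self._inner.dim

    @property
    def fingerprint(self):
        # The wrapper does not change the vectors, so the fingerprint passes through.
        return self._inner.fingerprint

    def embed(self, text):
        # Threads queue on the lock; only one is inside the inner embedder at any moment.
        with self._lock:
            return self._inner.embed(text)
```

The trade-off is the same as for the async wrapper: embedding is serialized, everything else stays concurrent.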
Reference embedders
Each reference snippet is a real-world starting point. Copy into your own code; mneme never imports them. Sources live at examples/reference_embedders/.
OpenAI: `text-embedding-3-small` (1536-dim) or `text-embedding-3-large` (3072-dim). Sync + async pair.
```python
import numpy as np


class OpenAIEmbedder:
    def __init__(self, client, *, model="text-embedding-3-small", dimensions=None):
        self._client = client
        self._model = model
        self._dimensions = dimensions

    @property
    def dim(self) -> int:
        # An explicit dimensions= override wins over the model's native size.
        if self._dimensions is not None:
            return self._dimensions
        return {"text-embedding-3-small": 1536, "text-embedding-3-large": 3072}[self._model]

    @property
    def fingerprint(self) -> str:
        return f"openai:{self._model}:dim{self.dim}"

    def embed(self, text: str) -> np.ndarray:
        kwargs = {"model": self._model, "input": text}
        if self._dimensions is not None:
            kwargs["dimensions"] = self._dimensions
        resp = self._client.embeddings.create(**kwargs)
        v = np.asarray(resp.data[0].embedding, dtype=np.float32)
        # L2-normalize; leave an all-zero vector untouched.
        n = float(np.linalg.norm(v))
        return v / n if n > 0 else v
```
Async version replaces the client with openai.AsyncOpenAI and embed() becomes async def. See examples/reference_embedders/openai_embedder.py for both.
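As a rough idea of what that async variant looks like, here is a sketch under the assumption that `client` is an `openai.AsyncOpenAI` instance (or anything with an awaitable `embeddings.create`). The class name `AsyncOpenAIEmbedder` is illustrative, and the file in `examples/reference_embedders/` may differ in detail.

```python
import numpy as np


class AsyncOpenAIEmbedder:
    """Async counterpart of OpenAIEmbedder (illustrative sketch)."""

    _NATIVE_DIMS = {"text-embedding-3-small": 1536, "text-embedding-3-large": 3072}

    def __init__(self, client, *, model="text-embedding-3-small", dimensions=None):
        self._client = client  # assumed: openai.AsyncOpenAI or a compatible object
        self._model = model
        self._dimensions = dimensions

    @property
    def dim(self) -> int:
        if self._dimensions is not None:
            return self._dimensions
        return self._NATIVE_DIMS[self._model]

    @property
    def fingerprint(self) -> str:
        # Same format as the sync class, so both can share one cache.
        return f"openai:{self._model}:dim{self.dim}"

    async def embed(self, text: str) -> np.ndarray:
        kwargs = {"model": self._model, "input": text}
        if self._dimensions is not None:
            kwargs["dimensions"] = self._dimensions
        # The only real difference from the sync version: the call is awaited.
        resp = await self._client.embeddings.create(**kwargs)
        v = np.asarray(resp.data[0].embedding, dtype=np.float32)
        n = float(np.linalg.norm(v))
        return v / n if n > 0 else v
```

Keeping the fingerprint identical to the sync class is a deliberate choice here: the vectors come from the same model, so a cache populated by one should be readable by the other.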
sentence-transformers: local CPU/GPU model, no network call. Great for cost-free, offline, or privacy-sensitive deployments.
```python
import numpy as np


class SentenceTransformersEmbedder:
    def __init__(self, model_name="all-MiniLM-L6-v2", *, device=None, normalize=True):
        # Imported lazily so the dependency is only needed when this class is used.
        from sentence_transformers import SentenceTransformer

        self._model_name = model_name
        self._model = SentenceTransformer(model_name, device=device)
        self._normalize = normalize
        self._dim = int(self._model.get_sentence_embedding_dimension())

    @property
    def dim(self) -> int:
        return self._dim

    @property
    def fingerprint(self) -> str:
        return f"sentence-transformers:{self._model_name}:n{int(self._normalize)}:dim{self._dim}"

    def embed(self, text: str) -> np.ndarray:
        v = self._model.encode(text, normalize_embeddings=self._normalize, convert_to_numpy=True)
        return v.astype(np.float32, copy=False)
```
For an async cache, wrap with to_async_embedder (see Async quickstart).
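If the local model also needs serializing (the single-GPU case), a single-thread executor gives you both an async interface and one-at-a-time execution. The sketch below is a hand-rolled alternative, not mneme's `to_async_embedder`, whose signature is not shown here; `SingleThreadAsyncEmbedder` is a hypothetical name.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class SingleThreadAsyncEmbedder:
    """Runs a sync embedder on one dedicated worker thread and exposes async embed()."""

    def __init__(self, inner):
        self._inner = inner
        # max_workers=1: submitted calls queue up and execute strictly one at a time.
        self._executor = ThreadPoolExecutor(max_workers=1)

    @property
    def dim(self):
        return self._inner.dim

    @property
    def fingerprint(self):
        return self._inner.fingerprint

    async def embed(self, text):
        loop = asyncio.get_running_loop()
        # The event loop stays free while the worker thread runs the model.
        return await loop.run_in_executor(self._executor, self._inner.embed, text)

    def close(self):
        """Stop the worker thread once the embedder is no longer needed."""
        self._executor.shutdown(wait=True)
```

Compared with an `asyncio.Semaphore(1)` around an async embedder, this keeps blocking model code off the event loop thread entirely.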
AWS Bedrock: Titan Text Embeddings v2 (256/512/1024 dim) or Cohere Embed (1024 dim). Sync only via boto3.
```python
import json

import numpy as np


class BedrockTitanEmbedder:
    def __init__(self, client, *, model_id="amazon.titan-embed-text-v2:0", dimensions=1024, normalize=True):
        self._client = client
        self._model_id = model_id
        self._dim = dimensions
        self._normalize = normalize

    @property
    def dim(self) -> int:
        return self._dim

    @property
    def fingerprint(self) -> str:
        return f"bedrock:{self._model_id}:dim{self._dim}:n{int(self._normalize)}"

    def embed(self, text: str) -> np.ndarray:
        body = json.dumps({"inputText": text, "dimensions": self._dim, "normalize": self._normalize})
        resp = self._client.invoke_model(modelId=self._model_id, body=body)
        payload = json.loads(resp["body"].read())
        v = np.asarray(payload["embedding"], dtype=np.float32)
        # If Titan was not asked to normalize, do it locally so the cache still sees unit vectors.
        if not self._normalize:
            n = float(np.linalg.norm(v))
            v = v / n if n > 0 else v
        return v
```
See examples/reference_embedders/bedrock_embedder.py for Cohere too.
Ollama: self-hosted local model. Default port 11434.
```python
import numpy as np
import requests


class OllamaEmbedder:
    def __init__(self, model="nomic-embed-text", *, url="http://localhost:11434", dim=768):
        self._model = model
        self._url = url.rstrip("/")
        self._dim = dim

    @property
    def dim(self) -> int:
        return self._dim

    @property
    def fingerprint(self) -> str:
        return f"ollama:{self._model}:dim{self._dim}"

    def embed(self, text: str) -> np.ndarray:
        resp = requests.post(
            f"{self._url}/api/embeddings",
            json={"model": self._model, "prompt": text},
            timeout=30,
        )
        resp.raise_for_status()
        v = np.asarray(resp.json()["embedding"], dtype=np.float32)
        # L2-normalize; leave an all-zero vector untouched.
        n = float(np.linalg.norm(v))
        return v / n if n > 0 else v
```
Async version uses httpx.AsyncClient. See examples/reference_embedders/ollama_embedder.py.
Picking a model dimension
Bigger isn't always better. Higher-dim embeddings:
- Cost more memory in the in-memory matrix (`d × n × 4 bytes` for fp32).
- Cost more memory bandwidth on every `search()` (`d × n × 4 bytes` to read + matvec).
- Are not necessarily more discriminative for short queries.
For chatbot intent classification or paraphrase detection on short queries (10–30 tokens), 384–768 dim usually beats 1536 on cost without losing meaningful accuracy. For long-context semantic search, 1024+ helps.
To cut memory, either use a smaller model (`all-MiniLM-L6-v2` is 384-dim) or apply int8 quantization to a larger one.
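The `d × n × 4 bytes` figure is easy to turn into concrete numbers. The snippet below is plain arithmetic, not a mneme API; `matrix_bytes` is a hypothetical helper and the one-million-entry cache size is only an illustration.

```python
def matrix_bytes(dim: int, n_entries: int, bytes_per_value: int = 4) -> int:
    """Size of a dense (n_entries x dim) matrix: 4 bytes per value for fp32, 1 for int8."""
    return dim * n_entries * bytes_per_value


n = 1_000_000  # illustrative cache size
for d in (384, 768, 1536, 3072):
    fp32_gb = matrix_bytes(d, n) / 1e9
    int8_gb = matrix_bytes(d, n, bytes_per_value=1) / 1e9
    print(f"dim={d:<5} fp32={fp32_gb:6.2f} GB  int8={int8_gb:6.2f} GB")
```

At one million entries, a 1536-dim fp32 matrix is about 6.1 GB against about 1.5 GB for 384-dim, and every full `search()` has to read that much memory.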
What if I get the contract wrong
| Mistake | What happens |
|---|---|
| `embed()` returns a list, not a `numpy.ndarray` | The cache will `np.asarray()` it but slowly. Convert once at the source. |
| Vector dtype is `float64` | Forced down to `float32` on insert. Costs a copy per put/get. Cast at the source. |
| Vector is not L2-normalized | Threshold needs to drift; calibration produces lower thresholds. Normalize. |
| `fingerprint` is the same after a model change | The cache happily mixes vectors from two models. Garbage matches. Always change the fingerprint when the model changes. |
| `dim` doesn't match `embed()`'s output | `cache.put()` raises a shape error. Make `dim` a property of the actual model. |
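Most of these mistakes can be caught before the embedder ever touches a cache. The helper below is a sketch of such a pre-flight check; `check_embedder` is a hypothetical function, not part of mneme, and the norm tolerance is an arbitrary choice.

```python
import numpy as np


def check_embedder(embedder, sample: str = "hello world", *, atol: float = 1e-3) -> list:
    """Return a list of contract problems found; an empty list means none were detected."""
    problems = []
    if not isinstance(embedder.fingerprint, str) or not embedder.fingerprint:
        problems.append("fingerprint should be a non-empty string")
    v = embedder.embed(sample)
    if not isinstance(v, np.ndarray):
        problems.append(f"embed() returned {type(v).__name__}, expected numpy.ndarray")
        v = np.asarray(v)  # keep going so the remaining checks still run
    if v.dtype != np.float32:
        problems.append(f"dtype is {v.dtype}, expected float32")
    if v.shape != (embedder.dim,):
        problems.append(f"shape is {v.shape}, expected ({embedder.dim},)")
    norm = float(np.linalg.norm(v))
    if abs(norm - 1.0) > atol:
        problems.append(f"L2 norm is {norm:.4f}, expected 1.0")
    return problems
```

One row of the table cannot be checked this way: whether the fingerprint changes when the model changes depends on how you construct it, not on any single call.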
Where to go next
- Your first cached LLM - wrapping an actual LLM call.
- Embedders concept page - fingerprints, dim mismatches, the rebuild-on-mismatch path.
- Calibration - finding the right `similarity_threshold` for your embedder.