mneme

A layered semantic cache for LLM applications.

mneme (Greek: μνήμη, "memory"; pronounced NEE-mee) is an embeddable, in-process Python library that caches LLM completions across paraphrased queries. It pairs an exact-match layer (normalized query hash) with a semantic-match layer (cosine similarity over L2-normalized embeddings) and persists durably to a single SQLite file by default.

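```python
from mneme import SemanticCache

# my_embedder and call_my_llm stand in for your own embedder and LLM client.
with SemanticCache(path="cache.db", embedder=my_embedder) as cache:
    query = "How do I reset my password?"
    hit = cache.get(query)
    if hit is None:
        # Miss: call the model, then store the answer so later paraphrases can hit.
        response = call_my_llm(query)
        cache.put(query, response)
    else:
        # Hit: reuse the cached completion instead of calling the model.
        response = hit.response
```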
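The exact-match layer described in the introduction keys entries by a hash of the normalized query, so trivially different spellings of the same question collapse to one key. The sketch below illustrates that idea in plain Python. `normalize_query`, `exact_key`, and `ExactLayer` are hypothetical names, and the normalization rules (Unicode NFKC, case folding, whitespace collapsing, stripping trailing punctuation) are assumptions, not mneme's documented behavior.

```python
from __future__ import annotations

import hashlib
import re
import unicodedata


def normalize_query(query: str) -> str:
    """Hypothetical normalization: NFKC, casefold, collapse whitespace, drop trailing ?!."""
    text = unicodedata.normalize("NFKC", query).casefold()
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip("?!.")


def exact_key(query: str) -> str:
    # Hash the normalized text so keys have a fixed size regardless of query length.
    return hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()


class ExactLayer:
    """Exact-match layer: a plain dict keyed by the normalized-query hash."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def get(self, query: str) -> str | None:
        return self._entries.get(exact_key(query))

    def put(self, query: str, response: str) -> None:
        self._entries[exact_key(query)] = response


layer = ExactLayer()
layer.put("How do I reset my password?", "Use the 'Forgot password' link.")
print(layer.get("  how do I reset my   password"))  # → Use the 'Forgot password' link.
print(layer.get("I forgot my password"))            # → None (a true paraphrase)
```

A genuine paraphrase still misses at this layer; catching it is the semantic layer's job.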

Why mneme

  • Cache before you call. A semantic cache turns redundant LLM calls into a microsecond dict lookup (exact layer) or a millisecond NumPy matrix-vector product (semantic layer). For chatbots, agent loops, and batch scoring jobs, that can be the difference between a viable product and one that burns tokens on every paraphrase.
  • One required dependency. NumPy is the only hard requirement; optional extras add hnswlib, redis, psycopg, boto3, prometheus, and opentelemetry. Bring your own embedder, your own LLM client, your own server.
  • In-process, no daemon. A library you import, not a service you operate. Persists to a single SQLite file by default; swap in Redis / Postgres / DynamoDB when you need shared state across hosts.
  • Strict typing, zero magic. The public surface is a small set of frozen @dataclasses and Protocols, and the package ships a py.typed marker.
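The "millisecond NumPy matvec" above is the semantic layer from the introduction. Once every stored embedding is L2-normalized, cosine similarity reduces to a dot product, so a whole lookup is one matrix-vector product followed by an argmax and a threshold check. The sketch below is a brute-force illustration under stated assumptions: `SemanticLayer` and `l2_normalize` are hypothetical names, not mneme's API, and the 0.9 threshold is an arbitrary placeholder (real thresholds depend on the embedder).

```python
from __future__ import annotations

import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    # Scale to unit Euclidean length; for unit vectors, dot product == cosine similarity.
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norm, 1e-12)  # guard against division by zero


class SemanticLayer:
    """Brute-force semantic layer: each lookup is one matrix-vector product."""

    def __init__(self, dim: int, threshold: float = 0.9) -> None:
        self.threshold = threshold  # hypothetical default; calibrate per embedder
        self._matrix = np.empty((0, dim), dtype=np.float32)  # one normalized row per entry
        self._responses: list[str] = []

    def put(self, embedding: np.ndarray, response: str) -> None:
        row = l2_normalize(np.asarray(embedding, dtype=np.float32))
        self._matrix = np.vstack([self._matrix, row[None, :]])
        self._responses.append(response)

    def get(self, embedding: np.ndarray) -> tuple[str, float] | None:
        if not self._responses:
            return None
        query = l2_normalize(np.asarray(embedding, dtype=np.float32))
        scores = self._matrix @ query  # (n, dim) @ (dim,) -> (n,) cosine similarities
        best = int(np.argmax(scores))
        score = float(scores[best])
        return (self._responses[best], score) if score >= self.threshold else None


layer = SemanticLayer(dim=3)
layer.put(np.array([1.0, 0.0, 0.0]), "password-reset answer")
layer.put(np.array([0.0, 1.0, 0.0]), "billing answer")
print(layer.get(np.array([0.95, 0.1, 0.0])))  # near the first entry: a hit
print(layer.get(np.array([0.0, 0.0, 1.0])))   # orthogonal to both entries: None
```

Appending with `np.vstack` copies the matrix on every `put`, which keeps the sketch short but costs O(n) per insert; a production version would preallocate or grow the buffer geometrically, and very large caches would swap the full scan for an approximate index.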

Pick your path

  • Get started. Sync and async quickstarts, writing your first cached LLM call, and choosing an embedder. See Getting started.
  • Understand the moving parts. The two-layer cache, embedders, quantization, multi-process modes, and multi-tenancy. See Concepts.
  • Pick a store. Memory, SQLite, Redis, Postgres, and DynamoDB: five backends behind the same Protocol. See Stores.
  • API reference. Every public class, method, type, and exception, autogenerated from the source. See Reference.
  • How-to guides. Calibration, checkpoints, re-embed migrations, metrics, custom stores, and performance tuning. See Guides.
  • See it live. A Flask showcase that classifies real customer-support messages with Nemotron on a local DGX Spark. See Showcase.
  • Beyond LLM caching. Five other patterns the same machinery covers: RAG retrieval, translation, deduplication, classification, and agent memory. See Use cases.
  • How is this different? Where mneme makes deliberately different choices than other semantic-cache libraries, and the design philosophies behind them. See Differentiation.
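One storage Protocol shared by several backends means the cache logic never needs to know where entries live. The sketch below shows that pattern with a frozen dataclass for records and `typing.Protocol` for the interface, backed by a dict and by the standard-library `sqlite3` module. `Entry`, `Store`, `MemoryStore`, and `SQLiteStore` are hypothetical; mneme's real store interface is not shown on this page and is presumably richer.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Entry:
    """Hypothetical immutable cache record (not mneme's actual type)."""
    key: str
    response: str


class Store(Protocol):
    """Hypothetical minimal storage protocol; any class with these methods conforms."""

    def get(self, key: str) -> Entry | None: ...

    def put(self, entry: Entry) -> None: ...


class MemoryStore:
    def __init__(self) -> None:
        self._data: dict[str, Entry] = {}

    def get(self, key: str) -> Entry | None:
        return self._data.get(key)

    def put(self, entry: Entry) -> None:
        self._data[entry.key] = entry


class SQLiteStore:
    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps the demo self-contained; pass a file path such as "cache.db" to persist.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get(self, key: str) -> Entry | None:
        row = self._conn.execute(
            "SELECT key, response FROM entries WHERE key = ?", (key,)
        ).fetchone()
        return Entry(*row) if row else None

    def put(self, entry: Entry) -> None:
        # Using the connection as a context manager commits on success, rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO entries (key, response) VALUES (?, ?)",
                (entry.key, entry.response),
            )


def roundtrip(store: Store) -> Entry | None:
    # Written against the Protocol, so it works unchanged with either backend.
    store.put(Entry(key="k1", response="cached answer"))
    return store.get("k1")


for store in (MemoryStore(), SQLiteStore()):
    print(type(store).__name__, roundtrip(store))
```

Because `Store` is a structural Protocol, neither backend inherits from it; a type checker accepts any class whose methods match.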

Status

v1.0, released 2026. The public surface in mneme/__init__.py is locked; future minor versions are additive.

License

Apache 2.0.