Building a Production-Grade AI Agent Memory System: Architecture, Tradeoffs, and What I Learned
AI agents are stateless by default. Every conversation starts fresh. For anything beyond a single session - ongoing projects, evolving user preferences, long research threads - this is a fundamental limitation, not a UX problem.
The naive fix is to prepend conversation history to every prompt. This works until the history grows longer than your context window, at which point you start truncating. Truncation loses information. Growing the context window instead drives up cost (attention compute scales quadratically with sequence length). Neither approach scales.
What you actually need is a retrieval system - something that stores all historical information and fetches only what's relevant to the current query. This sounds like a vector database problem. It mostly isn't.
This post is about MemoryOS: what I built, why the architecture looks the way it does, and what the hard problems turned out to be.
Why vector databases alone fail for conversational memory
Vector databases are built for document retrieval. You embed a chunk of text, store it, and at query time you find the most semantically similar chunks via approximate nearest neighbor search. This works well when your chunks are self-contained - a paragraph from a research paper, a product description, a FAQ entry.
Conversations are not self-contained. They have three properties that break the standard retrieval model:
1. Anaphoric references. In a standard embedding model, "I moved last month" and "Alice Chen moved to Seattle in January 2024" are very different vectors. But they encode the same fact. If you chunk a conversation at sentence boundaries, roughly 40% of chunks contain pronouns or implicit references that only make sense in context. Those chunks become semantically invisible - no query will retrieve them for the right question because their embedding doesn't capture who "I" is.
2. Temporal state. Facts about users change. "Alice lives in Boston" was true in 2022; she moved to Seattle in 2024. A flat vector store has no mechanism to distinguish these. Both embeddings live in the index. The retrieval system returns both with roughly equal probability. The agent confidently reports a stale fact.
3. Entity ownership. "What's the user's address?" is semantically similar to any chunk that contains an address - including a different user's address. Vector similarity is content-based, not relationship-based. It doesn't understand that facts belong to specific entities.
These three failures all stem from the same root cause: embedding models encode surface semantics, not relational structure. To fix them properly, you need a layer that tracks entities, their attributes, and how those attributes change over time.
Architecture overview
MemoryOS consists of three subsystems, all running against a single PostgreSQL instance:
┌─────────────────────────────────────────────────────────┐
│ Ingestion Pipeline                                      │
│ raw text → entity extraction → sliding window           │
│ enrichment → dual embedding → triple extraction →       │
│ append-only knowledge graph                             │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│ PostgreSQL 16                                           │
│ chunks (vcontent, vlatent, vsparse) + pgvector HNSW     │
│ entities table + graph_edges (append-only)              │
│ memory_events (audit log)                               │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│ Retrieval Engine                                        │
│ query expansion → parallel ANN + entity extraction      │
│ → hybrid score → Cohere rerank → graph boost            │
└─────────────────────────────────────────────────────────┘
The key design decision was putting everything in Postgres. Most production RAG setups run a vector database (Pinecone, Weaviate, Qdrant) alongside a relational database for metadata and a graph database (Neo4j, Neptune) for relationships. That's three systems to operate, three consistency models to reason about, three failure modes. Postgres with the pgvector extension handles all three, and the operational simplicity is worth the performance tradeoff at the scale I care about.
Ingestion pipeline
Sliding window enrichment
The core ingestion technique is what I call sliding window enrichment. For each chunk C_i in a sequence:
context_window = chunks[max(0, i - 3) : i + 1]  # the three preceding chunks plus C_i itself
enriched_C_i = LLM(C_i | context_window)        # LLM rewrites C_i conditioned on the window
The LLM rewrites the chunk to be self-contained: pronouns are resolved to their referents, implicit references are expanded, temporal context is made explicit. "I moved last month" becomes "Alice Chen (software engineer) moved from Boston to Seattle in January 2024."
The enriched version gets its own embedding (vlatent), stored separately from the raw embedding (vcontent). At retrieval time, both vectors are searched. vlatent carries more weight in the hybrid score (0.40 vs 0.30) because it resolves the anaphora problem - queries about Alice will now find chunks that originally contained only pronouns.
The cost: one LLM call per chunk at ingestion. For a 550-message conversation history, that's 550 API calls. This is the primary reason infer=OFF exists as a mode - for bulk loading, you skip enrichment and accept lower recall quality on pronoun-heavy queries.
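As a rough sketch, the enrichment call looks something like this (the prompt wording, model name, and OpenAI-style client are illustrative, not the production setup):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative

ENRICH_PROMPT = (
    "Rewrite the last chunk so it is fully self-contained: resolve pronouns "
    "to named entities, expand implicit references, make dates absolute.\n\n"
    "Context:\n{context}\n\nChunk to rewrite:\n{chunk}"
)

def enrich(chunks: list[str], i: int) -> str:
    """Rewrite chunks[i] using the three preceding chunks as context."""
    context = "\n".join(chunks[max(0, i - 3) : i])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": ENRICH_PROMPT.format(context=context, chunk=chunks[i])}],
    )
    return resp.choices[0].message.content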
Triple extraction and the knowledge graph
Alongside enrichment, a second LLM call extracts relationship triples from each chunk:
(Alice Chen, lives_in, Seattle)
(Alice Chen, works_at, TechCo)
(Alice Chen, reports_to, Bob Martinez)
These triples are written to a graph_edges table with a critical constraint: it is effectively append-only. Edge rows are never deleted and facts are never overwritten; the only mutation allowed is closing an edge's validity interval. When a fact changes, the old edge gets its tvalid_end set to now() and a new edge is inserted. The history is preserved.
-- When Alice moves to Seattle:
UPDATE graph_edges
SET tvalid_end = now()
WHERE subject_id = alice_id AND predicate = 'lives_in' AND tvalid_end IS NULL;
INSERT INTO graph_edges (subject_id, predicate, object_literal, tvalid_start)
VALUES (alice_id, 'lives_in', 'Seattle', now());
This design enables temporal queries. "Where did Alice live in 2022?" becomes a range query on tvalid_start and tvalid_end. The vector store can't answer this question; the knowledge graph can.
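A minimal sketch of that range query from Python, assuming psycopg 3 as the driver and the schema above (the helper name is mine):

import psycopg

def lives_in_during(conn: psycopg.Connection, subject_id: int, year: int) -> list[str]:
    """Return every 'lives_in' value whose validity interval overlaps the given year."""
    sql = """
        SELECT object_literal
        FROM graph_edges
        WHERE subject_id = %(sid)s
          AND predicate = 'lives_in'
          AND tvalid_start <= %(year_end)s
          AND (tvalid_end IS NULL OR tvalid_end >= %(year_start)s)
    """
    with conn.cursor() as cur:
        cur.execute(sql, {"sid": subject_id,
                          "year_start": f"{year}-01-01",
                          "year_end": f"{year}-12-31"})
        return [row[0] for row in cur.fetchall()]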
Three-vector substrate
Each chunk is stored with three representations:
- vcontent: Dense embedding of the raw text. 384-dimensional, all-MiniLM-L6-v2, run locally.
- vlatent: Dense embedding of the enriched text. Same model and dimension.
- vsparse: BM25 term weights stored as JSONB. Handles exact keyword matching for proper nouns and technical terms that embedding models tend to conflate with related concepts.
All three are indexed. vcontent and vlatent use pgvector HNSW:
CREATE INDEX chunks_vcontent_hnsw
ON chunks USING hnsw (vcontent_vec vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
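For reference, querying that index from Python might look like the following sketch (psycopg 3 plus pgvector's adapter assumed; column names from the diagram):

from pgvector.psycopg import register_vector
import psycopg

conn = psycopg.connect("dbname=memoryos")  # connection string is illustrative
register_vector(conn)

def ann_top_k(query_vec, k: int = 20):
    """HNSW approximate nearest neighbors on the raw-text embedding (<=> is cosine distance)."""
    return conn.execute(
        "SELECT id, vcontent_vec <=> %s AS dist FROM chunks ORDER BY dist LIMIT %s",
        (query_vec, k),
    ).fetchall()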
Retrieval pipeline
Hybrid scoring
A query generates a single hybrid score per candidate chunk:
score = α · cosine(q, vcontent)    -- α = 0.30
      + β · cosine(q, vlatent)     -- β = 0.40
      + γ · bm25(q, vsparse)       -- γ = 0.15
      + δ · graph_proximity(q)     -- δ = 0.15
      × decay_score                -- Ebbinghaus forgetting curve
vlatent gets the highest weight because the enriched embedding most reliably captures what a chunk is about. BM25 handles the cases where embedding similarity fails - proper nouns, product names, acronyms. Graph proximity boosts chunks connected to entities mentioned in the query.
The weights were chosen empirically on a small validation set. A better approach would be to learn them per-tenant using retrieval feedback, which is on the roadmap.
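In code, the scoring step is a straightforward weighted sum. A sketch with the weights above (the cosine helper and the chunk fields are assumptions about the internals):

import numpy as np

WEIGHTS = {"content": 0.30, "latent": 0.40, "sparse": 0.15, "graph": 0.15}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_score(q_vec, chunk, bm25_score, graph_score, decay_score) -> float:
    """Weighted sum of the four signals, attenuated by the Ebbinghaus decay term."""
    s = (WEIGHTS["content"] * cosine(q_vec, chunk.vcontent)
         + WEIGHTS["latent"] * cosine(q_vec, chunk.vlatent)
         + WEIGHTS["sparse"] * bm25_score
         + WEIGHTS["graph"] * graph_score)
    return s * decay_score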
Graph-augmented retrieval
When a query arrives, we extract entities from it using spaCy (local, no LLM call). Those entities are looked up in the entity registry. We then do a BFS traversal of the knowledge graph up to depth 3:
# Pseudocode
seed_ids = resolve_entities(query_text)   # spaCy NER → entity registry lookup
frontier, visited = set(seed_ids), set(seed_ids)
chunk_ids, graph_scores = set(), {}

for depth in range(1, 4):
    edges = db.query(
        graph_edges,
        where=(subject_id.in_(frontier) | object_id.in_(frontier))
              & tvalid_end.is_(None),     # currently active facts only
    )
    for e in edges:
        chunk_ids.add(e.chunk_id)
        # chunks closer to a seed entity get a higher proximity score
        graph_scores[e.chunk_id] = max(graph_scores.get(e.chunk_id, 0.0), 1.0 / depth)
    neighbors = {e.subject_id for e in edges} | {e.object_id for e in edges}
    frontier = neighbors - visited        # expand only to newly discovered entities
    visited |= frontier
Chunks retrieved via graph traversal are merged with chunks retrieved via vector ANN. The graph proximity score contributes to the hybrid score above.
This is the piece that addresses entity ownership. A query about "Alice's address" finds chunks connected to the Alice entity specifically, not just any chunk that contains address-like text.
Reranking
The top-20 candidates by hybrid score go through Cohere's rerank-v3.5 API. The reranker uses a cross-encoder architecture - it jointly encodes the query and each document rather than encoding them independently. This is more expensive than embedding-based scoring but more accurate for subtle relevance distinctions.
Why Cohere instead of a local cross-encoder? The local model (ms-marco-MiniLM-L-6-v2) takes ~500ms per pair on CPU, so ranking 20 documents takes ~10 seconds. Cohere's API processes all 20 in ~450ms. The latency argument completely dominates the cost argument at this scale.
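The call itself is small. A sketch assuming Cohere's Python SDK (v2 client; parameter names per their documented rerank endpoint):

import cohere

co = cohere.ClientV2()  # reads the API key from the environment

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Cross-encode the query against each candidate; return the top `keep`."""
    resp = co.rerank(model="rerank-v3.5", query=query,
                     documents=candidates, top_n=keep)
    return [candidates[r.index] for r in resp.results]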
The Ebbinghaus decay engine
Memory should fade. If a user mentioned their address three years ago and has since moved, the old chunk should be de-prioritized without being deleted.
I implemented a decay function based on the Ebbinghaus forgetting curve:
import math

BASE_STABILITY = 30   # days
REINFORCEMENT = 0.5   # each retrieval adds stability

def retention(chunk, now):
    t = (now - chunk.last_retrieved).days
    S = BASE_STABILITY * (1 + REINFORCEMENT * chunk.retrieval_count)
    return math.exp(-t / S)
decay_score is multiplied into the hybrid score at retrieval time. A chunk that hasn't been accessed in 90 days scores roughly e^(-3) ≈ 0.05 of its original weight at base stability (S = 30 days). When retention drops below 0.1, the chunk is archived - excluded from default retrieval but preserved in the append-only store and queryable with include_archived=true.
This addresses a practical problem: as a user's history grows, retrieval quality degrades because there are more chunks competing for the same query. Decay naturally prunes stale information without requiring explicit deletion logic.
The reinforcement term means that frequently-retrieved memories resist forgetting - the system learns which facts matter to the user based on actual usage patterns.
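Tying it together, a sketch of how retention folds into retrieval-time scoring and the archive threshold (names are illustrative, reusing retention() from above):

from datetime import datetime, timezone

ARCHIVE_THRESHOLD = 0.1

def effective_score(chunk, hybrid: float, now=None) -> float:
    """Attenuate the hybrid score by retention; archive chunks that fall below threshold."""
    now = now or datetime.now(timezone.utc)
    r = retention(chunk, now)
    if r < ARCHIVE_THRESHOLD:
        chunk.archived = True  # excluded from default retrieval, kept in the append-only store
    return hybrid * r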
What I got wrong
Enrichment blocks ingest. Running LLM enrichment synchronously in the ingestion path was the wrong default. For a 550-message history, it means waiting 30+ minutes before the system is queryable. The right architecture is async enrichment: store raw embeddings immediately, enrich in the background. The raw embedding is good enough for most queries; enrichment improves recall for the pronoun-heavy subset.
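The shape of the fix, as a sketch (all function bodies here are placeholders for the real pipeline stages):

import asyncio

def embed(message: str) -> list[float]:
    return [0.0] * 384          # placeholder for the local MiniLM embedding

async def store_chunk(message: str, vec: list[float]) -> int:
    return 1                    # placeholder: insert chunk + vcontent, return id

async def enrich_and_update(chunk_id: int) -> None:
    pass                        # placeholder: LLM enrichment, write vlatent

async def ingest(message: str) -> None:
    """Store the raw embedding synchronously; backfill enrichment in the background."""
    chunk_id = await store_chunk(message, embed(message))  # queryable immediately
    asyncio.create_task(enrich_and_update(chunk_id))       # fills vlatent later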
Query expansion has asymmetric cost. I implemented LLM-based query expansion (generating N paraphrases per query) because it improves recall on multi-hop reasoning questions. But it added ~9 seconds of latency to every query. The solution was rule-based expansion as the default - pronoun normalization ("I prefer" → "the user prefers"), keyword extraction, tense variants for temporal queries. This handles 85% of cases at near-zero latency. LLM expansion is now an opt-in mode for complex queries.
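A toy version of the rule-based direction (the production rules are richer; these patterns are illustrative):

import re

PRONOUN_RULES = [
    (r"\bI\b", "the user"),
    (r"\bmy\b", "the user's"),
    (r"\bme\b", "the user"),
]

def expand_query(q: str) -> list[str]:
    """Return the raw query plus cheap rule-based variants (pronoun normalization)."""
    normalized = q
    for pattern, replacement in PRONOUN_RULES:
        normalized = re.sub(pattern, replacement, normalized)
    return sorted({q, normalized})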
Python cosine similarity is embarrassingly slow. My first implementation computed cosine similarity in a Python loop over 200 candidates × 1536 dimensions. Replacing this with a pgvector HNSW ANN search dropped that step from 217ms to 20ms. The lesson: profile before optimizing, but also recognize when you're reinventing a solved problem.
Performance (on a 48-core Azure VM, no GPU)
| Metric | Value |
|---|---|
| Ingest time (raw, batch) | 9ms/msg |
| Ingest time (enriched) | 5–7s/msg (3 parallel LLM calls) |
| Fast query latency (warm, p50) | 79ms |
| Thinking query latency (warm, p50) | 470ms |
| Thinking query latency (warm, p95) | 640ms |
| LongMemEval-s single-session recall | 82% correct (100 questions tested) |
The fast query (79ms) breaks down as: entity extraction ~0ms (spaCy, parallel with embedding), embedding ~14ms (local model), ANN ~20ms (pgvector HNSW), graph traversal ~0ms (simple SQL), hybrid scoring ~5ms (numpy), reranking skipped. Total: ~40ms server-side, ~79ms wall clock including HTTP overhead.
The thinking query (470ms) adds Cohere reranking (~450ms) and rule-based query expansion (~0ms). The dominant cost is the reranker API round-trip.
What the end user actually gets
MemoryOS is not a standalone product. It's a layer in a larger stack:
User query
  → MemoryOS.query()  [<100ms]
  → top-k relevant memories returned
  → injected into LLM prompt as context
  → LLM generates structured response
  → user sees coherent answer grounded in their history
The value MemoryOS adds is that the LLM's context window always contains the right memories - not necessarily the most recent ones, not necessarily all of them, but the ones most relevant to the current query given everything the system knows about the user's history. The LLM still does the reasoning, language generation, and answer structuring. MemoryOS does the retrieval and temporal reasoning.
This separation of concerns is deliberate. Retrieval and generation have different optimization targets and different failure modes. Conflating them in a single system makes both harder to debug.
What's next
The multi-session reasoning category on LongMemEval-s is where current systems struggle most - including ours. Questions like "how has the user's opinion changed over time?" require synthesizing facts across multiple sessions, tracking entity evolution, and reasoning about temporal sequences. The knowledge graph has the data to answer these questions; the retrieval system isn't yet good enough at surfacing the right cross-session evidence.
The session summarization system (which generates structured summaries when sessions close and stores them as elevated-priority chunks) is the most promising path. It's implemented but not validated on the full benchmark yet.
Code is at github.com/Per0x1de-1337/MemoryOS.