Building a Memory System That Doesn't Lie to Itself

Engineering · Kai · 39 min read

The Incident

We run 33 AI employees at Knowlyr. One of them, Moyan, works as an executive assistant — filtering information, tracking tasks, coordinating across teams.

In late February, Moyan started writing tasks into her own notes: "design employee offboarding process," "resolve team lead conflict in labeling group," "improve cross-team collaboration between R&D and data." All plausible.

None of these were assigned by anyone.

While reasoning about what the company should focus on, she inferred these tasks, wrote them down, referenced them repeatedly, and gradually convinced herself they were official. Ten days later, her to-do list had 8 fabricated work items — and she fully believed they were real.

An AI created false memories, then was misled by its own false memories for ten days.

That's why we built a serious memory system. Not because "AI should have memory" in some abstract sense, but because we saw firsthand: uncontrolled memory is worse than no memory at all.

What Everyone Else Is Doing

Most AI products handle memory by chunking conversations, converting them to vectors, and stuffing them into a vector database. At retrieval time, they run a similarity search and paste the top results into the prompt.

This is like throwing every Post-it note into a shoebox and grabbing a random handful each time. It works, but you wouldn't call it memory.

Three hard problems with this approach:

  • No quality filter — junk and critical decisions live side by side
  • Similarity ≠ relevance — the most semantically similar memory isn't always the one you need right now
  • No evolution — what goes in stays exactly the same forever: no learning, no forgetting, no consolidation

Two benchmarks worth examining:

ChatGPT's memory is an auto-profile system — it extracts user preferences from conversations and stores structured entries ("user prefers Markdown," "user works on data labeling"). In 2025, security researcher Johann Rehberger showed that users couldn't inspect or delete auto-learned content, and that memories were vulnerable to indirect prompt injection — malicious content written into memory would persist across all future conversations. No quality gate, no transparent forgetting mechanism.

Letta (formerly MemGPT) takes a more interesting approach: the agent manages its own memory like an OS manages RAM — core memory stays in context, recall memory stores conversation history, archival memory holds external knowledge. The agent can call tools to edit its own core memory. The question is: when the agent decides what to remember, who ensures it doesn't remember wrong things? Moyan fabricating 8 tasks is exactly what "agent-managed memory" failure looks like.

We chose a different path: external pipeline management — agents cannot directly modify their own memory. What to remember, what to forget, who can see what — all decided by an independent pipeline, not by the agent itself. Less flexible, but far more controllable.

8,000+ lines of Python, handling 50+ edge cases. Here's how it works.


Writing: Three Gates

You don't remember every face you pass on the street. A good memory system shouldn't either — the key is choosing what to keep.

Our write pipeline runs Reflect → Connect → Store, three steps in series, completed atomically in a single transaction:

def process_memory(raw_text, employee, store=None, skip_reflect=False, **kwargs):
    """Memory pipeline entry point: Reflect -> Connect -> Store.

    - Reflect: LLM extracts structured notes, decides if worth storing
    - Connect: finds related memories, decides merge/link/create
    - Store:   executes database write
    """
    # skip_reflect: caller already holds a structured note — bypass Gate 1
    note = raw_text if skip_reflect else reflect(raw_text, employee)  # Gate 1: worth remembering?
    if note is None:
        return None                       # LLM says no — discard
    result = connect(note, employee, store, **kwargs)  # Gates 2 + 3
    return result.entry

Gate 1: Quality — Editorial Review

Think of this as an editor reviewing submissions. Not everything gets published.

The system scores each piece of information on three dimensions: information density (too short gets rejected), valuable keywords ("lesson," "root cause," "strategy" score higher than "done" or "fixed"), and depth of content. Below 0.6? Rejected.

KEYWORDS_MAP = {  # per-category high-value terms (abbreviated here)
    "correction": ["lesson", "avoid", "root cause"],
    "decision":   ["strategy", "tradeoff", "because"],
}
MIN_LENGTH, MAX_LENGTH = 40, 2000  # illustrative bounds

def check_memory_quality(category, content):
    """Quality check — must score >= 0.6 to pass.

    - Length score (0.3): too short scores zero, too long gets penalized
    - Keyword score (0.4): corrections need words like "lesson/avoid/root cause"
    - Structure score (0.3): must have substantive content, no status updates
    """
    score, issues = 0.0, []
    if MIN_LENGTH <= len(content) <= MAX_LENGTH:
        score += 0.3
    else:
        issues.append("length out of range")
    if any(kw in content for kw in KEYWORDS_MAP.get(category, [])):
        score += 0.4
    if any(word in content for word in ["why", "how", "root cause", "lesson"]):
        score += 0.3
    return {"score": round(score, 2), "passed": score >= 0.6, "issues": issues}

Google's Titans architecture (2025) introduced a "surprise metric" — using surprise to decide whether to update memory. Our quality gate is the same idea, engineering-grade: unsurprising content (repetitive, shallow) gets dropped; only genuinely informative content gets stored.

How strict is this gate? Just days ago, we tried writing a memory entry and got a 400 error. First thought: bug. We checked auth, checked the API, checked the database — went through the whole debugging cycle before realizing: the gate was working correctly. The content was too short, not dense enough. It got rejected. (The error message was being swallowed — that bug we did fix, PR #147.)

We built a gate strict enough to block ourselves. Probably means we're serious.

Gate 2: Deduplication — Don't Store the Same Thing Twice

When new information arrives, the system runs semantic comparison against existing memories. Not text matching — meaning matching. The same lesson phrased completely differently still gets caught.

Above 95% similarity? Duplicate, skip. Between 85% and 95%? Overlap, merge into the existing entry. Below 85%? New knowledge, store it.

This solves a practical problem: AI employees encounter the same lesson in different contexts. Without dedup, the memory store quickly bloats into a junk pile.
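The threshold routing reduces to a few lines. A minimal sketch with a hand-rolled cosine — in practice the comparison runs against embeddings in the vector store, and `dedupe_decision` is an illustrative name, not the system's actual function:

```python
import math

DUP_THRESHOLD, MERGE_THRESHOLD = 0.95, 0.85

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def dedupe_decision(new_vec, existing_vecs):
    """Return ('skip' | 'merge' | 'create', nearest-neighbor index or None)."""
    if not existing_vecs:
        return "create", None
    sims = [cosine(new_vec, v) for v in existing_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] >= DUP_THRESHOLD:
        return "skip", best          # duplicate — drop it
    if sims[best] >= MERGE_THRESHOLD:
        return "merge", best         # overlap — fold into the existing entry
    return "create", None            # genuinely new knowledge
```

Because the comparison is on embeddings, "same lesson, different phrasing" lands above the merge threshold even with zero word overlap.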

Gate 3: Auto-Linking — Building a Knowledge Graph

After storage, the system scans existing memories for semantic relatives (similarity ≥ 0.35) and creates bidirectional links. Each memory can link to up to 20 related entries.

This isn't decorative. When an AI recalls a deployment lesson, it can follow links to: the original decision, the pattern that was later extracted, how someone else handled a similar situation. A self-growing knowledge network with zero manual annotation.
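In code form, the link step is filter, cap, and mirror. A sketch that assumes the similarity scores come back from the vector search; `link_related` is an illustrative name:

```python
LINK_THRESHOLD, MAX_LINKS = 0.35, 20

def link_related(new_id, neighbors):
    """neighbors: (memory_id, similarity) pairs from the vector search.
    Returns the bidirectional edge list for the new entry, capped at 20."""
    related = sorted((n for n in neighbors if n[1] >= LINK_THRESHOLD),
                     key=lambda n: n[1], reverse=True)[:MAX_LINKS]
    edges = []
    for mem_id, _ in related:
        edges.append((new_id, mem_id))   # forward link
        edges.append((mem_id, new_id))   # backward link — bidirectional
    return edges
```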

And this network isn't siloed per employee. We have a shared flag — memories marked as shared are queryable by all 33 AI employees via cross-employee search. A backend engineer's production incident becomes retrievable by the DevOps engineer during the next deployment. The research community calls this "multi-agent memory." We run it in production.

All three steps complete in a single database transaction — all succeed or all roll back. No half-written dirty states. We use PostgreSQL advisory locks to prevent concurrent write conflicts. These engineering details aren't glamorous, but in production every single one is necessary.


Recall: Five Scoring Factors

Storing memories is only half the problem. You need to recall the right ones at the right time.

You've experienced this: Monday standup, you need to report last week's progress. The first thing that comes to mind isn't the most important — it's whatever you did yesterday, because it's freshest. But the thing worth mentioning is probably Wednesday's key decision.

AI memory recall has the same problem. "Most semantically similar" and "most relevant right now" are often different things.

The conventional approach is pure vector search: convert your query to a vector, find the nearest neighbors. This handles roughly 60% of cases. The other 40% — the right memory doesn't surface, or the wrong one does.

We use five scoring factors:

# Five-factor hybrid ranking
# final_score = 0.15 * keyword    (exact keyword match)
#             + 0.40 * cosine     (semantic similarity — primary signal)
#             + 0.15 * q_value    (historical usefulness — was it helpful before?)
#             + 0.15 * importance (importance rating — 1-5 scale)
#             + 0.15 * recency    (time decay — more recent = higher)
score_expr = (
    "0.15 * keyword_norm"
    " + 0.40 * cosine_similarity"
    " + 0.15 * COALESCE(q_value, 0.5)"
    " + 0.15 * (COALESCE(importance, 3) / 5.0)"
)
recency = math.exp(-0.01 * days_since_access)  # half-life ~69 days
final = sql_score + 0.15 * recency

Semantic similarity is the primary signal (40%), but the remaining four together carry the other 60%:

  • Keyword match (15%) — exact term matching for technical vocabulary. Query "pgvector," get "pgvector" — not "Redis" because they're semantically adjacent
  • Historical usefulness (15%) — has this memory been recalled before? Was it helpful? User-validated memories rank higher
  • Importance (15%) — assigned at write time. Critical decisions naturally outrank routine findings
  • Time decay (15%) — last week's lesson is more likely relevant than one from three months ago

We benchmarked against 10 real queries: five-factor hybrid retrieval returned a directly relevant Top-1 result in 8 out of 10 cases. Pure tag filtering returned the same time-sorted top 5 every time — zero targeting.

Four factors compute in a single SQL pass; the fifth is post-processed in Python. Full retrieval runs in milliseconds.
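Put together, the whole formula is small enough to write out. A sketch using the defaults from the SQL above (the COALESCE fallbacks of 0.5 and 3):

```python
import math

def final_score(keyword_norm, cosine_sim, q_value=None,
                importance=None, days_since_access=0.0):
    """Five-factor hybrid score with 0.15/0.40/0.15/0.15/0.15 weighting.
    q_value defaults to 0.5 and importance to 3, mirroring the COALESCEs."""
    recency = math.exp(-0.01 * days_since_access)  # half-life = ln 2 / 0.01 ≈ 69.3 days
    return (0.15 * keyword_norm
            + 0.40 * cosine_sim
            + 0.15 * (0.5 if q_value is None else q_value)
            + 0.15 * ((3 if importance is None else importance) / 5.0)
            + 0.15 * recency)
```

A perfect candidate — exact keyword hit, identical embedding, proven useful, importance 5, touched today — scores exactly 1.0, which keeps scores comparable across queries.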

How Memories Are Used: Four-Section Prompt Injection

Retrieved memories aren't dumped raw into the conversation. They're organized into four semantic sections in the AI's system prompt:

  • Past experience — findings relevant to the current task
  • Recent lessons — the latest corrections, to prevent repeating mistakes
  • High-score examples — historically top-performing task cases, providing benchmarks
  • Reusable patterns — generalizations extracted from scattered observations

All four sections load in parallel via asyncio.gather, capped at 800 tokens total. The AI doesn't receive a pile of raw memory entries — it receives a structured, categorized experience package.
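The parallel load is plain asyncio. A minimal sketch — the real loaders each run a retrieval pass, and the section loader and the crude word-level budget here are stand-ins:

```python
import asyncio

SECTIONS = ["past_experience", "recent_lessons",
            "high_score_examples", "reusable_patterns"]
TOKEN_BUDGET = 800

async def load_section(name, query):
    """Stand-in for one retrieval pass against the memory store."""
    await asyncio.sleep(0)
    return f"## {name}\n(entries relevant to {query!r})"

async def build_memory_context(query):
    """Fetch all four sections concurrently, then enforce the shared budget."""
    parts = await asyncio.gather(*(load_section(s, query) for s in SECTIONS))
    words = "\n\n".join(parts).split()
    return " ".join(words[:TOKEN_BUDGET])   # word-level cap, for the sketch only
```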


Evolution: Living Memory

This is the part we find most interesting, and the fundamental difference from most "memory solutions" on the market.

Most systems are filing cabinets: what goes in comes out unchanged. Our memory is alive — it improves based on feedback, forgets, and consolidates scattered fragments into generalizations.

Tsinghua University's late-2025 survey paper Memory in the Age of AI Agents categorizes agent memory as factual, experiential, and working. Our taxonomy is more granular: decision, finding, correction, pattern, estimate — five categories with different quality standards and lifecycles. This matters in production because a "lesson learned" and a "finding" have fundamentally different value decay curves.

Reinforcement Learning: Memories Know If They Worked

Every memory has a q_value — think of it as a usefulness score. Starts at 0.5, no bias.

After an AI recalls a memory to complete a task, the system tracks feedback: user found it helpful? q_value increases. Used but unhelpful? q_value drops 5%. Next time a similar task comes up, higher q_value memories surface first.

Action → feedback → adjust weights → influence next action. That's reinforcement learning, applied to "which memories are useful." Two 2025 papers (Mem-α and Memory-R1) proposed theoretical frameworks for learning memory construction via RL. Our Q-value + EWMA approach is the same idea, production-grade.
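The update itself fits in a few lines. A sketch assuming an EWMA pull toward 1.0 on helpful feedback and the flat 5% drop described above; `alpha` is an illustrative smoothing rate, not the production value:

```python
def update_q(q_value, helpful, alpha=0.2, penalty=0.95):
    """Feedback-driven q_value update (sketch).

    helpful   -> EWMA step toward 1.0: q' = (1 - alpha) * q + alpha
    unhelpful -> flat 5% drop:         q' = q * 0.95
    """
    if helpful:
        return (1 - alpha) * q_value + alpha * 1.0
    return q_value * penalty
```

Starting from the neutral 0.5, one helpful recall lifts q to 0.6; one unhelpful recall drops it to 0.475.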

The Cold Start Problem

New memories have never been recalled, so their q_value sits at the default 0.5 — permanently ranked behind battle-tested veterans. Good knowledge might never get a chance to surface.

Like hiring someone talented but never staffing them on a project because they lack a track record.

We solve this with Thompson Sampling — a "probation period" for new memories:

import random  # Beta-distribution sampling

def _thompson_rescore(rows):
    """Memories recalled fewer than 5 times get Beta-distribution sampling
    instead of a fixed score.

    Verified-useful new memories sample higher; unverified ones fluctuate.
    Every new memory gets a fair chance to surface.
    """
    for row in rows:
        if row.get("recall_count", 0) < 5:
            sampled_q = random.betavariate(
                row.get("verified_count", 0) + 1,
                max(row.get("recall_count", 0) - row.get("verified_count", 0), 0) + 1,
            )
            # Swap the fixed q-value term (0.15 weight) for the sampled one
            row["score"] = row["base_score"] + 0.15 * sampled_q
    return rows

Memories recalled 5+ times have stable scores — they rank on merit. Under 5 recalls, the score fluctuates randomly each retrieval — good new knowledge will eventually surface. This is the statistically optimal explore-exploit tradeoff, with proven convergence.
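To see the probation effect, compare an unverified newcomer with a verified one — same Beta(verified + 1, unverified + 1) shape as above, in a standalone sketch:

```python
import random

def sample_q(verified_count, recall_count):
    """One Thompson draw for a young memory."""
    return random.betavariate(verified_count + 1,
                              max(recall_count - verified_count, 0) + 1)

random.seed(0)
fresh = [sample_q(0, 0) for _ in range(5)]   # Beta(1, 1): uniform — wide swings
proven = [sample_q(3, 3) for _ in range(5)]  # Beta(4, 1): concentrated near 1.0
```

The unverified entry draws uniformly on [0, 1], so it occasionally jumps ahead of incumbents; three verified-helpful recalls out of three shift its distribution sharply toward the top.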

Forgetting: A Feature, Not a Bug

Humans forget. AI should too. But not randomly.

We designed a three-phase forgetting curve inspired by Ebbinghaus:

def decay_stale_memories(memories, mild_days=30, strong_days=90,
                         mild_factor=0.98, strong_factor=0.95, floor=0.1):
    """SAGE forgetting curve:
    - 30 days unrecalled: mild decay, daily × 0.98
    - 90 days unrecalled: accelerated decay, daily × 0.95
    - Never below 0.1 — total forgetting is not allowed
    """
    for mem in memories:            # rows from the daily stale-memory scan
        if mem["days_idle"] >= strong_days:
            mem["weight"] = max(mem["weight"] * strong_factor, floor)
        elif mem["days_idle"] >= mild_days:
            mem["weight"] = max(mem["weight"] * mild_factor, floor)

A memory unused for six months has very low weight, but never reaches zero. If it suddenly becomes relevant again? Still there.
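A quick check of the six-month claim, simulating the daily schedule above (mild decay after day 30, strong after day 90, clamped at the floor):

```python
weight = 1.0
for day in range(1, 181):            # 180 idle days ≈ six months
    if day > 90:
        weight *= 0.95               # strong decay phase
    elif day > 30:
        weight *= 0.98               # mild decay phase
    weight = max(weight, 0.1)        # the floor: never fully forgotten
```

Without the floor the weight would land near 0.003; with it, the memory parks at 0.1 — dormant, but recoverable.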

Human brains work the same way — that "oh, I just remembered!" experience requires that the memory sank to the bottom but didn't disappear.

From Fragments to Patterns: 14 Findings → 2 Patterns

Over time, the memory store accumulates scattered observations. "Project A hit issue X during deployment." "Project B also encountered X." "Project C solved X with approach Y." Three fragments, one underlying insight.

The consolidation module automatically finds same-topic fragments, clusters them, and uses an LLM to synthesize a general pattern:

def find_clusters(employee, store, min_cluster_size=3):
    """Keyword overlap >= 0.4 groups into same cluster (Union-Find).
    Only processes clusters with >= 3 entries — too few to generalize from.
    """
    # findings, candidate_pairs, union, keyword_overlap: Union-Find plumbing,
    # elided from this excerpt
    for i, j in candidate_pairs:
        overlap = keyword_overlap(findings[i].keywords, findings[j].keywords)
        if overlap >= 0.4:
            union(i, j)

We ran this for real: 14 scattered findings in Moyan's memory store consolidated into 2 reusable patterns. Original fragments weren't deleted — they're marked "superseded" and linked to the new synthesized memory. Knowledge doesn't disappear; it levels up.

The full evolution chain: write → quality filter → dedup & link → feedback-driven reweighting → time decay → fragment consolidation. Fully automated.


Security: Our Own Lessons

A memory system stores the most sensitive knowledge assets in an organization — decision rationale, technical details, internal discussions. Security isn't a feature checkbox; it's the foundation.

Hard multi-tenancy. Every operation carries a tenant_id — database writes, cache keys, decay jobs, knowledge consolidation. Tenant A's memories aren't just access-controlled from Tenant B; they're unsearchable.

Four-tier classification following ISO 27001: public / internal / restricted / confidential. Queries can set a ceiling — "this context only sees internal and below."
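A sketch of the read-side filter — hard tenant scoping first, then the classification ceiling. Field and function names here are illustrative, not the production schema:

```python
from enum import IntEnum

class Level(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2
    CONFIDENTIAL = 3

def visible_memories(memories, tenant_id, ceiling=Level.INTERNAL):
    """Tenant isolation is unconditional; the ceiling is set per query."""
    return [m for m in memories
            if m["tenant_id"] == tenant_id      # other tenants: unsearchable
            and m["level"] <= ceiling]          # "internal and below" by default
```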

Here's a lesson from our own experience. Early March, we finished building the classification feature. Tests passed. Ready to ship. Before going live, we ran shadow validation — replaying real queries through the new logic in the background and comparing results.

Cold sweat: the classification filter wasn't working at all.

The reason was embarrassing but real: the filtering code was in the new storage engine, but production was still running the old one. The security feature was written but never wired up. If we'd skipped shadow validation and gone live, external users could have seen all internal-grade memories.

New rule since then: any change involving external permissions follows close → modify → validate → reopen. No shadow validation, no external access.

Compare this with ChatGPT's current state: Rehberger's 2025 findings showed that auto-memories can be poisoned via prompt injection, and users can neither see nor delete the contaminated entries. In our system, every memory has a full audit chain — who created it, when it was modified, what superseded it. The superseded_by field makes knowledge evolution fully traceable. Quality gates block bad content at write time; audit chains provide traceability at read time. Both ends are covered.

Other security mechanisms:

  • Domain scoping — a backend engineer's debugging experience doesn't appear in a frontend engineer's context, unless explicitly marked public
  • Temporal validity — each memory can have valid_from and valid_until dates. Expired memories automatically stop participating in retrieval. "Q1 pricing strategy" won't mislead decisions in Q2

Back to Moyan

After the fabricated tasks incident, we shipped the quality gate. Now every one of her memories passes through three checkpoints. Days ago she tried to store a note that was too brief — the system blocked it.

She currently has 466 verified memories. 14 scattered findings have been automatically consolidated into 2 reusable patterns. Her memory store forgets unimportant things, upgrades fragment knowledge into generalizations, and adjusts which memories to surface based on whether they were helpful last time. These memories aren't locked to her — patterns marked as shared are searchable by all 33 colleagues.

Honestly, the hardest part of building a memory system isn't the technology. Vector retrieval, reinforcement learning, decay curves — these all have papers to reference. The real challenge is restraint: deciding what not to remember, when to forget, who can see what. Letta chose to let agents manage their own memory; we chose an external pipeline — because Moyan fabricating tasks taught us that agent judgment isn't trustworthy enough yet. These tradeoffs are product decisions, not algorithms.

We're still iterating. The classification feature that almost shipped broken taught us to always shadow-validate. The quality gate that blocked our own write confirmed we're pointed in the right direction.

A memory system worth trusting has to be strict with itself first.

← Back to Frontier Insights