Memory as Substrate — CollectHive

Why this exists

If you’re reading this, it’s probably because I’ve asked whether you can help me think about agent memory. This is a calibration instrument: rather than me pitching in real time and you reverse-engineering my level of understanding, I’ve written down what I currently know, what I want to build, and what I’m stuck on.

If after reading this you think I’ve got a blind spot, tell me. If the vision looks misdirected, tell me. If you can help with a specific piece, tell me that too.

What I’ve built

I run a memory system called Memory MCP — a Convex-backed store with vector search and an LLM reranker (Gemini 3 Flash) — used by two agent stacks I operate: Agent Amber (a Claude-based production system) and Vanilla (a GitHub Copilot-based development hub). The same memory substrate serves both.

The system has layered scopes: a _global scope for universal patterns, domain-level scopes, per-project scopes, and “seeding bank” scopes that act as staging grounds for promotion between layers. A manual promotion workflow moves a memory toward _global once it has roughly 20 or more recalls across multiple projects, provided it has universal applicability.
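As a sketch only, the promotion rule could be written as a small eligibility check. The names here (MemoryUsage, eligible_for_global) are hypothetical, "multiple projects" is read as at least two, and the applicability flag is assumed to be set by a human reviewer rather than inferred.

```python
from dataclasses import dataclass, field

# Assumed thresholds; the text gives about 20 recalls across multiple projects.
MIN_RECALLS = 20
MIN_PROJECTS = 2  # "multiple" read as at least two (an assumption)


@dataclass
class MemoryUsage:  # hypothetical record shape, not the real schema
    memory_id: str
    recalls_by_project: dict[str, int] = field(default_factory=dict)
    universal: bool = False  # set by a human reviewer, not inferred


def eligible_for_global(m: MemoryUsage) -> bool:
    """True when a memory meets the recall, spread, and applicability bar."""
    total = sum(m.recalls_by_project.values())
    projects = sum(1 for n in m.recalls_by_project.values() if n > 0)
    return m.universal and total >= MIN_RECALLS and projects >= MIN_PROJECTS


if __name__ == "__main__":
    m = MemoryUsage("mem-1", {"proj-a": 14, "proj-b": 9}, universal=True)
    print(eligible_for_global(m))
```

A check like this only nominates candidates. The judgement about universal applicability stays outside the function.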

Retrieval is scored on vector similarity (primary signal), plus recency, confidence decay, type and tier multipliers, and a small usage component. An LLM reranker sits on top of the vector stage and scores candidates 0–10 for relevance, discarding the injection entirely if the best score is under 5.
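To make the shape of that pipeline concrete, here is a minimal sketch. The signals and the cutoff of 5 come from the description above. Every weight, the half-life, and the names Candidate, base_score and gate are illustrative assumptions, not the real implementation. The text also does not say which candidates are injected once the gate passes, so the sketch simply returns them sorted.

```python
from dataclasses import dataclass

RERANK_MIN = 5.0  # from the text: drop the injection if the best score is under 5


@dataclass
class Candidate:  # hypothetical shape for one vector-stage hit
    memory_id: str
    similarity: float       # vector similarity, assumed normalised to 0..1
    age_days: float         # time since the memory was last updated
    confidence: float       # 0..1, before decay
    type_mult: float = 1.0
    tier_mult: float = 1.0
    usage: float = 0.0      # normalised 0..1 usage signal


def base_score(c: Candidate, half_life_days: float = 90.0) -> float:
    """Combine the signals. All weights are assumptions for illustration."""
    decay = 0.5 ** (c.age_days / half_life_days)  # halves every half_life_days
    additive = (
        0.70 * c.similarity              # primary signal
        + 0.15 * decay                   # recency
        + 0.10 * c.confidence * decay    # confidence decay
        + 0.05 * c.usage                 # small usage component
    )
    return additive * c.type_mult * c.tier_mult


def gate(reranked: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """All-or-nothing gate on reranker scores (0..10)."""
    if not reranked or max(score for _, score in reranked) < RERANK_MIN:
        return []  # nothing is relevant enough, so inject nothing
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

The gate looks only at the best score, so one strong candidate lets weaker ones through unless a second per-candidate filter follows it.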

What I’ve learned

Memory is trying to do one job: inject the right piece of context at the right time. There are two flavours (tools proactively pushing context in, or the agent semantically pulling it), but the real work isn’t the mechanism. It’s the timing. Deciding when to inject or search, and for which piece of information, is the hard problem. Everything else is in service of that timing decision.

The harness is a storage layer in its own right. Claude Code has CLAUDE.md, SKILL.md, and hooks; these aren’t “memory,” but they hold information the agent reads every session. I think of memory as specifically the cross-session context that doesn’t fit the harness shape — or doesn’t want to live statically in a file a human maintains. If you put everything in memory, you overload retrieval. If you put everything in the harness, you lose adaptability.

Corrections are a distinct job memory does. Some memory exists specifically because the underlying LLM has a baked-in tendency to answer wrong. You can’t retrain the model, but you can inject a correction that overrides its prior when it’s about to apply it. Corrections are probably the highest-reliability use of memory because the measurement is simple: does the LLM still do the wrong thing, yes or no.

Persistent memory has to stay correct or it poisons. Stale memories don’t just become useless — they get acted on. From the incident log I keep across agent sessions, the single most common failure mode is the agent confidently stating something it “remembers” without checking whether reality has moved on. Memory amplifies that failure mode. A memory system without a freshness contract is a liability.
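One possible shape for a freshness contract, offered as an assumption rather than a known-good design: each memory carries the time it was last verified and a time-to-live, and the injection step either passes it through, tags it as needing a check, or withholds it. The fields verified_at and ttl, the grace factor, and the three states are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Freshness(Enum):
    FRESH = "fresh"      # inject as-is
    VERIFY = "verify"    # inject, but tell the agent to check before acting
    EXPIRED = "expired"  # do not inject


@dataclass
class Memory:  # hypothetical shape
    text: str
    verified_at: datetime  # last time the content was confirmed
    ttl: timedelta         # how long a confirmation is trusted


def freshness(m: Memory, now: datetime, grace: float = 2.0) -> Freshness:
    """Classify a memory by its age relative to its TTL."""
    age = now - m.verified_at
    if age <= m.ttl:
        return Freshness.FRESH
    if age <= m.ttl * grace:
        return Freshness.VERIFY
    return Freshness.EXPIRED


def render(m: Memory, now: datetime) -> Optional[str]:
    """Return the text to inject, or None if the memory must be withheld."""
    state = freshness(m, now)
    if state is Freshness.EXPIRED:
        return None
    if state is Freshness.VERIFY:
        return f"[unverified since {m.verified_at.date()}; check before relying on this] {m.text}"
    return m.text
```

The middle state is the part aimed at the failure mode above: the agent still gets the memory, along with an explicit prompt to check reality first.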

Manual curation doesn’t scale. I designed a promotion workflow where memories get routed from staging banks to their proper scope. In practice I don’t run it regularly enough. The feedback UI I built, where users rate whether an injected memory was helpful, has essentially zero submissions. People won’t curate in the flow. If the system doesn’t curate itself from implicit signal, it doesn’t get curated.

The closed loop is missing. My system writes injection data on every session — which memories were retrieved, which were injected, which reranker scores they got. Almost none of it gets read back. When I measured earlier this year, 93% of the memories I’d stored had never been retrieved in a live session, and two memories accounted for 39% of all injections. That’s not a ranking problem — it’s a telemetry problem. I’m recording the data that would tell me which memories matter; I’m just not using it.
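The two figures above are cheap to recompute on a schedule once the injection log is read back. A minimal sketch, assuming the log can be flattened into a set of stored IDs, a set of retrieved IDs, and a list with one memory ID per injection event (hypothetical shapes, not the real schema):

```python
from collections import Counter


def never_retrieved_share(stored_ids: set[str], retrieved_ids: set[str]) -> float:
    """Fraction of stored memories that never appeared in a live retrieval."""
    if not stored_ids:
        return 0.0
    return len(stored_ids - retrieved_ids) / len(stored_ids)


def top_k_injection_share(injections: list[str], k: int = 2) -> float:
    """Share of all injection events taken by the k most-injected memories."""
    if not injections:
        return 0.0
    counts = Counter(injections)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(injections)


if __name__ == "__main__":
    stored = {"m1", "m2", "m3", "m4"}
    log = ["m1", "m1", "m1", "m2"]  # one entry per injection event
    print(never_retrieved_share(stored, set(log)))
    print(top_k_injection_share(log, k=1))
```

Tracking both numbers over time would show whether any change to retrieval actually spreads injections across more of the store.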

Recall counts without circuit-breakers create a feedback loop. I originally had a recall_count field that fed into scoring: retrieved memories got slightly boosted, which made them more likely to be retrieved. This ran away fast — a few memories colonised the top of every result. I removed recall from scoring but didn’t replace it. So usage signal exists in the database but isn’t read anywhere. I want to put it back, safely. Analytics separated from scoring, probably. Exploration in the retrieval policy, probably. Something.
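Two candidate safeguards, sketched under assumptions: a usage boost that saturates and is hard-capped, so it can break ties but never dominate, and an epsilon-style exploration slot that occasionally swaps in a memory from outside the top results. The function names, weight, cap, and epsilon are all hypothetical, and this is one standard pattern rather than a verified fix for this system.

```python
import math
import random


def usage_boost(recent_recalls: int, weight: float = 0.02, cap: float = 0.05) -> float:
    """Saturating, capped boost: grows with log(1 + recalls), never above cap."""
    return min(cap, weight * math.log1p(max(0, recent_recalls)))


def select(scored: list[tuple[str, float]], k: int, epsilon: float,
           rng: random.Random) -> list[str]:
    """Pick the top-k IDs by score. With probability epsilon, give the last
    slot to a random candidate from outside the top-k."""
    ranked = [mid for mid, _ in sorted(scored, key=lambda p: p[1], reverse=True)]
    chosen, rest = ranked[:k], ranked[k:]
    if chosen and rest and rng.random() < epsilon:
        chosen[-1] = rng.choice(rest)  # exploration slot
    return chosen


if __name__ == "__main__":
    scored = [("m1", 0.9), ("m2", 0.8), ("m3", 0.7), ("m4", 0.1)]
    print(select(scored, k=2, epsilon=0.1, rng=random.Random()))
    print(usage_boost(3), usage_boost(3000))
```

To keep analytics separate from scoring, the raw counts would stay in their own table and only this bounded value would reach the scorer. Exploration picks would be logged as such, so they do not inflate the usage signal they are meant to correct.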

Everything ultimately lives in files. At the storage layer, a memory is a row in a database or a file on disk. The market and the open-source community are iterating on variations — different access patterns, different indexing strategies, different ways to self-improve a file-based store. There’s no canonical answer yet. I’m operating in an unsettled space.

Hook-dependent architectures don’t port. A lot of published memory systems, PAI being the clearest example, rely heavily on Claude Code’s hooks (pre/post tool-use, stop hooks, etc.) to do injection and curation. GitHub Copilot CLI has no hooks. I run both Claude and Copilot stacks, so any architecture I build needs to work without hooks on at least one side. That rules out some of the slicker published designs.

Gut feel, flagged as assumption: I believe agents interacting with memory through an MCP server is more effective than agents doing direct file search over a memory directory. I can’t prove it. It’s a bet on structured access with a known contract versus raw grep. If someone can tell me I’m wrong — or right, and for specific reasons — that would save me a lot of time.

Where I’m fuzzy

One honest admission: I don’t have a clear mental model of what actually happens when a prompt gets added to chat context, or what happens during compaction. I operate on intuition there, and intuition is a bad foundation for memory-system design because injection is adding to context, and compaction determines what survives. There are adjacent things I’m probably fuzzy on that I haven’t named because I don’t know enough yet to know what I’m missing.

What I want

Three layered claims, in order of end goal → enabler → substrate.

End goal — skills and workflows that self-improve. Our agents have skills (versioned, discoverable, per-project configurable) and workflows (written procedures for recurring tasks). Right now both are static artefacts: a human writes them, deploys them, and they stay that way until a human edits them. I want them to compound. A skill that’s been used successfully across twenty diverse sessions should get sharper. A workflow whose step three never actually gets followed should either prune step three or surface why it keeps getting skipped.
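For the workflow half of that goal, the simplest evidence is a per-step skip rate. The sketch below assumes each run is recorded as a mapping from step name to whether the step was followed. That recording format, the flag_steps name, and both thresholds are hypothetical.

```python
from collections import Counter


def flag_steps(runs: list[dict[str, bool]], min_runs: int = 10,
               threshold: float = 0.8) -> list[str]:
    """Return steps skipped in at least `threshold` of the runs that included
    them, ignoring steps seen in fewer than `min_runs` runs."""
    seen: Counter = Counter()
    skipped: Counter = Counter()
    for run in runs:
        for step, followed in run.items():
            seen[step] += 1
            if not followed:
                skipped[step] += 1
    return sorted(
        step for step in seen
        if seen[step] >= min_runs and skipped[step] / seen[step] >= threshold
    )


if __name__ == "__main__":
    runs = [{"step-1": True, "step-2": True, "step-3": False}] * 12
    print(flag_steps(runs))
```

A flagged step is a prompt for review, not an automatic prune: the data shows that it gets skipped, not why.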

Enabler — memory as the substrate for signal. For skills and workflows to self-improve, they need evidence: usage outcomes, failure modes, successful patterns, activation contexts. That evidence is memory. The skill-system and the memory-system aren’t two separate projects — the memory system is the thing that lets the skill-system learn.

Substrate requirement — cross-project, self-curating, with defensible promotion. For memory to serve that role, it has to be trustworthy. That means: cross-project (signal from one project informs another where applicable), promotion logic that doesn’t require me personally running it, and self-optimisation that avoids the recall-count feedback-loop trap I already hit. Analytics separated from scoring. Probably some exploration in the retrieval policy so that dead weight gets a chance to be seen. A freshness contract that prevents staleness from poisoning outputs.

I don’t know exactly what the mechanism is. That’s part of why I want help.

Where I’d welcome help

I’m looking for someone willing to go deep — not a one-shot consult, not “here’s a link to Mem0,” but sustained attention. Specifically:

Become the expert on what’s in market. Mem0, Letta (formerly MemGPT), Zep, PAI, whatever else. Someone who’s used or read the internals of multiple systems and can tell me what’s actually good versus ornamental.
Optimise memory within a skills-strong agent system. The memory system isn’t an end product; it’s the engine behind a skills-and-workflows stack that’s supposed to compound. That constraint matters. I don’t want a memory system that’s great in isolation but doesn’t serve the skill layer.
Validate or kill my assumptions. Is MCP-mediated memory actually better than file-based search, and under what conditions? Is there a circuit-breaker pattern for recall-count-in-scoring that avoids runaway concentration? What does a freshness contract look like in practice?

I’m less interested in: greenfield rewrites, “just use RAG,” or generic agent-architecture advice. I already have a working system with measured behaviour. I want someone who can help me make that system better, not replace it.

Closing

Memory is infrastructure for compounding. I’ve been treating it as the project. I think the real project is the thing memory enables — skills and workflows that get better over time — and memory is the substrate that needs to hold up under that weight. If you’ve read this and think you can help me make the substrate hold, tell me.