SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

TL;DR

Standard retrieval-augmented agents treat long-term memory as a static library indexed by semantic similarity. That breaks the moment a question depends on a cause, a timeline, or a chain of facts that isn't lexically close to the query. SYNAPSE reframes memory the way cognitive science describes human recall — as spreading activation, energy propagating through a graph — so the agent can surface what is structurally relevant even with zero word or embedding overlap with the query. On the LoCoMo long-horizon benchmark it sets a new state of the art (40.5 weighted F1, ahead of Zep at 39.7 and +7.2 over A-Mem, ranked first in every category), lifts multi-hop reasoning by up to 23%, and rejects 96.6% of adversarial queries — while using ~95% fewer tokens and running ~4× faster than full-context methods.

The bug isn't forgetting. It's isolation.

Picture an agent that's been talking with a user for weeks. The user asks:

"Why am I feeling so anxious today?"

A vector-RAG memory does the obvious thing: it embeds the query and pulls back the memories nearest to anxiety — recent messages about stress, bad sleep, a tense week. Reasonable. Except the actual cause was a scheduling conflict the user logged three weeks ago. That note never used the word "anxiety." In embedding space it sits nowhere near the query. So it stays buried, and the agent gives a fluent, confident, slightly-wrong answer.

We call this failure mode Contextual Isolation, and it comes from an assumption baked into almost every RAG system — the Search Assumption: that a memory's relevance is determined by its semantic proximity to the current query.

That assumption is fine for fact lookup. It collapses for anything requiring causal, temporal, or transitive reasoning. Hierarchical managers like MemGPT[4] improve where context lives, but they're still query-driven retrievers: they can't autonomously surface a memory that's structurally connected yet semantically distant. For a system whose entire value proposition is accumulating experience over a long horizon, that's a load-bearing flaw — not a rough edge.

Recall is propagation, not search

Here's the idea we borrowed, and it's an old one. Spreading-activation theory (Collins & Loftus, 1975[1]) and Anderson's spreading-activation account of memory (1983[2]), the line of work behind the ACT-R architecture, model human memory retrieval not as a search but as a flow of energy. A concept lights up; the things connected to it (semantically, temporally, causally) light up too, without you deliberately looking them up. Say "ski trip" and your mind quietly surfaces who you went with and what happened after, even though you never asked.

That "one cue pulls a chain" behavior is exactly what flat vector stores throw away. SYNAPSE (Synergistic Associative Processing & Semantic Encoding) operationalizes it: a query injects energy into a memory graph, the energy propagates along temporal and causal edges, and memories that are structurally salient get prioritized — even with zero lexical or embedding overlap with the query.

The SYNAPSE architecture

Four pieces fit together. None is exotic on its own; the result comes from making them interact.

Figure 1. The bridge-node effect. Dual triggers inject energy at "Kendall" (lexical) and "Ski Trip" (semantic). Activation propagates through "Mark", a concept that never appears in the query, and reaches the evidence "they broke up last week," which shares no words or embedding proximity with the question. Lateral inhibition simultaneously suppresses the semantically similar but irrelevant distractor.

3.1 A unified episodic–semantic graph

Memory is a directed graph with two node types, mirroring the psychological distinction between episodic and semantic memory:

Episodic nodes — each raw interaction turn, storing its text, a dense embedding (all-MiniLM-L6-v2, 384-d), and a timestamp.
Semantic nodes — abstract concepts (entities, preferences, events) the LLM consolidates from the dialogue every five turns, with embedding-level deduplication so the concept vocabulary stays canonical.
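The embedding-level deduplication of semantic nodes could look roughly like the sketch below. The post only states that deduplication happens at the embedding level; the function name `canonical_concept`, the 0.9 cosine threshold, and the 2-d toy vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def canonical_concept(name, embedding, vocabulary, threshold=0.9):
    """Return the canonical name for a concept, registering it if it is new.

    vocabulary maps canonical concept name -> embedding. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    best_name, best_sim = None, -1.0
    for known_name, known_emb in vocabulary.items():
        sim = cosine(embedding, known_emb)
        if sim > best_sim:
            best_name, best_sim = known_name, sim
    if best_name is not None and best_sim >= threshold:
        return best_name          # near-duplicate: reuse the existing node
    vocabulary[name] = embedding  # genuinely new concept: add a node
    return name

vocab = {}
first = canonical_concept("ski trip", [1.0, 0.0], vocab)
second = canonical_concept("skiing vacation", [0.98, 0.2], vocab)  # near-duplicate
third = canonical_concept("job interview", [0.0, 1.0], vocab)      # distinct concept
```

Here "skiing vacation" collapses onto the existing "ski trip" node, while "job interview" becomes a new node, so the vocabulary ends up with two concepts.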

Three edge types decide how relevance can travel: temporal edges chain consecutive episodes, abstraction edges link episodes to the concepts they mention, and association edges connect concepts to each other.

One engineering note so this scales: to avoid O(|V|²) blowup over a long deployment, each node keeps only its top-15 incoming edges, and nodes whose activation stays dormant across ten consolidation windows are archived to disk — the active graph stays compact (≤10K nodes) and query latency stays independent of history length.
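A minimal sketch of the graph container with the incoming-edge cap follows. The class and field names are hypothetical, and ranking incoming edges by weight is an assumption: the text says each node keeps its top-15 incoming edges but does not say how they are ranked.

```python
from collections import defaultdict
from dataclasses import dataclass

EDGE_TYPES = {"temporal", "abstraction", "association"}

@dataclass
class MemoryNode:
    node_id: str
    kind: str          # "episodic" (a raw turn) or "semantic" (a concept)
    text: str
    timestamp: float = 0.0

class MemoryGraph:
    def __init__(self, max_in_edges=15):
        self.nodes = {}
        self.in_edges = defaultdict(dict)  # dst -> {src: (edge_type, weight)}
        self.max_in_edges = max_in_edges

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, edge_type, weight):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        incoming = self.in_edges[dst]
        incoming[src] = (edge_type, weight)
        # Enforce the cap by dropping the weakest incoming edge (assumed rule).
        if len(incoming) > self.max_in_edges:
            weakest = min(incoming, key=lambda s: incoming[s][1])
            del incoming[weakest]

graph = MemoryGraph(max_in_edges=2)  # tiny cap so the pruning is visible
graph.add_node(MemoryNode("c_ski", "semantic", "ski trip"))
for i, weight in enumerate([0.9, 0.2, 0.6]):
    graph.add_node(MemoryNode(f"e{i}", "episodic", f"turn {i}", timestamp=float(i)))
    graph.add_edge(f"e{i}", "c_ski", "abstraction", weight)

kept_sources = sorted(graph.in_edges["c_ski"])  # the 0.2 edge from e1 is pruned
```

Because the cap is applied per destination node at write time, the number of stored edges grows linearly with the number of nodes rather than quadratically.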

3.2 Spreading activation: the fan effect and lateral inhibition

The graph is the substrate; the dynamics are where the cognitive science earns its keep. Dual-trigger anchoring seeds the process: BM25 for exact entity hits (precision) and dense retrieval for conceptual matches (recall). Then, at each iteration, every node's potential is updated by propagation with the fan effect (from ACT-R):

$$u_i^{(t+1)} \;=\; (1-\delta)\,a_i^{(t)} \;+\; \sum_{j \in \mathcal{N}(i)} \frac{S \cdot w_{ji}\, a_j^{(t)}}{\operatorname{fan}(j)}$$

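where $u_i^{(t+1)}$ is node $i$'s potential before inhibition and squashing, $a_i^{(t)}$ its activation (firing rate) at iteration $t$, $\mathcal{N}(i)$ the neighbors that send activation to $i$, $S$ the spreading factor, $\delta$ a retention decay, and $\operatorname{fan}(j)$ the out-degree of the sender. Temporal edges carry weight $e^{-\rho\,\Delta t}$, where $\Delta t$ is the time gap the edge spans and $\rho$ a decay rate, so old links transmit less; semantic edges carry cosine similarity.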

Dividing by $\operatorname{fan}(j)$ is what stops high-frequency hubs like "weekend" or "airport" from hoarding all the activation and drowning out specific signal: a node with many connections dilutes the energy it passes along. This is the architectural answer to the "hub explosion" that plagues naive random walks over dense graphs. Lateral inhibition then applies winner-take-all competition — the top-M activated nodes suppress their competitors in proportion to the activation gap — before a sigmoid squashes the result into a firing rate. The loop runs strictly as propagate → inhibit → activate and stabilizes within three iterations, so retrieval cost is bounded and small.
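The propagate → inhibit → activate loop can be sketched in a few lines. The propagation step follows the update equation above; everything else is an assumption made for illustration: the exact inhibition rule (here, each loser is reduced by `beta` times its summed gap to the top-M winners), the shifted sigmoid with `gamma` and `theta`, and all numeric defaults.

```python
import math

def spread_activation(edges, seeds, S=0.8, delta=0.5, top_m=2,
                      beta=0.1, gamma=10.0, theta=0.3, iterations=3):
    """Run propagate -> inhibit -> activate for a fixed number of iterations.

    edges: dict mapping (src, dst) -> weight w_ji.
    seeds: dict mapping node -> initial activation from the dual triggers.
    All numeric defaults are illustrative assumptions.
    """
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    fan = {n: 0 for n in nodes}
    for src, _dst in edges:
        fan[src] += 1  # out-degree of each sender
    a = {n: seeds.get(n, 0.0) for n in nodes}

    for _ in range(iterations):
        # 1) Propagate: retained activation plus fan-diluted input.
        u = {n: (1.0 - delta) * a[n] for n in nodes}
        for (src, dst), w in edges.items():
            u[dst] += S * w * a[src] / fan[src]

        # 2) Inhibit: the top-M nodes suppress the rest in proportion to the gap.
        winners = sorted(nodes, key=lambda n: u[n], reverse=True)[:top_m]
        inhibited = dict(u)
        for n in nodes:
            if n not in winners:
                gap = sum(max(0.0, u[k] - u[n]) for k in winners)
                inhibited[n] = max(0.0, u[n] - beta * gap)

        # 3) Activate: a shifted sigmoid maps potential to a firing rate.
        a = {n: 1.0 / (1.0 + math.exp(-gamma * (inhibited[n] - theta)))
             for n in nodes}
    return a

# A hub with five outgoing edges versus a specific node with one.
edges = {("specific", "t_spec"): 1.0, ("hub", "t_hub"): 1.0}
edges.update({("hub", f"x{i}"): 1.0 for i in range(4)})
final = spread_activation(edges, seeds={"hub": 1.0, "specific": 1.0})
```

Both seeds start with the same energy, but the hub splits its output five ways, so `final["t_spec"]` ends up far above `final["t_hub"]`. That is the fan effect in miniature.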

3.3 Triple-signal hybrid retrieval

Final scoring fuses three deliberately orthogonal signals:

$$\mathcal{S}(v) \;=\; \lambda_1 \cdot \underbrace{\operatorname{sim}(\mathbf{h}_v, \mathbf{h}_q)}_{\text{semantic}} \;+\; \lambda_2 \cdot \underbrace{a_v^{(T)}}_{\text{activation}} \;+\; \lambda_3 \cdot \underbrace{\operatorname{PageRank}(v)}_{\text{structural}}$$

PageRank is a global structural prior — the hubs that matter no matter what you asked (the main characters of a long conversation). Activation is the local, query-specific signal that just propagated. Semantic similarity is the familiar direct match. Keeping them decoupled means a novel-but-locally-important detail doesn't get steamrolled by a global hub. Defaults: λ = {0.5, 0.3, 0.2}; sensitivity analysis shows robustness across λ₃ ∈ [0.1, 0.3] and retrieval depth k ∈ [20, 40].
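As a rough illustration of the fusion, the sketch below scores two toy candidates with the default weights. The candidate names, embeddings, activations, and PageRank values are made up, and it assumes all three signals are already scaled to [0, 1].

```python
import math

LAMBDAS = (0.5, 0.3, 0.2)  # default (semantic, activation, structural) weights

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_score(node_emb, query_emb, activation, pagerank, lambdas=LAMBDAS):
    """Weighted sum of semantic similarity, final activation, and PageRank."""
    l_sem, l_act, l_struct = lambdas
    return (l_sem * cosine(node_emb, query_emb)
            + l_act * activation
            + l_struct * pagerank)

query = [1.0, 0.0]
candidates = {
    # name: (embedding, activation, pagerank), all toy values
    "distractor": ([0.6, 0.8], 0.0, 0.05),        # similar to the query, never activated
    "bridge_evidence": ([0.0, 1.0], 0.95, 0.30),  # zero similarity, strongly activated
}
scores = {name: hybrid_score(emb, query, act, pr)
          for name, (emb, act, pr) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these toy numbers the distractor scores about 0.31 and the bridge evidence about 0.345, so a memory with no embedding overlap still ranks first once activation and structure are counted.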

3.4 Knowing when you don't know

The most under-appreciated part, and arguably the most useful in production: teaching the agent to say "no record of that." SYNAPSE adds a metacognitive layer inspired by the human Feeling of Knowing, with two mechanisms:

Confidence gating. If retrieval confidence falls below a threshold (τ = 0.12, calibrated on a held-out split to keep false refusals under 2.5%), the system fires a negative acknowledgement: it refuses before generating, the way a brain inhibits a response when the memory trace is too weak.
Explicit verification. For borderline cases, a hard-constrained prompt forces the question: is this EXPLICITLY in memory? If not, say "Not mentioned."
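The confidence gate is simple enough to sketch. How SYNAPSE computes retrieval confidence is not spelled out in this post, so the sketch assumes it is the best retrieval score; `gated_answer`, `fake_generate`, and the refusal string are hypothetical names. The explicit-verification prompt is not modeled here.

```python
TAU = 0.12  # refusal threshold quoted in the text
REFUSAL = "No record of that."

def gated_answer(retrieved, generate, tau=TAU):
    """Refuse before generation when retrieval confidence is too low.

    retrieved: list of (memory_text, score) pairs, in any order.
    generate: hypothetical callable that turns memory texts into an answer.
    """
    # Assumption: confidence is the best score among retrieved memories.
    confidence = max((score for _, score in retrieved), default=0.0)
    if confidence < tau:
        return REFUSAL  # the generator is never called
    evidence = [text for text, score in retrieved if score >= tau]
    return generate(evidence)

calls = []

def fake_generate(evidence):
    """Stand-in for the LLM call; records what it was given."""
    calls.append(evidence)
    return "Answer grounded in: " + "; ".join(evidence)

weak = gated_answer([("toy dinosaur named Rex", 0.07)], fake_generate)
strong = gated_answer(
    [("moved to the US last year", 0.41), ("likes tea", 0.05)], fake_generate
)
```

The weak query returns the refusal without touching the generator, which is the point of gating before generation: there is no draft answer for the model to rationalize.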

This targets the scariest failure of memory-augmented agents: memory hallucination, where the model confidently invents an event that never happened. And importantly, the headline results don't ride on refusals: with the gate disabled entirely, SYNAPSE still averages 40.3 — the structural retrieval carries its own weight.

What's actually new

Graph memory for agents isn't new, so it's fair to ask where SYNAPSE differs:

System	Core mechanism	Gap for long-horizon agents
MemGPT / MemoryOS[4]	OS-style context paging	Memories stay independent text units — no relational structure
GraphRAG[6]	Community detection + global summaries	Corpus-level sense-making; too coarse for minute-level episodes
HippoRAG[5]	Personalized PageRank over a KG	Static pre-indexed corpora — no O(1) incremental write, no time decay
Zep / AriGraph[7]	Temporal knowledge graphs	Structure without dynamics — relevance still computed per item
SYNAPSE	Cognitive activation dynamics on a live episodic-semantic graph	—

Positioning against existing agentic-memory systems. The one-line answer: prior systems add structure; SYNAPSE adds dynamics on top of structure — fan-effect dilution, lateral inhibition, temporal decay — and the ablations in §5 show the dynamics, not the topology, do the heavy lifting.

Results

We evaluate on LoCoMo[3]: long-horizon dialogues averaging ~16K tokens across up to 35 sessions, scored across five categories. All improvements are statistically significant (paired t-test, p < 0.05, N = 500), and results are stable across three random seeds (±0.2).

Method	Multi-Hop	Temporal	Open-Domain	Single-Hop	Adversarial	Weighted Avg*	Rank
A-Mem	27.0	45.9	12.1	44.7	50.0	33.3	4.8
AriGraph	28.5	43.2	14.5	45.1	48.5	33.7	4.6
MemoryOS	35.3	41.2	20.0	48.6	—	38.0	—
Zep	35.5	48.5	23.1	48.0	65.4	39.7	2.6
SYNAPSE (ours)	35.7	50.1	25.9	48.9	96.6	40.5	1.0

Main results on LoCoMo (F1, GPT-4o-mini backbone; strongest baselines shown). *Weighted average over the four non-adversarial categories — we deliberately exclude the adversarial column from our own headline number so near-perfect rejection cannot inflate it.

New SOTA: 40.5 weighted F1, ahead of Zep (39.7) and +7.2 over A-Mem (33.3), with a perfect task rank of 1.0 — first in every category.
Temporal: 50.1. Time-aware activation decay correctly prefers current facts over semantically similar but stale ones.
Multi-Hop: 35.7 vs 27.0 for A-Mem. Activation relays relevance through intermediate nodes, reconnecting fact chains vector search leaves broken. (Under an LLM-as-judge protocol the gap widens: 84.2 vs 63.7 for MemoryOS.)
Adversarial: 96.6. Near-perfect refusal on questions about things that don't exist, versus baselines that happily hallucinate.

But the single result we'd point a skeptic to is the low-similarity stress test (Figure 2). Restrict the test set to cases where the evidence is deliberately far from the question in embedding space, and the semantic-only baseline collapses while SYNAPSE barely moves.

Figure 2. The low-similarity stress test. When evaluation is restricted to questions whose gold evidence sits far from the query in embedding space, the semantic-only baseline loses over half its F1 while SYNAPSE degrades by less than 8%. That is the core thesis confirmed in one number: relevance can be recovered from structure when similarity is gone.

5.1 Three failures, side by side

Numbers aggregate; examples convince. The same three queries through a semantic-only baseline and SYNAPSE:

Query	Semantic-only baseline	SYNAPSE
Asks about a pet dog that was never mentioned	Retrieves "…toy dinosaur, Rex" → "She has a dog named Rex."	Confidence gate fires → "No record of such pet found."
"Where does Caroline live?"	Top hit (cos 0.92): "moved from Sweden 4 years ago…" → "She lives in Sweden."	Temporal decay boosts the recent episode → "Currently in the US."
"Would Caroline likely have Dr. Seuss books?"	No lexical bridge from "collects books" → "Uncertain / no info."	Walk: Caroline → classic books → Dr. Seuss → "Yes, likely."

Qualitative comparison of retrieval behaviors. Three different cognitive mechanisms — gating, decay, graph traversal — each catch the exact failure they were designed for.

5.2 Ablations: every mechanism is load-bearing

The ablations tell the story by subtraction. Remove node decay and temporal reasoning craters (50.1 → 14.2). Remove the fan effect and open-domain sags as hubs flood the graph (25.9 → 16.8). Remove lateral inhibition and even simple single-hop lookup degrades — noise competes with signal before the gate ever sees it. Strip everything back to vectors only and you land at 25.2 average. Each mechanism is load-bearing for a specific kind of reasoning, which is what you'd hope from a design that borrows distinct mechanisms from cognitive science rather than one big trick.

5.3 A note on metrics

Token-overlap F1 systematically undervalues memory systems that reason. A real example from our error analysis: asked "how long has she been practicing art?", the gold answer is "Since 2016"; SYNAPSE answered "Seven years" — arithmetically correct relative to the conversation's date — and scored F1 = 0.0. Under an LLM-as-judge protocol that scores semantic correctness, SYNAPSE reaches 80.7 overall, again first across systems. We report both, because n-gram metrics reward parroting and punish understanding; if you evaluate agentic memory, you should too.
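The zero score is easy to reproduce with a simplified token-overlap F1. This sketch lowercases and splits on word characters only; official LoCoMo scoring may normalize answers differently, so treat it as an approximation of the metric rather than the benchmark's exact scorer.

```python
import re
from collections import Counter

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall (simplified normalization)."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    gold_tokens = re.findall(r"\w+", gold.lower())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

correct_but_reworded = token_f1("Seven years", "Since 2016")
verbatim = token_f1("since 2016", "Since 2016")
partial = token_f1("She started in 2016", "Since 2016")
```

"Seven years" shares no token with "Since 2016", so it scores 0.0, while a verbatim copy scores 1.0 and the clumsier "She started in 2016" still earns about 0.33 for the shared "2016".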

5.4 Weaker backbones benefit more

One cross-backbone finding with real deployment consequences: on a small Qwen-3B backbone, SYNAPSE averages 36.6 F1 where MemoryOS gets 22.1 and A-Mem 16.2. Structured activation partially compensates for limited reasoning capacity — the retrieval stage exposes the relational evidence the small model can't infer on its own. Good memory lets you ship smaller, cheaper models without falling off a quality cliff.

Efficiency

The thing that makes this practical rather than a research curiosity:

Method	Tokens / query	Latency	Cost / 1K queries	F1 (excl. adv.)
Full-context (LoCoMo)	~16,910	8.2s	$2.67	25.6
MemGPT	~16,977	8.5s	$2.67	28.0
MemoryOS	~1,198	1.5s	$0.30	38.0
SYNAPSE (ours)	~814	1.9s	$0.24	40.5

Efficiency profile on LoCoMo (GPT-4o-mini; cost = total API cost at standard rates). Graph-construction costs amortize over the agent's lifetime and are negligible per-query.

Cost–accuracy trade-off across memory systems. Because activation retrieves a small relevant subgraph instead of stuffing the whole history into context, SYNAPSE runs at ~95% fewer tokens and ~11× lower cost than full-context methods, at ~4× the speed — while scoring higher. On cost-efficiency (F1 per dollar: 167.3) it tops the field.
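The headline ratios follow directly from the table. The short check below recomputes them from the full-context and SYNAPSE rows; the variable names are ours, and the inputs are the rounded figures shown above, so results are approximate.

```python
# Rows copied from the efficiency table (rounded values as published).
full_context = {"tokens": 16910, "latency_s": 8.2, "cost_per_1k": 2.67}
synapse = {"tokens": 814, "latency_s": 1.9, "cost_per_1k": 0.24}

token_reduction = 1 - synapse["tokens"] / full_context["tokens"]
cost_ratio = full_context["cost_per_1k"] / synapse["cost_per_1k"]
speedup = full_context["latency_s"] / synapse["latency_s"]

print(f"token reduction: {token_reduction:.1%}")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"speedup: {speedup:.1f}x")
```

These come out to roughly 95% fewer tokens, about 11× lower cost, and a little over 4× lower latency, matching the rounded claims in the text.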

Where it breaks (because it does)

Three honest limitations, each a research direction:

Cold start. Spreading activation needs a connected topology. In a brand-new conversation with almost no history, graph upkeep is overhead a plain linear buffer would beat.
Cognitive tunneling. The same lateral inhibition that sharpens focus can, on a simple query, over-prune a low-degree detail. Our favorite failure case: when asked "what color was John's jacket?", a strongly activated "Airport Trip" hub pushed the weakly connected "green jacket" episode just below the pruning threshold. Aggressive focus has a cost, and we document it rather than hide it.
Text-only, for now. Embodied agents need vision and audio; folding image and audio nodes into the same graph via aligned embedding spaces is the obvious next step.

Memory that forgets on purpose

"A memory system that never forgets is a privacy liability wearing a feature costume."

Two properties of SYNAPSE's design matter here, and they fell out of the cognitive modeling rather than being bolted on:

Granular forgetting is native. Temporal decay and dormancy-based pruning naturally implement a "right to be forgotten": obsolete or unused memories lose activation and get archived out of the live graph, instead of persisting indefinitely in a vector store.
The graph is self-contained. An episodic-semantic graph is a compact, local artifact — it can live on-device or in an encrypted enclave, rather than requiring conversation history to be shipped wholesale to a third-party index.
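Dormancy-based archival can be sketched as a counter per node. The ten-window limit comes from the engineering note in §3.1; the activation floor that defines "dormant", the function name, and the data layout are assumptions for illustration.

```python
DORMANCY_LIMIT = 10  # consolidation windows a node may stay dormant (from §3.1)

def consolidation_step(dormant_counts, activations, floor=0.05,
                       limit=DORMANCY_LIMIT):
    """Advance dormancy counters by one consolidation window.

    dormant_counts: node_id -> consecutive dormant windows so far.
    activations: node_id -> peak activation during this window.
    Returns (updated_counts, archived_ids). The floor is an assumed cutoff.
    """
    updated, archived = {}, []
    for node_id, count in dormant_counts.items():
        if activations.get(node_id, 0.0) > floor:
            updated[node_id] = 0          # the node was used: reset its counter
        elif count + 1 >= limit:
            archived.append(node_id)      # dormant too long: leave the live graph
        else:
            updated[node_id] = count + 1
    return updated, archived

counts = {"old_schedule": 9, "school_list": 9, "new_essay": 0}
live, archived = consolidation_step(counts, {"school_list": 0.8})
```

In this toy run "old_schedule" reaches its tenth dormant window and is archived, "school_list" is reset because it was just used, and "new_essay" simply ages by one window.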

If you're building agents that accumulate months of someone's personal context, these aren't nice-to-haves. They're the difference between a memory feature users trust and one they turn off.

From the paper to the product

This isn't an abstract benchmark exercise for us. The same problem SYNAPSE attacks — an agent that has to remember a person's evolving record across a long relationship and surface the right piece at the right time — is the problem at the center of EZCollegeApp, GyriQAI's AI workspace for U.S. undergraduate applications.

An application season runs for months and looks exactly like a long-horizon memory benchmark, except the stakes are a teenager's future: the AI counselor has to ground its advice in a student's actual record — profile, uploaded transcripts and documents, the evolving school list, where each essay draft stands — and carry that across many sessions instead of starting cold each time. Every failure mode in this post has a counseling twin:

Contextual isolation — a student's anxiety in January traces back to a deadline conflict logged in October.
Temporal staleness — the school list changed last week; advice anchored to September's list is worse than no advice.
Multi-hop — the robotics club mentioned in one session is the missing ingredient of an essay angle discussed in another.
Hallucination — an invented deadline or a fabricated "you told me…" is instantly trust-ending; the agent must be able to say "that's not in your record."

Deliberately, EZCollegeApp is a planning and organization tool. It works from the student's own materials and doesn't fabricate experiences; it supports a human counselor rather than replacing one; forecasts are guidance, not guarantees; and students still verify official deadlines and complete official submissions themselves. Episodic + semantic long-term memory is the difference between "a chatbot bolted onto an essay editor" and an assistant that actually remembers your case. That's the line of research SYNAPSE pushes on, and the line of product we're building toward — including the economics: a memory layer that makes small backbones perform (§5.4) and cuts tokens by 95% (§6) is what lets research-grade personalization ship at a price a family can afford.

Resources & citation

The full paper has the algorithms, hyperparameter sensitivity, cross-backbone results (GPT-4o, Qwen-1.5B/3B), and the qualitative analyses we couldn't fit here.

Paper — Findings of ACL 2026 (ACL Anthology link coming with the proceedings)
Code — github.com/hq0709/synapse: implementation, LoCoMo evaluation harness, ablations, runnable diagnostics
Product — ezcollegeapp.com

@inproceedings{jiang2026synapse,
  title     = {{SYNAPSE}: Empowering {LLM} Agents with Episodic-Semantic
               Memory via Spreading Activation},
  author    = {Jiang, Hanqi and Chen, Junhao and Pan, Yi and Chen, Ling
               and You, Weihang and Zhou, Yifan and Zhang, Ruidong
               and Abate, Yohannes and Liu, Tianming},
  booktitle = {Findings of the Association for Computational Linguistics:
               ACL 2026},
  year      = {2026}
}

If you're building agentic memory and have hit Contextual Isolation yourself (the moment your retriever confidently returns everything except the thing that mattered), we'd love to compare notes.

References

[1] Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
[2] Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22(3), 261–295.
[3] Maharana, A., et al. (2024). Evaluating very long-term conversational memory of LLM agents. arXiv:2402.17753.
[4] Packer, C., et al. (2024). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560; Li, Z., et al. (2025). MemOS: A memory OS for AI systems. arXiv:2507.03724.
[5] Gutiérrez, B. J., et al. (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. NeurIPS 37.
[6] Edge, D., et al. (2025). From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130.
[7] Rasmussen, P., et al. (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv:2501.13956; Anokhin, P., et al. (2025). AriGraph: Learning knowledge graph world models with episodic memory for LLM agents. arXiv:2407.04363.