Memory Evolution Pipeline
A self-improving memory system inspired by Meta's HyperAgents research. Traces become episodes, episodes yield knowledge, knowledge drives better performance — an autonomous quality loop that gets smarter with every run.
Overview
The Memory Evolution Pipeline transforms raw agent execution traces into structured, actionable knowledge. Rather than treating memory as a static store, Maximus treats it as a living system that continuously refines itself through a seven-stage pipeline. Each stage feeds into the next, creating a compounding improvement cycle where better data produces better decisions.
Inspired by Meta's HyperAgents paper on self-improving agent architectures, the pipeline combines deterministic heuristics with LLM-powered extraction to build an ever-growing knowledge graph that agents draw on at execution time.
Pipeline Architecture: 7 Stages
The pipeline runs as a periodic background process, processing completed traces through each stage in sequence.
1. Distillation: Scans for completed execution traces and distills them into compact episode summaries. Raw traces can run to thousands of lines; distillation extracts the essential decisions, outcomes, and context into a structured episode record.
2. Scoring: Computes quality metrics for each episode: token counts, tool-call frequencies, success/failure ratios, and execution duration. These metrics feed the promotion engine and help identify high-value episodes worth deeper extraction.
3. Knowledge Extraction: LLM-powered extraction identifies entities (tools, APIs, strategies, error patterns) and relationships (triples) from episode summaries. Extracted knowledge is stored in the knowledge graph as nodes and edges with provenance tracking.
4. Promotion: Metric-driven promotion decides which episodes graduate from working memory into long-term knowledge. High performers are promoted quickly; low performers are held back until further evidence accrues before committing to the graph.
5. Briefing: Generates pre-execution briefings for agents by querying the knowledge graph for relevant context. Briefings include known strategies, common pitfalls, and domain-specific knowledge tailored to the upcoming task.
6. Decay: Removes stale or low-confidence knowledge from the graph. Entities and triples that haven't been reinforced by recent episodes decay over time, keeping the knowledge base lean and current.
7. Cleanup: Archives or deletes raw trace files once their knowledge has been fully extracted and promoted. This prevents unbounded disk growth while preserving the distilled intelligence in the graph.
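The stage sequence can be pictured as a simple sequential driver that threads each trace through the stages in order. This is an illustrative sketch only; the `Episode` record and stage functions here are hypothetical stand-ins, not Maximus's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    # Hypothetical distilled-episode record (not the real schema)
    episode_id: str
    summary: str
    metrics: dict = field(default_factory=dict)

def run_pipeline(traces, stages):
    """Run each completed trace through the stages in order.

    `stages` is an ordered list of callables; each stage receives the
    output of the previous one (trace -> episode -> scored episode -> ...).
    A stage may return None to drop low-value work early.
    """
    results = []
    for trace in traces:
        item = trace
        for stage in stages:
            item = stage(item)
            if item is None:
                break
        if item is not None:
            results.append(item)
    return results

# Toy stage functions standing in for distillation and scoring
def distill(trace):
    return Episode(episode_id=trace["id"], summary=trace["log"][:80])

def score(episode):
    episode.metrics["tokens"] = len(episode.summary.split())
    return episode
```

A driver like this keeps each stage independently testable, which matters when prompts and heuristics inside individual stages evolve over time.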
The Quality Loop
The pipeline creates a self-reinforcing feedback loop. Each component's output improves the input quality of the next, producing compounding returns over time.
```
Better traces
  -> Better episodes
  -> Better knowledge
  -> Better briefings
  -> Better performance
  -> Better traces   // cycle repeats
```
As agents perform better with improved briefings, the traces they produce contain richer decision-making data, which in turn yields higher-quality episodes and knowledge extractions. This virtuous cycle means the system improves autonomously with each execution.
Extraction Evolution & Prompt Versioning
Extraction prompts evolve over time. Each version is tracked by its SHA-256 hash, enabling A/B comparisons and rollback when quality regresses.
SHA-256 Prompt Hashing
Every extraction prompt is hashed before use. The hash is stored alongside each extraction result, creating an immutable audit trail linking every piece of extracted knowledge back to the exact prompt that produced it.
```
// Prompt version tracking
prompt_hash = sha256("Extract entities and triples from...")  // e.g. "a3f8c1d2..."

// Stored with every extraction
extraction_record = {
  episode_id: "ep-20260327-001",
  prompt_hash: "a3f8c1d2...",
  entities: [...],
  triples: [...]
}
```
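In Python, this hashing scheme could look like the following. It is a minimal sketch mirroring the pseudocode field names, not the actual implementation:

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    """SHA-256 hex digest of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

EXTRACTION_PROMPT = "Extract entities and triples from..."

# Every extraction result carries the hash of the prompt that produced it,
# so knowledge can always be traced back to an exact prompt version.
extraction_record = {
    "episode_id": "ep-20260327-001",
    "prompt_hash": prompt_hash(EXTRACTION_PROMPT),
    "entities": [],
    "triples": [],
}
```

Because the hash is deterministic, two extractions made with byte-identical prompts always share a hash, which is what makes per-version A/B comparison possible.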
Quality Metrics
The extraction_metrics table tracks per-prompt-version performance, enabling data-driven prompt evolution.
| Metric | Description | Target |
|---|---|---|
| Entities per Episode | Average number of unique entities extracted from each episode summary | >= 5 |
| Triples per Episode | Average number of relationship triples (subject-predicate-object) per episode | >= 8 |
| Unique Entity Ratio | Proportion of extracted entities that are genuinely new vs. duplicates of existing graph nodes | >= 0.3 |
| Triple Confidence | Average confidence score assigned to extracted triples by the LLM | >= 0.7 |
| Extraction Latency | Wall-clock time for a single episode extraction pass | < 30s |
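Aggregating these metrics per prompt version might look like the sketch below. The record shape and function name are assumptions for illustration; only the targets come from the table above:

```python
def summarize_prompt_version(records):
    """Aggregate per-episode extraction records for one prompt_hash.

    Each record is assumed to look like:
      {"entities": [...], "triples": [...],
       "confidences": [...], "latency_s": float}
    Returns averages checked against the targets in the table above.
    """
    n = len(records)
    entities = sum(len(r["entities"]) for r in records) / n
    triples = sum(len(r["triples"]) for r in records) / n
    conf = sum(sum(r["confidences"]) / len(r["confidences"]) for r in records) / n
    latency = sum(r["latency_s"] for r in records) / n
    return {
        "entities_per_episode": entities,
        "triples_per_episode": triples,
        "triple_confidence": conf,
        "extraction_latency_s": latency,
        "meets_targets": entities >= 5 and triples >= 8
                         and conf >= 0.7 and latency < 30,
    }
```

Grouping records by prompt_hash before calling a summary like this gives one row per prompt version, which is exactly what an A/B comparison or rollback decision needs.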
Metric-Driven Promotion
Not all episodes are created equal. The promotion engine uses quality scores to decide which episodes earn a place in long-term knowledge.
- High scorers: Promoted immediately to the knowledge graph. Their entities and triples are merged with high confidence, and the source episode is flagged as a reference exemplar.
- Mid scorers: Held in a staging area. Knowledge is tentatively added with lower confidence weights and requires corroboration from additional episodes before full promotion.
- Low scorers: Held back entirely. Episodes with low quality scores may contain hallucinated or irrelevant extractions; they remain in working memory as candidates for re-extraction with improved prompts.
The promotion threshold adapts based on overall graph quality. As the graph matures, the bar rises to prevent dilution of high-quality knowledge with marginal additions.
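The three-way decision with an adaptive bar could be sketched as follows. The specific threshold values and the linear adjustment are illustrative assumptions, not the documented tuning:

```python
def promotion_decision(score: float, graph_quality: float,
                       base_threshold: float = 0.6) -> str:
    """Decide an episode's fate from its quality score (both in [0, 1]).

    The effective bar rises with overall graph quality, so marginal
    episodes cannot dilute a mature, high-quality graph. The 0.6 base
    and 0.2 adjustment are illustrative, not real tuning values.
    """
    threshold = base_threshold + 0.2 * graph_quality  # adaptive bar
    if score >= threshold:
        return "promote"    # merged with high confidence
    if score >= threshold - 0.2:
        return "stage"      # held at lower weight, awaiting corroboration
    return "hold"           # candidate for re-extraction
```

Note how the same score can land in different tiers depending on graph maturity: a 0.65 episode promotes into a young graph but only stages into a mature one.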
Strategy Discovery
As the pipeline processes traces, it identifies recurring patterns of successful behavior and codifies them as named strategies. These strategies become first-class entities in the knowledge graph.
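One simple way to mine such patterns is to count recurring tool-call sequences and keep those that repeatedly correlate with success. This is a hedged sketch of the idea; the episode shape, thresholds, and function name are assumptions, not the actual discovery algorithm:

```python
from collections import Counter

def discover_strategies(episodes, min_count=3, min_success=0.8):
    """Find recurring tool-call sequences that correlate with success.

    `episodes` is assumed to be a list of
      {"tools": (tuple of tool names), "success": bool}.
    A sequence becomes a candidate strategy once it appears at least
    `min_count` times with a success rate of `min_success` or better.
    Returns {sequence: success_rate}.
    """
    seen = Counter()
    wins = Counter()
    for ep in episodes:
        seq = tuple(ep["tools"])
        seen[seq] += 1
        if ep["success"]:
            wins[seq] += 1
    return {
        seq: wins[seq] / seen[seq]
        for seq in seen
        if seen[seq] >= min_count and wins[seq] / seen[seq] >= min_success
    }
```

Candidates surfaced this way would still need naming and description (e.g. by an LLM pass) before entering the graph as first-class strategy entities.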
Discovered Strategy Examples
- When trading, always configure account leverage before placing any orders. Prevents rejection errors and ensures correct position sizing across different margin requirements.
- When scraping content from sites with bot protection (Cloudflare, rate limits), route requests through the Jina reader API instead of a direct fetch. Higher success rate and cleaner output.
- When importing more than 50 leads into a campaign, use the bulk upload API endpoint instead of individual create calls. Reduces API calls by 10-100x and avoids rate limiting.
Strategy Registry
Discovered strategies are tracked in a SQLite registry that records usage counts, success rates, and agent affinity. The registry enables data-driven strategy recommendations in briefings.
| Column | Type | Description |
|---|---|---|
| strategy_id | TEXT PK | Kebab-case identifier (e.g. set-leverage-before-orders) |
| description | TEXT | Human-readable summary of the strategy |
| usage_count | INTEGER | Number of times the strategy has been applied |
| success_rate | REAL | Ratio of successful outcomes when the strategy was used |
| agent_id | TEXT | Agent that discovered or most frequently uses this strategy |
| discovered_at | DATETIME | Timestamp of first observation |
| last_used_at | DATETIME | Timestamp of most recent application |