Memory Evolution Pipeline

A self-improving memory system inspired by Meta's HyperAgents research. Traces become episodes, episodes yield knowledge, knowledge drives better performance — an autonomous quality loop that gets smarter with every run.

Overview

The Memory Evolution Pipeline transforms raw agent execution traces into structured, actionable knowledge. Rather than treating memory as a static store, Maximus treats it as a living system that continuously refines itself through a seven-stage pipeline. Each stage feeds into the next, creating a compounding improvement cycle where better data produces better decisions.

Inspired by Meta's HyperAgents paper on self-improving agent architectures, the pipeline combines deterministic heuristics with LLM-powered extraction to build an ever-growing knowledge graph that agents draw on at execution time.

Pipeline Architecture: 7 Stages

The pipeline runs as a periodic background process, processing completed traces through each stage in sequence.

1  Scan & Distill

Scans for completed execution traces and distills them into compact episode summaries. Raw traces can be thousands of lines; distillation extracts the essential decisions, outcomes, and context into a structured episode record.
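A minimal sketch of what distillation might look like; the function and field names here are illustrative assumptions, not the actual Maximus API:

```python
# Hypothetical distiller: collapse a raw trace into a compact episode
# record, keeping only decisions, outcomes, and overall success.
def distill(trace: list[dict]) -> dict:
    decisions = [e for e in trace if e.get("type") == "decision"]
    outcomes = [e for e in trace if e.get("type") == "outcome"]
    return {
        "decisions": [e["summary"] for e in decisions],
        "outcomes": [e["summary"] for e in outcomes],
        "success": all(e.get("ok", False) for e in outcomes),
    }

# A raw trace may hold thousands of events; only a few carry decisions.
trace = [
    {"type": "decision", "summary": "use bulk upload API"},
    {"type": "tool_call", "summary": "POST /bulk-upload"},
    {"type": "outcome", "summary": "201 created", "ok": True},
]
episode = distill(trace)
```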

2  Metrics Collection

Computes quality metrics for each episode: token counts, tool call frequencies, success/failure ratios, and execution duration. These metrics feed the promotion engine and help identify high-value episodes worth deeper extraction.
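The metrics listed above could be computed in one pass over an episode's events; this sketch assumes illustrative event fields (`tokens`, `ok`, `t`) rather than the real schema:

```python
# Illustrative per-episode metric computation.
def episode_metrics(events: list[dict]) -> dict:
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    failures = [e for e in tool_calls if not e.get("ok", True)]
    return {
        "token_count": sum(e.get("tokens", 0) for e in events),
        "tool_calls": len(tool_calls),
        "success_ratio": 1 - len(failures) / len(tool_calls) if tool_calls else 1.0,
        "duration_s": events[-1]["t"] - events[0]["t"] if events else 0.0,
    }

events = [
    {"type": "tool_call", "ok": True, "tokens": 120, "t": 0.0},
    {"type": "tool_call", "ok": False, "tokens": 80, "t": 2.5},
]
m = episode_metrics(events)
```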

3  Entity & Triple Extraction

LLM-powered extraction identifies entities (tools, APIs, strategies, error patterns) and relationships (triples) from episode summaries. Extracted knowledge is stored in the knowledge graph as nodes and edges with provenance tracking.
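One way the extracted nodes and edges could carry provenance back to their source episode; the in-memory graph structure here is a stand-in for the real store:

```python
# Sketch: merge an extraction result into a graph, recording which
# episode each node and edge came from (provenance tracking).
def add_extraction(graph: dict, episode_id: str, entities: list[str],
                   triples: list[tuple[str, str, str]]) -> None:
    for name in entities:
        node = graph["nodes"].setdefault(name, {"sources": set()})
        node["sources"].add(episode_id)  # node provenance
    for s, p, o in triples:
        graph["edges"].append({"triple": (s, p, o), "source": episode_id})

graph = {"nodes": {}, "edges": []}
add_extraction(graph, "ep-001",
               ["jina-reader", "cloudflare"],
               [("jina-reader", "bypasses", "cloudflare")])
```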

4  Promotion

Metric-driven promotion decides which episodes graduate from working memory into long-term knowledge. High performers are promoted quickly; low performers are held back for further evidence before committing to the graph.

5  Briefing Generation

Generates pre-execution briefings for agents by querying the knowledge graph for relevant context. Briefings include known strategies, common pitfalls, and domain-specific knowledge tailored to the upcoming task.
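A briefing query can be as simple as selecting graph facts whose entities overlap the upcoming task; this keyword-overlap heuristic is an assumption for illustration:

```python
# Hypothetical briefing builder: keep triples whose subject or object
# matches a keyword from the upcoming task.
def build_briefing(edges: list[dict], task_keywords: set[str]) -> list[str]:
    lines = []
    for edge in edges:
        s, p, o = edge["triple"]
        if {s, o} & task_keywords:
            lines.append(f"{s} {p} {o}")
    return lines

edges = [
    {"triple": ("jina-reader", "bypasses", "cloudflare")},
    {"triple": ("bulk-upload", "avoids", "rate-limiting")},
]
briefing = build_briefing(edges, {"cloudflare", "scraping"})
```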

6  Data Pruning

Removes stale or low-confidence knowledge from the graph. Entities and triples that haven't been reinforced by recent episodes decay over time, keeping the knowledge base lean and current.
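Reinforcement-based decay could look like the following; the half-life and confidence floor are arbitrary assumptions, not documented values:

```python
# Sketch: confidence halves every `half_life` runs without
# reinforcement; items that fall below `floor` are pruned.
def decay_and_prune(items: list[dict], now_run: int,
                    half_life: int = 10, floor: float = 0.2) -> list[dict]:
    kept = []
    for item in items:
        age = now_run - item["last_reinforced_run"]
        confidence = item["confidence"] * 0.5 ** (age / half_life)
        if confidence >= floor:
            kept.append({**item, "confidence": confidence})
    return kept

items = [
    {"id": "t1", "confidence": 0.9, "last_reinforced_run": 0},
    {"id": "t2", "confidence": 0.25, "last_reinforced_run": 0},
]
kept = decay_and_prune(items, now_run=10)
```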

7  Trace Pruning

Archives or deletes raw trace files once their knowledge has been fully extracted and promoted. Prevents unbounded disk growth while preserving the distilled intelligence in the graph.
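The seven stages above could be orchestrated as a simple sequential pass over shared pipeline state; the stage functions below are stand-in no-ops that only mirror the stage names, not the real implementations:

```python
# Placeholder stage functions operating on a shared state dict.
def scan_and_distill(state): state["episodes"] = ["ep-001"]
def collect_metrics(state): state["metrics"] = {"ep-001": {"score": 0.9}}
def extract_knowledge(state): state["extractions"] = ["..."]
def promote(state): state["promoted"] = ["ep-001"]
def generate_briefings(state): state["briefings"] = ["..."]
def prune_data(state): pass
def prune_traces(state): state.pop("raw_traces", None)  # reclaim disk

STAGES = [scan_and_distill, collect_metrics, extract_knowledge,
          promote, generate_briefings, prune_data, prune_traces]

def run_pipeline(state: dict) -> dict:
    for stage in STAGES:  # each stage feeds the next
        stage(state)
    return state

state = run_pipeline({"raw_traces": ["t1"]})
```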

The Quality Loop

The pipeline creates a self-reinforcing feedback loop. Each component's output improves the input quality of the next, producing compounding returns over time.

Better traces
    -> Better episodes
        -> Better knowledge
            -> Better briefings
                -> Better performance
                    -> Better traces  // cycle repeats

As agents perform better with improved briefings, the traces they produce contain richer decision-making data, which in turn yields higher-quality episodes and knowledge extractions. This virtuous cycle means the system improves autonomously with each execution.

Extraction Evolution & Prompt Versioning

Extraction prompts evolve over time. Each version is tracked by its SHA-256 hash, enabling A/B comparisons and rollback when quality regresses.

SHA-256 Prompt Hashing

Every extraction prompt is hashed before use. The hash is stored alongside each extraction result, creating an immutable audit trail linking every piece of extracted knowledge back to the exact prompt that produced it.

# Prompt version tracking
import hashlib

prompt_hash = hashlib.sha256(
    "Extract entities and triples from...".encode()
).hexdigest()  # e.g. "a3f8c1d2..."

# Stored with every extraction
extraction_record = {
    "episode_id":  "ep-20260327-001",
    "prompt_hash": prompt_hash,
    "entities":    [...],
    "triples":     [...],
}

Quality Metrics

The extraction_metrics table tracks per-prompt-version performance, enabling data-driven prompt evolution.

| Metric | Description | Target |
|---|---|---|
| Entities per Episode | Average number of unique entities extracted from each episode summary | >= 5 |
| Triples per Episode | Average number of relationship triples (subject-predicate-object) per episode | >= 8 |
| Unique Entity Ratio | Proportion of extracted entities that are genuinely new vs. duplicates of existing graph nodes | >= 0.3 |
| Triple Confidence | Average confidence score assigned to extracted triples by the LLM | >= 0.7 |
| Extraction Latency | Wall-clock time for a single episode extraction pass | < 30s |
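Checking a prompt version against these targets might look like the sketch below; the metric keys and threshold encoding are illustrative, with the target values taken from the table above:

```python
# Targets from the quality-metrics table, as predicates.
TARGETS = {
    "entities_per_episode": lambda v: v >= 5,
    "triples_per_episode": lambda v: v >= 8,
    "unique_entity_ratio": lambda v: v >= 0.3,
    "triple_confidence": lambda v: v >= 0.7,
    "extraction_latency_s": lambda v: v < 30,
}

def failing_metrics(measured: dict) -> list[str]:
    """Return the names of metrics that miss their target."""
    return [name for name, ok in TARGETS.items() if not ok(measured[name])]

measured = {"entities_per_episode": 6.2, "triples_per_episode": 7.1,
            "unique_entity_ratio": 0.41, "triple_confidence": 0.82,
            "extraction_latency_s": 12.0}
failing = failing_metrics(measured)
```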

Metric-Driven Promotion

Not all episodes are created equal. The promotion engine uses quality scores to decide which episodes earn a place in long-term knowledge.

High Performers (score > 80%)
Promoted immediately to the knowledge graph. Their entities and triples are merged with high confidence, and the source episode is flagged as a reference exemplar.

Mid-Range (score 30-80%)
Held in a staging area. Knowledge is tentatively added with lower confidence weights and requires corroboration from additional episodes before full promotion.

Low Performers (score < 30%)
Held back entirely. Episodes with low quality scores may contain hallucinated or irrelevant extractions; they remain in working memory as candidates for re-extraction with improved prompts.

Promotion Velocity (adaptive)
The promotion threshold adapts to overall graph quality. As the graph matures, the bar rises to prevent dilution of high-quality knowledge with marginal additions.
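The tiers above reduce to a small decision function. The thresholds follow the listed scores, but the adaptive-bar formula here is a simplified assumption:

```python
# Sketch of the tiered promotion decision; the adaptive term raises
# the high bar as overall graph quality (0..1) grows.
def promotion_decision(score: float, graph_quality: float = 0.0) -> str:
    high_bar = 0.8 + 0.1 * graph_quality
    if score > high_bar:
        return "promote"  # merge with high confidence
    if score >= 0.3:
        return "stage"    # tentative, lower confidence weight
    return "hold"         # candidate for re-extraction
```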

Strategy Discovery

As the pipeline processes traces, it identifies recurring patterns of successful behavior and codifies them as named strategies. These strategies become first-class entities in the knowledge graph.

Discovered Strategy Examples

set-leverage-before-orders

When trading, always configure account leverage before placing any orders. Prevents rejection errors and ensures correct position sizing across different margin requirements.

use-jina-for-protected-sites

When scraping content from sites with bot protection (Cloudflare, rate limits), route requests through Jina reader API instead of direct fetch. Higher success rate and cleaner output.

batch-upload-over-50-leads

When importing more than 50 leads into a campaign, use the bulk upload API endpoint instead of individual create calls. Reduces API calls by 10-100x and avoids rate limiting.

Strategy Registry

Discovered strategies are tracked in a SQLite registry that records usage counts, success rates, and agent affinity. The registry enables data-driven strategy recommendations in briefings.

| Column | Type | Description |
|---|---|---|
| strategy_id | TEXT PK | Kebab-case identifier (e.g. set-leverage-before-orders) |
| description | TEXT | Human-readable summary of the strategy |
| usage_count | INTEGER | Number of times the strategy has been applied |
| success_rate | REAL | Ratio of successful outcomes when strategy was used |
| agent_id | TEXT | Agent that discovered or most frequently uses this strategy |
| discovered_at | DATETIME | Timestamp of first observation |
| last_used_at | DATETIME | Timestamp of most recent application |
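The schema above maps directly onto a SQLite table; this sketch uses an in-memory database for illustration, whereas the real registry would live in a file:

```python
import sqlite3

# In-memory stand-in for the on-disk strategy registry.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE strategies (
        strategy_id   TEXT PRIMARY KEY,
        description   TEXT,
        usage_count   INTEGER DEFAULT 0,
        success_rate  REAL,
        agent_id      TEXT,
        discovered_at DATETIME,
        last_used_at  DATETIME
    )
""")
conn.execute(
    "INSERT INTO strategies (strategy_id, description, usage_count, success_rate) "
    "VALUES (?, ?, ?, ?)",
    ("set-leverage-before-orders",
     "Configure leverage before placing orders", 42, 0.97),
)
row = conn.execute(
    "SELECT success_rate FROM strategies WHERE strategy_id = ?",
    ("set-leverage-before-orders",),
).fetchone()
```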

Future Directions

Automated Prompt Evolution
Use extraction quality metrics to automatically generate and test new prompt variants, selecting the best performers through A/B testing against the metrics table.
Vector Embeddings
Augment the knowledge graph with vector embeddings for semantic similarity search. Enable agents to find relevant knowledge even when exact entity names differ.
LLM Distiller
Replace rule-based trace distillation with an LLM-powered distiller that can extract nuanced reasoning chains and implicit decision patterns from raw execution logs.
Cross-Swarm Knowledge
Share strategies and knowledge across independent agent swarms. A trading agent's risk management strategies could inform a deployment agent's rollback decisions.