Building an Autonomous Code Evolution Pipeline with HyperAgents
The Problem
Large language models are powerful, but they are fundamentally static. You prompt them, they respond, and their behavior is locked at deployment time. If you want an LLM-based agent to get better at a task, you typically need to either fine-tune its weights (expensive, slow, data-hungry) or hand-engineer better prompts (brittle, non-scalable).
What if the agent could modify its own source code based on empirical feedback?
That's the premise behind HyperAgents — a framework developed at Meta that wraps an LLM-powered agent in an evolutionary meta-loop. The agent reads its own codebase, identifies weaknesses from evaluation reports, proposes code edits, and the system tests those edits against a benchmark. If the edits improve performance, they become the starting point for the next iteration. If they don't, the system falls back to a better-performing ancestor.
This post covers the architecture of the pipeline, the infrastructure challenges we encountered while running it, and the results from our first sustained evolution experiment.
Architecture
The Generate Loop
The core of HyperAgents is generate_loop.py, an iterative pipeline that manages the full lifecycle of code evolution:
setup_initial_gen → [for each generation]:
1. Select a parent (latest, best, or proportional to score)
2. Build/launch a Docker container with the HyperAgents codebase
3. Apply the parent's lineage patches (cumulative git diffs)
4. Run the meta-agent inside the container (6-hour wall-clock timeout)
5. The meta-agent produces model_patch.diff
6. Evaluate the modified agent on the benchmark
7. Save results, update the archive graph
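Step 1's parent selection can be sketched as follows. The archive shape and function name are assumptions for illustration; only the three strategies (latest, best, proportional to score) come from the pipeline description above.

```python
import random

def select_parent(archive, strategy="proportional"):
    """Pick the parent generation for the next mutation.

    archive: list of {"gen": int, "score": float} dicts -- a
    hypothetical shape; the real archive graph may differ.
    """
    if strategy == "latest":
        return archive[-1]
    if strategy == "best":
        return max(archive, key=lambda g: g["score"])
    # "proportional": sample with probability proportional to score,
    # so strong ancestors are favored but weak ones are not eliminated.
    weights = [max(g["score"], 1e-6) for g in archive]
    return random.choices(archive, weights=weights, k=1)[0]
```

Proportional selection keeps some exploration pressure: a low-scoring lineage can still be chosen, which matters when a single noisy eval undervalues a genuinely good edit.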
Each generation runs in an isolated Docker container. The meta-agent — itself an LLM with tool access (file viewer, string replacer, creator) — is given the full codebase, evaluation reports from the parent generation, and a budget of iterations in which to read files, diagnose issues, and propose fixes.
The Meta-Agent
The meta-agent (meta_agent.py) is the brain of the operation. It receives:
- Eval context: accuracy metrics, precision/recall per class, failed/passed sample IDs, and up to 3 truncated failed chat histories from the previous evaluation.
- Repo access: the full HyperAgents codebase via an editor tool supporting view, str_replace, create, insert, and undo_edit commands.
- Iteration budget: a counter (iterations_left) that limits how many tool calls the meta-agent can make.
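The budget mechanism can be sketched as a driver loop. Here llm_step and the tools mapping are hypothetical stand-ins; only the iterations_left countdown mirrors the description above.

```python
def run_meta_loop(llm_step, tools, iterations_left=50):
    """Drive the meta-agent until it declares itself done or the
    tool-call budget is spent.

    llm_step(history, iterations_left) -> None, or a (tool_name, kwargs)
    pair -- a hypothetical interface for illustration.
    """
    history = []
    while iterations_left > 0:
        call = llm_step(history, iterations_left)
        if call is None:  # the agent decided it is finished
            break
        name, kwargs = call
        result = tools[name](**kwargs)  # e.g. view, str_replace, create
        history.append((name, result))
        iterations_left -= 1
    return history
```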
The meta-agent's job is to figure out why the agent performed poorly and fix it. This requires a surprisingly sophisticated chain of reasoning: read the eval report → understand the task → find the relevant code (typically task_agent.py) → read the prompt → identify the failure mode → craft a precise str_replace edit → verify the edit was applied.
The Task Agent
The task agent (task_agent.py) is what actually gets evaluated. For our primary benchmark — paper review classification — it receives a paper's text and must classify it as "accept" or "reject." The meta-agent can modify the task agent's prompts, prediction extraction logic, and any other aspect of its behavior.
After the meta-agent finishes, run_meta_agent.py resets the domains/ directory to the base commit (preventing the meta-agent from cheating by modifying the evaluation harness), then the system evaluates the modified agent on the benchmark.
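A minimal sketch of that reset, assuming a plain git checkout of the protected paths back to the base commit; the signature and name follow the post, but the real run_meta_agent.py helper may differ.

```python
import subprocess

def reset_paths_to_commit(repo_dir, base_commit, paths=("domains/",)):
    """Revert protected paths to the base commit so the meta-agent
    cannot game the benchmark by editing the evaluation harness."""
    subprocess.run(
        ["git", "checkout", base_commit, "--", *paths],
        cwd=repo_dir,
        check=True,
    )
```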
Infrastructure Challenges
Running this pipeline on a self-hosted server introduced several non-trivial infrastructure problems.
Docker Permission Isolation
Containers run as root, but the host user doesn't have root privileges. This creates a fundamental ownership conflict: when Docker containers write files (evaluation outputs, __pycache__/ directories), those files are owned by root on the host. The next generation can't clean them up with shutil.rmtree().
Our fix: a Docker-based cleanup that mounts the parent directory and runs rm -rf inside a container as root, bypassing the host permission barrier entirely.
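The workaround can be sketched like this; the helper names are ours and the alpine image is an assumption (any image with rm works).

```python
import os
import subprocess

def cleanup_cmd(path):
    """Build a docker command that deletes a root-owned directory by
    mounting its parent into a throwaway container and running rm -rf
    as root, bypassing host permissions."""
    parent, name = os.path.split(os.path.abspath(path))
    return ["docker", "run", "--rm",
            "-v", f"{parent}:/work",
            "alpine", "rm", "-rf", f"/work/{name}"]

def docker_rm_rf(path):
    subprocess.run(cleanup_cmd(path), check=True)
```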
Git Dubious Ownership
Git 2.35.2+ refuses to operate on repositories where the .git directory is owned by a different user than the current process. Since the host copies the repo (owned by user janus) into containers (running as root), every git operation inside the container would fail.
Fix: git config --global --add safe.directory '*' injected immediately after every container start.
__pycache__ Contamination
Python's bytecode cache directories were being included in model_patch.diff files via git diff, producing diffs full of binary garbage instead of meaningful code changes. This made it appear that the meta-agent had produced a "57-line diff" when in reality it was just permission changes on .pyc files.
Fix: inject __pycache__/ into .gitignore during setup_initial_gen(), and ensure the repository copy includes a proper .gitignore before the initial commit.
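A sketch of the injection step; the __pycache__/ entry is from the post, while the *.pyc entry and function name are illustrative additions.

```python
from pathlib import Path

def ensure_gitignore(repo_dir, entries=("__pycache__/", "*.pyc")):
    """Append any missing ignore entries before the initial commit so
    bytecode caches never leak into model_patch.diff. Idempotent:
    existing entries are not duplicated."""
    gi = Path(repo_dir, ".gitignore")
    existing = gi.read_text().splitlines() if gi.exists() else []
    missing = [e for e in entries if e not in existing]
    if missing:
        gi.write_text("\n".join(existing + missing) + "\n")
```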
Meta-Agent Context Bloat
The initial evaluation reports contained malformed prediction labels — full paper review texts used as dictionary keys — causing the _load_eval_context() function to inject ~350KB of noise into the meta-agent's instruction. This consumed most of the context window, leaving little room for actual code exploration.
Fix: sanitize report.json before injection, truncating the label_distribution.prediction field and summarizing failed predictions rather than dumping raw text.
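The sanitization can be sketched as a pre-pass over the loaded report. The field names follow the post (label_distribution.prediction, failed predictions); the truncation limits are assumptions.

```python
def sanitize_report(report, max_label_len=40, max_failed=3):
    """Trim pathological fields from report.json before it is injected
    into the meta-agent's instruction."""
    out = dict(report)
    # Truncate prediction labels: malformed reports used entire review
    # texts as dictionary keys, inflating the context by hundreds of KB.
    dist = out.get("label_distribution", {}).get("prediction", {})
    out.setdefault("label_distribution", {})["prediction"] = {
        str(k)[:max_label_len]: v for k, v in dist.items()
    }
    # Summarize failed predictions instead of dumping raw chat text.
    failed = out.get("failed_predictions", [])
    out["failed_predictions"] = [
        {"id": f.get("id"), "summary": str(f)[:200]}
        for f in failed[:max_failed]
    ]
    return out
```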
Iteration Starvation
The original code calculated iterations_left = max_generation - current_genid. With max_generation=3, gen 1 got 2 iterations, gen 2 got 1, and gen 3 got zero. The meta-agent needs 5+ iterations just to read the relevant source files — with 0 iterations, it produces nothing.
Fix: decouple iteration budget from generation count with a --meta_iterations flag (we use 50), while keeping the 6-hour wall-clock timeout as the actual hard limit.
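The decoupling can be sketched with argparse. The --meta_iterations flag and its default of 50 come from the post; the --max_generation default and help text are illustrative.

```python
import argparse

def build_parser():
    """CLI flags for the generate loop, with the meta-agent's tool-call
    budget decoupled from the generation count."""
    p = argparse.ArgumentParser()
    p.add_argument("--max_generation", type=int, default=10,
                   help="number of generations to evolve")
    p.add_argument("--meta_iterations", type=int, default=50,
                   help="tool-call budget per generation (previously "
                        "computed as max_generation - current_genid)")
    return p
```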
Results: janus_test_008
Our most recent run (janus_test_008) ran 10 generations over ~2 hours with the following configuration:
- Domain: paper_review (binary accept/reject classification)
- max_generation: 10
- meta_iterations: 50
- eval_samples: 50 (train and val)
- Parent selection: proportional to child scores
Generation Summary
| Gen | Train | Val | Diff Type |
|---|---|---|---|
| initial | 10% (1/10) | 90% (9/10) | baseline |
| 1 | 50% (25/50) | 70% (35/50) | 57 code lines |
| 2–3 | — | — | empty |
| 4 | 54% (27/50) | 58% (29/50) | perm-only |
| 5–6 | — | — | empty |
| 7 | 60% (30/50) | 60% (30/50) | perm-only |
| 8 | — | — | empty |
| 9 | 60% (30/50) | 62% (31/50) | perm-only |
| 10 | — | — | empty |
What Happened in Gen 1
Generation 1 produced the only meaningful code change across the entire run: a 57-line diff to task_agent.py consisting of:
- Prompt simplification: Replaced a verbose, multi-paragraph peer-review instruction with a concise 4-line directive listing explicit accept/reject criteria.
- 3-tier prediction extraction: Added a fallback chain: (a) try the JSON response key, (b) search all JSON values for "accept"/"reject", (c) regex word-boundary search on the raw LLM output, using the last mention of accept or reject as the prediction.
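That fallback chain can be sketched as follows. This is an illustration of the three tiers, not the actual task_agent.py code.

```python
import json
import re

def extract_prediction(raw):
    """Return "accept", "reject", or None, trying progressively looser
    extraction strategies on the raw LLM output."""
    # Tier 1: well-formed JSON with a "response" key.
    try:
        obj = json.loads(raw)
        val = str(obj.get("response", "")).lower()
        if val in ("accept", "reject"):
            return val
    except (json.JSONDecodeError, AttributeError):
        obj = None
    # Tier 2: scan every JSON value for either label.
    if isinstance(obj, dict):
        for v in obj.values():
            m = re.search(r"\b(accept|reject)\b", str(v).lower())
            if m:
                return m.group(1)
    # Tier 3: last word-boundary mention anywhere in the raw text.
    hits = re.findall(r"\b(accept|reject)\b", raw.lower())
    return hits[-1] if hits else None
```

Taking the last mention in tier 3 reflects the heuristic described above: LLMs tend to state their final verdict at the end of a response, after weighing both options.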
This single edit replaced an unreliable 90% val score (measured on only 10 samples, essentially noise) with a meaningful 70% (measured on 50 samples).
Why Subsequent Generations Failed
The meta-agent consistently produced empty diffs in generations 2–10, despite chat histories growing to 4MB (indicating extensive exploration). Two failure modes:
- Editing the wrong files: The meta-agent would modify files in domains/, meta_agent.py, or other infrastructure code. These changes get reverted by reset_paths_to_commit(), which resets the domains/ directory to the base commit after the meta-agent runs. Only changes to task_agent.py survive.
- str_replace matching failures: The meta-agent's old_str parameter must match the file content exactly (after expandtabs()). Even a minor whitespace or escaping difference causes the edit to be rejected. The meta-agent receives an error message but often fails to recover, burning its remaining iterations on retry attempts.
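The strict matching behavior can be sketched like this (an approximation of the editor tool's semantics, not its actual code):

```python
def str_replace(text, old_str, new_str):
    """Exact-match edit: old_str must occur exactly once in the
    tab-expanded text, or the edit is rejected with an error."""
    text = text.expandtabs()
    old_str = old_str.expandtabs()
    count = text.count(old_str)
    if count == 0:
        raise ValueError("old_str not found; no edit applied")
    if count > 1:
        raise ValueError("old_str is not unique; no edit applied")
    return text.replace(old_str, new_str)
```

A single extra space in old_str is enough to hit the "not found" branch, which is exactly the failure mode the meta-agent kept burning iterations on.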
Accuracy Stagnation
After gen 1's improvement, accuracy plateaued. Train accuracy drifted upward (50% → 60%) while val accuracy hovered between 58% and 62%, never recovering gen 1's 70%. The only surviving changes in those generations were permission-only diffs (file mode changes), suggesting the train-side gains reflect the LLM's stochastic output rather than improvements that generalize.
The 70% val accuracy represents a model that is biased toward rejection (predicting "reject" ~70% of the time) in a dataset that is ~60% reject. The meta-agent has not yet found an edit that corrects this residual bias.
Next Steps
Several concrete improvements could push the pipeline past the 70% ceiling:
- Constrain meta-agent scope: Explicitly instruct the meta-agent to modify only task_agent.py, reducing wasted iterations on files that will be reverted.
- Fuzzy str_replace matching: Replace exact string matching with a more tolerant approach (e.g., normalized whitespace, fuzzy substring matching) to reduce edit failures.
- Larger eval samples: 50 samples introduce significant variance. Moving to 200+ would give more reliable accuracy estimates and reduce the chance of a regression appearing as an improvement.
- Multiple domains: Running the pipeline on the sudoku domain (which uses a TypeScript bridge to our cognitive architecture, The Demon) would test whether the meta-agent can evolve non-prompt code.
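The fuzzy-matching idea can be sketched as a pre-pass that recovers the exact span in the file before a strict replace is applied. This is one possible scheme with assumed parameters, not code from the HyperAgents repo.

```python
import difflib
import re

def fuzzy_find(text, old_str, cutoff=0.9):
    """Locate the exact span in `text` that a slightly-off old_str was
    meant to match, by comparing whitespace-normalized candidates with
    a similarity ratio. Returns the verbatim span, or None."""
    def norm(s):
        return re.sub(r"\s+", " ", s).strip()

    target = norm(old_str)
    lines = text.splitlines()
    n = max(len(old_str.splitlines()), 1)
    best, best_ratio = None, cutoff
    # Slide a window of the same line count as old_str over the file.
    for i in range(len(lines) - n + 1):
        cand = "\n".join(lines[i:i + n])
        ratio = difflib.SequenceMatcher(None, norm(cand), target).ratio()
        if ratio > best_ratio:
            best, best_ratio = cand, ratio
    return best  # exact text to feed back into a strict str_replace
```

The recovered span can then be handed to the existing exact-match tool, keeping the strict uniqueness guarantee while tolerating the whitespace drift that currently causes most edit failures.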
Conclusion
HyperAgents is a fascinating framework that treats AI agent improvement as a code evolution problem rather than a weight optimization problem. The infrastructure challenges are real but solvable, and the pipeline is now stable enough to run multi-hour experiments without crashes.
The key insight from our experiments: the bottleneck isn't the LLM's ability to understand code or diagnose problems. It's the edit execution layer — the meta-agent consistently identifies the right file and proposes the right class of fix, but fails to land the edit due to string-matching fragility. Improving that single component could unlock dramatically better evolutionary trajectories.
This is the first in a series of technical posts from Arachnida Apps documenting our work on autonomous AI evolution. Follow along at laplace.arachnida-apps.com.