Building an Autonomous Code Evolution Pipeline with HyperAgents

hyperagents · meta-learning · code-evolution · agi

The Problem

Large language models are powerful, but they are fundamentally static. You prompt them, they respond, and their behavior is locked at deployment time. If you want an LLM-based agent to get better at a task, you typically need to either fine-tune its weights (expensive, slow, data-hungry) or hand-engineer better prompts (brittle, non-scalable).

What if the agent could modify its own source code based on empirical feedback?

That's the premise behind HyperAgents — a framework developed at Meta that wraps an LLM-powered agent in an evolutionary meta-loop. The agent reads its own codebase, identifies weaknesses from evaluation reports, proposes code edits, and the system tests those edits against a benchmark. If the edits improve performance, they become the starting point for the next iteration. If they don't, the system falls back to a better-performing ancestor.

This post covers the architecture of the pipeline, the infrastructure challenges we encountered while running it, and the results from our first sustained evolution experiment.

Architecture

The Generate Loop

The core of HyperAgents is generate_loop.py, an iterative pipeline that manages the full lifecycle of code evolution:

setup_initial_gen → [for each generation]:
  1. Select a parent (latest, best, or proportional to score)
  2. Build/launch a Docker container with the HyperAgents codebase
  3. Apply the parent's lineage patches (cumulative git diffs)
  4. Run the meta-agent inside the container (6-hour wall-clock timeout)
  5. The meta-agent produces model_patch.diff
  6. Evaluate the modified agent on the benchmark
  7. Save results, update the archive graph
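Parent selection (step 1) is the only step with a real algorithmic choice. A minimal sketch of the three strategies, assuming the archive is a flat list of `{gen_id, score}` records (the real archive is a graph with richer metadata):

```python
import random

def select_parent(archive, strategy="proportional", rng=random):
    """Pick the parent for the next generation.

    `archive` is a list of dicts with 'gen_id' and 'score' keys; the three
    strategies mirror the options described above (latest / best / proportional).
    """
    if strategy == "latest":
        return max(archive, key=lambda g: g["gen_id"])
    if strategy == "best":
        return max(archive, key=lambda g: g["score"])
    if strategy == "proportional":
        # Score-weighted sampling: better children are likelier parents,
        # but weaker lineages still get occasional chances.
        weights = [g["score"] for g in archive]
        return rng.choices(archive, weights=weights, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")

archive = [
    {"gen_id": 0, "score": 0.10},
    {"gen_id": 1, "score": 0.70},
    {"gen_id": 2, "score": 0.58},
]
print(select_parent(archive, "best")["gen_id"])  # gen 1 holds the top score
```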

Each generation runs in an isolated Docker container. The meta-agent — itself an LLM with tool access (file viewer, string replacer, creator) — is given the full codebase, evaluation reports from the parent generation, and a budget of iterations in which to read files, diagnose issues, and propose fixes.

The Meta-Agent

The meta-agent (meta_agent.py) is the brain of the operation. It receives:

  • Eval context: accuracy metrics, precision/recall per class, failed/passed sample IDs, and up to 3 truncated failed chat histories from the previous evaluation.
  • Repo access: the full HyperAgents codebase via an editor tool supporting view, str_replace, create, insert, and undo_edit commands.
  • Iteration budget: a counter (iterations_left) that limits how many tool calls the meta-agent can make.
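A bounded tool loop of this shape is straightforward to sketch; `llm_step` and `execute_tool` below are hypothetical stand-ins for the real LLM and editor plumbing, not the actual HyperAgents API:

```python
def run_tool_loop(llm_step, execute_tool, iterations_left=50):
    """Bounded tool loop: each tool call spends one unit of the budget.

    `llm_step` returns either a tool-call dict or None when the agent is
    done; `execute_tool` runs the call and returns an observation string.
    """
    history = []
    while iterations_left > 0:
        call = llm_step(history, iterations_left)
        if call is None:                  # agent decided it is finished
            break
        observation = execute_tool(call)  # e.g. file contents, edit result
        history.append((call, observation))
        iterations_left -= 1
    return history

# Toy run: an "agent" that views two files, then stops.
script = iter([{"cmd": "view", "path": "task_agent.py"},
               {"cmd": "view", "path": "meta_agent.py"},
               None])
history = run_tool_loop(lambda h, n: next(script),
                        lambda call: f"contents of {call['path']}")
print(len(history))  # 2
```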

The meta-agent's job is to figure out why the agent performed poorly and fix it. This requires a surprisingly sophisticated chain of reasoning: read the eval report → understand the task → find the relevant code (typically task_agent.py) → read the prompt → identify the failure mode → craft a precise str_replace edit → verify the edit was applied.

The Task Agent

The task agent (task_agent.py) is what actually gets evaluated. For our primary benchmark — paper review classification — it receives a paper's text and must classify it as "accept" or "reject." The meta-agent can modify the task agent's prompts, prediction extraction logic, and any other aspect of its behavior.

After the meta-agent finishes, run_meta_agent.py resets the domains/ directory to the base commit (preventing the meta-agent from cheating by modifying the evaluation harness), then the system evaluates the modified agent on the benchmark.
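One way to implement that reset with plain git, similar in spirit to what reset_paths_to_commit does (the actual implementation may differ):

```python
import subprocess

def reset_paths_to_commit(repo_dir, commit, protected_paths=("domains/",)):
    """Discard meta-agent edits under protected paths by restoring them
    from the base commit. `git clean` also removes untracked files there,
    so a freshly created file in the eval harness cannot survive either."""
    for path in protected_paths:
        subprocess.run(["git", "checkout", commit, "--", path],
                       cwd=repo_dir, check=True)
        subprocess.run(["git", "clean", "-fd", "--", path],
                       cwd=repo_dir, check=True)
```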

Infrastructure Challenges

Running this pipeline on a self-hosted server introduced several non-trivial infrastructure problems.

Docker Permission Isolation

Containers run as root, but the host user doesn't have root privileges. This creates a fundamental ownership conflict: when Docker containers write files (evaluation outputs, __pycache__/ directories), those files are owned by root on the host. The next generation can't clean them up with shutil.rmtree().

Our fix: a Docker-based cleanup that mounts the parent directory and runs rm -rf inside a container as root, bypassing the host permission barrier entirely.
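The cleanup amounts to a single `docker run`; a sketch assuming a local `alpine` image (the real pipeline may use a different image):

```python
import subprocess

def cleanup_cmd(host_dir, subpath, image="alpine"):
    """Build the docker invocation: bind-mount the parent dir at /work and
    let the container's root user delete the root-owned target."""
    return ["docker", "run", "--rm",
            "-v", f"{host_dir}:/work",
            image, "rm", "-rf", f"/work/{subpath}"]

def docker_rm_rf(host_dir, subpath):
    subprocess.run(cleanup_cmd(host_dir, subpath), check=True)

print(" ".join(cleanup_cmd("/srv/runs/gen_4", "__pycache__")))
```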

Git Dubious Ownership

Git 2.35.2+ refuses to operate on repositories where the .git directory is owned by a different user than the current process. Since the host copies the repo (owned by user janus) into containers (running as root), every git operation inside the container would fail.

Fix: git config --global --add safe.directory '*' injected immediately after every container start.

__pycache__ Contamination

Python's bytecode cache directories were being included in model_patch.diff files via git diff, producing diffs full of binary garbage instead of meaningful code changes. This made it appear that the meta-agent had produced a "57-line diff" when in reality it was just permission changes on .pyc files.

Fix: inject __pycache__/ into .gitignore during setup_initial_gen(), and ensure the repository copy includes a proper .gitignore before the initial commit.
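An idempotent version of that injection might look like the following (the exact pattern list is an assumption):

```python
from pathlib import Path

def ensure_gitignore(repo_dir, patterns=("__pycache__/", "*.pyc")):
    """Append missing ignore patterns before the initial commit so bytecode
    caches never reach git diff output. Idempotent: existing lines are kept
    and already-present patterns are not duplicated."""
    gi = Path(repo_dir) / ".gitignore"
    existing = gi.read_text().splitlines() if gi.exists() else []
    missing = [p for p in patterns if p not in existing]
    if missing:
        gi.write_text("\n".join(existing + missing) + "\n")
    return missing  # what was actually added, useful for logging
```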

Meta-Agent Context Bloat

The initial evaluation reports contained malformed prediction labels — full paper review texts used as dictionary keys — causing the _load_eval_context() function to inject ~350KB of noise into the meta-agent's instruction. This consumed most of the context window, leaving little room for actual code exploration.

Fix: sanitize report.json before injection, truncating the label_distribution.prediction field and summarizing failed predictions rather than dumping raw text.
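A sketch of that sanitizer, with the report schema partly assumed from the field names mentioned above:

```python
def sanitize_report(report, max_label_len=40, max_failed=3):
    """Shrink an eval report before injecting it into the meta-agent prompt.

    Truncates oversized keys in label_distribution.prediction (malformed
    reports used entire review texts as labels) and keeps only IDs plus a
    short excerpt per failed sample. The exact schema is an assumption.
    """
    clean = {"accuracy": report.get("accuracy")}
    preds = report.get("label_distribution", {}).get("prediction", {})
    clean["label_distribution"] = {
        "prediction": {k[:max_label_len]: v for k, v in preds.items()}
    }
    clean["failed"] = [
        {"id": f["id"], "excerpt": str(f.get("output", ""))[:200]}
        for f in report.get("failed", [])[:max_failed]
    ]
    return clean
```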

Iteration Starvation

The original code calculated iterations_left = max_generation - current_genid. With max_generation=3, gen 1 got 2 iterations, gen 2 got 1, and gen 3 got zero. The meta-agent needs 5+ iterations just to read the relevant source files — with 0 iterations, it produces nothing.

Fix: decouple iteration budget from generation count with a --meta_iterations flag (we use 50), while keeping the 6-hour wall-clock timeout as the actual hard limit.
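The decoupling is a one-flag change; sketched with argparse (flag names from the post, defaults assumed):

```python
import argparse

def parse_args(argv=None):
    """Budget is an explicit flag instead of max_generation - current_genid,
    so late generations are no longer starved of tool calls."""
    p = argparse.ArgumentParser()
    p.add_argument("--max_generation", type=int, default=10)
    p.add_argument("--meta_iterations", type=int, default=50,
                   help="tool-call budget per generation "
                        "(the 6-hour wall clock remains the hard limit)")
    return p.parse_args(argv)

args = parse_args(["--meta_iterations", "50"])
budget = args.meta_iterations  # constant per generation, not a countdown
```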

Results: janus_test_008

Our most recent run (janus_test_008) ran 10 generations over ~2 hours with the following configuration:

  • Domain: paper_review (binary accept/reject classification)
  • max_generation: 10
  • meta_iterations: 50
  • eval_samples: 50 (train and val)
  • Parent selection: proportional to child scores

Generation Summary

Gen      Train        Val          Diff Type
initial  10% (1/10)   90% (9/10)   baseline
1        50% (25/50)  70% (35/50)  57 code lines
2–3      -            -            empty
4        54% (27/50)  58% (29/50)  perm-only
5–6      -            -            empty
7        60% (30/50)  60% (30/50)  perm-only
8        -            -            empty
9        60% (30/50)  62% (31/50)  perm-only
10       -            -            empty

What Happened in Gen 1

Generation 1 produced the only meaningful code change across the entire run: a 57-line diff to task_agent.py consisting of:

  1. Prompt simplification: Replaced a verbose, multi-paragraph peer-review instruction with a concise 4-line directive listing explicit accept/reject criteria.
  2. 3-tier prediction extraction: Added a fallback chain — (a) try JSON response key, (b) search all JSON values for "accept"/"reject", (c) regex word-boundary search on raw LLM output, using the last mention of accept or reject as the prediction.

This single edit moved val accuracy from 90% (on 10 samples — essentially random) to 70% (on 50 samples — a meaningful signal).
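Reconstructed from the description above (not the actual diff), the fallback chain might look like this; the JSON key name `decision` is an assumption:

```python
import json
import re

def extract_prediction(raw, key="decision"):
    """3-tier fallback: JSON key -> scan JSON values -> regex on raw text.
    Uses the LAST mention of accept/reject, per the gen-1 edit."""
    # Tier 1: well-formed JSON with the expected key.
    try:
        data = json.loads(raw)
        val = str(data.get(key, "")).lower()
        if val in ("accept", "reject"):
            return val
        # Tier 2: any JSON value that mentions accept/reject.
        for v in data.values():
            hits = re.findall(r"\b(accept|reject)\b", str(v).lower())
            if hits:
                return hits[-1]
    except (json.JSONDecodeError, AttributeError):
        pass
    # Tier 3: word-boundary search over the raw LLM output.
    hits = re.findall(r"\b(accept|reject)\b", raw.lower())
    return hits[-1] if hits else None

print(extract_prediction("I lean toward reject, but... Final: accept"))  # accept
```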

Why Subsequent Generations Failed

The meta-agent consistently produced empty diffs in generations 2–10, despite chat histories growing to 4MB (indicating extensive exploration). Two failure modes:

  1. Editing the wrong files: The meta-agent would modify files in domains/, meta_agent.py, or other infrastructure code. These changes get reverted by reset_paths_to_commit(), which resets the domains/ directory to the base commit after the meta-agent runs. Only changes to task_agent.py survive.

  2. str_replace matching failures: The meta-agent's old_str parameter must match the file content exactly (after expandtabs()). Even minor whitespace or escaping differences cause the edit to fail silently. The meta-agent receives an error message but often fails to recover, burning its remaining iterations on retry attempts.
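The second failure mode is easy to reproduce with a toy version of the matching rule (a sketch of the behavior described above, not the editor tool's actual code):

```python
def str_replace(file_text, old_str, new_str):
    """Mimic the editor tool's rule: both sides are expandtabs()'d and
    old_str must occur exactly once; otherwise the edit is rejected."""
    haystack = file_text.expandtabs()
    needle = old_str.expandtabs()
    if haystack.count(needle) != 1:
        return None  # the meta-agent gets an error message instead
    return haystack.replace(needle, new_str)

source = 'PROMPT = """Classify the paper:\n    accept or reject."""\n'
# A tab where the file uses four spaces: expandtabs() turns it into eight
# spaces, so the needle never occurs and the edit fails.
assert str_replace(source, "\taccept or reject.", "x") is None
# Only a byte-exact match (post-expandtabs) lands.
assert "ACCEPT" in str_replace(source, "    accept or reject.",
                               "ACCEPT or REJECT.")
```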

Accuracy Stagnation

After gen 1's improvement, accuracy plateaued. Train accuracy drifted upward (50% → 60%) while val accuracy never exceeded gen 1's 70%, suggesting that the meta-agent's occasional permission-only changes (which survive as file mode diffs) interact with the LLM's stochastic output in ways that help on train but don't generalize.

The 70% val accuracy represents a model that is biased toward rejection (predicting "reject" ~70% of the time) in a dataset that is ~60% reject. The meta-agent has not yet found an edit that corrects this residual bias.

Next Steps

Several concrete improvements could push the pipeline past the 70% ceiling:

  1. Constrain meta-agent scope: Explicitly instruct the meta-agent to only modify task_agent.py, reducing wasted iterations on files that will be reverted.
  2. Fuzzy str_replace matching: Replace exact string matching with a more tolerant approach (e.g., normalized whitespace, fuzzy substring matching) to reduce edit failures.
  3. Larger eval samples: A 50-sample eval introduces significant variance. Moving to 200+ samples would give more reliable accuracy estimates and reduce the chance of a regression appearing as an improvement.
  4. Multiple domains: Running the pipeline on the sudoku domain (which uses a TypeScript bridge to our cognitive architecture, The Demon) would test whether the meta-agent can evolve non-prompt code.
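One possible shape for the whitespace-tolerant matcher in item 2, as a sketch rather than a committed design: match each line of old_str with flexible indentation and collapsed internal whitespace.

```python
import re

def fuzzy_find(file_text, old_str):
    """Locate old_str while tolerating whitespace drift: each line may
    carry any indentation, and internal runs of whitespace are matched
    loosely instead of byte-for-byte."""
    parts = []
    for line in old_str.strip().splitlines():
        tokens = [re.escape(t) for t in line.split()]
        # Any indentation, then the line's tokens joined by any whitespace.
        parts.append(r"[ \t]*" + (r"\s+".join(tokens) if tokens else ""))
    match = re.search(r"\n".join(parts), file_text)
    return match.span() if match else None

src = "def f():\n\treturn  1\n"
print(fuzzy_find(src, "def f():\n    return 1"))  # matches despite tab vs spaces
```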

Conclusion

HyperAgents is a fascinating framework that treats AI agent improvement as a code evolution problem rather than a weight optimization problem. The infrastructure challenges are real but solvable, and the pipeline is now stable enough to run multi-hour experiments without crashes.

The key insight from our experiments: the bottleneck isn't the LLM's ability to understand code or diagnose problems. It's the edit execution layer — the meta-agent consistently identifies the right file and proposes the right class of fix, but fails to land the edit due to string-matching fragility. Improving that single component could unlock dramatically better evolutionary trajectories.


This is the first in a series of technical posts from Arachnida Apps documenting our work on autonomous AI evolution. Follow along at laplace.arachnida-apps.com.