First Contact: When an LLM Rewrote Its Own Prompt and Got Smarter

hyperagents · meta-learning · breakthrough · code-evolution

The Setup

We had been debugging the HyperAgents code evolution pipeline for hours. Previous runs (janus_test_004, 005) had completed without crashes but produced only garbage diffs — file permission changes and __pycache__ bytecode binaries masquerading as code improvements. The meta-agent was exploring the codebase but never landing a successful edit.

janus_test_006 was configured with several improvements over its predecessors:

  • meta_iterations=10 — decoupled the meta-agent's tool-call budget from the generation count, giving it enough iterations to actually read files and propose changes.
  • eval_samples=50 — up from 10, providing more stable accuracy measurements.
  • __pycache__/ in .gitignore — preventing bytecode contamination in diffs.
  • Node.js in the Docker image — future-proofing for the sudoku domain.

The thesis was simple: if the meta-agent could successfully execute even one str_replace edit to task_agent.py, we would see a measurable accuracy change. If not, the pipeline was fundamentally broken at the edit-execution layer.

What Happened

Generation 1: The First Real Diff

The meta-agent completed its run and produced a model_patch.diff — but this time, it was different. Instead of permission changes or binary garbage, the diff contained actual code modifications to task_agent.py:

  1. Prompt simplification: The original paper review prompt was a verbose, multi-paragraph instruction set describing the peer review process in detail. The meta-agent replaced it with a concise directive listing explicit, unambiguous accept/reject criteria. The original prompt's complexity was causing the LLM to generate rambling responses that the JSON parser couldn't extract labels from.

  2. Regex fallback for prediction extraction: The original code relied solely on extract_jsons() from common.py to parse the LLM's response and find an "accept" or "reject" label. If the JSON parsing failed (which it frequently did — the LLM would return malformed JSON, or JSON with unexpected keys, or no JSON at all), the agent defaulted to predicting "None." The meta-agent added a three-tier fallback chain:

    • First, try to extract a response key from parsed JSON.
    • Second, search all JSON values for the strings "accept" or "reject."
    • Third, use a regex word-boundary search on the raw LLM output, taking the last mention of either word as the prediction.

This is a deceptively sophisticated edit. The meta-agent didn't just tweak a constant or add a log line — it recognized that the failure mode was extraction, not reasoning. The LLM was already producing accept/reject judgments in its output; the code just wasn't finding them reliably.
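
The diff itself was lost in the crash (see below), but the three-tier fallback chain described above can be sketched roughly as follows. Function and key names here are illustrative, not the actual patch; `parsed_jsons` stands in for whatever `extract_jsons()` from common.py recovered:

```python
import re

def extract_prediction(raw_output, parsed_jsons):
    """Three-tier fallback: JSON key -> JSON values -> regex on raw text.

    raw_output:   the full LLM response string
    parsed_jsons: list of dicts recovered by JSON extraction (possibly empty)
    """
    # Tier 1: look for an explicit response key in any parsed JSON object.
    for obj in parsed_jsons:
        value = str(obj.get("response", "")).lower()
        if value in ("accept", "reject"):
            return value

    # Tier 2: search every JSON value for the target strings.
    for obj in parsed_jsons:
        for value in obj.values():
            text = str(value).lower()
            if "accept" in text:
                return "accept"
            if "reject" in text:
                return "reject"

    # Tier 3: word-boundary regex over the raw LLM output,
    # taking the LAST mention as the prediction.
    matches = re.findall(r"\b(accept|reject)\b", raw_output.lower())
    return matches[-1] if matches else None
```

The key property is graceful degradation: each tier only runs if the stricter one above it found nothing, so well-formed JSON still takes precedence.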

The Numbers

The results were immediately visible:

Metric           Before (gen_initial)   After (gen_1)
Train accuracy   10% (1/10)             92% (46/50)
Val accuracy     90% (9/10)*            84% (42/50)

* The initial eval used only 10 samples, making that 90% unreliable. The gen_1 eval used 50 samples on each split.

92% on 50 training samples and 84% on 50 validation samples — that's not noise. The meta-agent had identified a real bottleneck (fragile JSON extraction) and implemented a working fix (multi-tier fallback).

The Crash

Then everything fell apart.

When generate_loop.py moved to set up generation 2, it called setup_initial_gen(), which attempts to clean up the previous generation's directory with shutil.rmtree(). The problem: Docker containers run as root. Every file they create — evaluation outputs, __pycache__ directories, agent output logs — is owned by root:root on the host. The host process running generate_loop.py runs as user janus (UID 1000). shutil.rmtree() doesn't have permission to delete root-owned directories.

PermissionError: [Errno 13] Permission denied: 'outputs/generate_janus_test_006/gen_1/paper_review_eval/__pycache__'

The process crashed. The gen_1 output directory — including the model_patch.diff that proved the pipeline could work — was lost (corrupted by the partial cleanup).

What We Learned

The Meta-Agent Can Reason About Code

This was the most important finding from the entire janus_test series. The meta-agent wasn't randomly mutating code and hoping for improvements. It:

  1. Read the evaluation report and identified low accuracy.
  2. Read task_agent.py and understood the prompt structure.
  3. Diagnosed the failure mode as an extraction problem, not a reasoning problem.
  4. Implemented a specific, targeted fix with a graceful degradation chain.

The LLM (GLM 5 Turbo, via our self-hosted LiteLLM proxy) demonstrated genuine program comprehension — it understood that the relationship between the prompt, the LLM's response format, and the extraction logic was the bottleneck, and it modified the extraction logic to be more robust.

The Pipeline Has Two Bottlenecks

janus_test_006 exposed the two fundamental bottlenecks of the HyperAgents pipeline:

  1. Edit execution (solved for gen_1): The str_replace tool requires exact string matching. The meta-agent must reproduce the target code verbatim — including whitespace, indentation, and escape characters. This is harder than it sounds, especially for multi-line replacements. Gen 1 succeeded here; later generations in subsequent runs (janus_test_008) consistently failed.

  2. Infrastructure fragility (not solved until janus_test_007+): Docker root ownership, git dubious ownership, __pycache__ contamination, and meta-agent context bloat are all infrastructure problems that have nothing to do with the quality of the meta-agent's reasoning. Each one is individually fixable, but collectively they create a high failure rate for multi-generation runs.
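
The exact-match requirement in point 1 is easy to underestimate. A minimal sketch of a str_replace-style edit (illustrative only, not the actual tool implementation) shows how a single whitespace difference is enough to make the edit fail:

```python
def str_replace_edit(source, old, new):
    """Apply an exact-match replacement; fail loudly unless `old` appears verbatim, once."""
    if source.count(old) != 1:
        raise ValueError("old string must appear exactly once, verbatim")
    return source.replace(old, new)

code = "def predict(x):\n    return None\n"

# Exact match (including the 4-space indent) succeeds:
patched = str_replace_edit(code, "    return None", "    return 'reject'")

# A tab instead of spaces -- the kind of drift an LLM introduces
# when reproducing code from memory -- fails outright:
try:
    str_replace_edit(code, "\treturn None", "\treturn 'reject'")
except ValueError:
    print("edit rejected: whitespace mismatch")
```

Multi-line replacements compound the problem: every line of the target must be reproduced character-for-character, or the whole edit is discarded.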

Accuracy ≠ Understanding

The 92%/84% numbers look impressive, but they come with an important caveat. The paper review task is a binary classification problem on a dataset that is ~60% "reject." A model that predicts "reject" for every input would score 60%. Our 84% val accuracy represents genuine improvement — the model is correctly identifying some "accept" papers — but it is still heavily biased toward rejection.
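
The baseline arithmetic is worth making explicit. Assuming the article's approximate 60/40 class balance on the 50-sample validation split (the exact counts are an assumption here), even the most bias-friendly reading of 84% accuracy forces some correct "accept" predictions:

```python
n = 50
n_reject, n_accept = 30, 20   # ~60/40 split, per the article's estimate
correct = round(0.84 * n)     # 84% validation accuracy -> 42 correct

# A constant "reject" predictor gets every reject and no accept:
baseline = n_reject / n       # 0.60

# Even if ALL 30 rejects were classified correctly, at least this
# many accepts must also have been identified:
min_accepts_found = correct - n_reject   # 12 of 20
print(baseline, min_accepts_found)
```

So the model is finding at least 12 of the 20 "accept" papers, which is real signal over the trivial baseline, while still leaving room for the rejection bias described above.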

In later runs (janus_test_008, 10 generations), val accuracy plateaued at 70%. The meta-agent has not yet found an edit that corrects the residual bias. This suggests that further improvement requires either a fundamentally different prompt strategy, a change to the task agent's architecture (e.g., chain-of-thought reasoning before classification), or more diverse evaluation data.

The Fix

The crash in janus_test_006 led directly to a Docker-based cleanup pattern in gl_utils.py. Instead of calling shutil.rmtree() directly (which fails on root-owned files), we now spin up a minimal Docker container that mounts the target directory and runs rm -rf as root, bypassing the host permission barrier:

import docker

def safe_rmtree(path):
    """Remove a directory tree that may contain root-owned files from Docker.

    `path` is a pathlib.Path; its PARENT is mounted so the target itself
    can be deleted from inside the container.
    """
    client = docker.from_env()  # connect via DOCKER_HOST / local socket
    client.containers.run(
        image='hyperagents:local',
        network_mode='host',
        volumes={str(path.parent): {'bind': '/cleanup', 'mode': 'rw'}},
        command=['rm', '-rf', f'/cleanup/{path.name}'],
        remove=True,  # clean up the helper container itself afterwards
    )

This pattern — using Docker's root privilege to solve the permission problem that Docker created — is inelegant but effective. Every subsequent run (janus_test_007, 008) completed without permission crashes.

Significance

janus_test_006 was a milestone for the project, even though it crashed. It proved that:

  • An LLM can autonomously improve its own task performance by modifying its source code, given the right feedback loop.
  • The improvement is targeted and principled — not random mutation, but a reasoned response to empirical failure data.
  • The meta-agent's bottleneck is execution, not intelligence — it can diagnose problems faster than it can fix them.

This is the foundational observation for the entire Laplace's Demon project. If we can make the edit-execution layer more reliable — through fuzzy matching, constrained edit scopes, or a different editing paradigm — the meta-agent should be able to drive sustained, multi-generational improvement.
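
Of those options, fuzzy matching is the most mechanical to sketch. One possible approach (illustrative only, not how the pipeline currently works) is to fall back to difflib similarity when the verbatim match fails, resolving the meta-agent's slightly-off target to the nearest real span:

```python
import difflib

def fuzzy_find(source, target, threshold=0.9):
    """Return the run of source lines most similar to `target`, or None.

    Slides a window the height of `target` over `source` and scores each
    candidate with difflib's similarity ratio; only matches at or above
    `threshold` are accepted.
    """
    src_lines = source.splitlines()
    tgt_lines = target.splitlines()
    best, best_score = None, threshold
    for i in range(len(src_lines) - len(tgt_lines) + 1):
        candidate = "\n".join(src_lines[i : i + len(tgt_lines)])
        score = difflib.SequenceMatcher(None, candidate, target).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best

source = "def predict(x):\n    return None\n"
# A target with the wrong indentation still resolves to the real line:
match = fuzzy_find(source, "  return None")
```

The returned span can then be fed to the exact-match replace, turning a hard failure into a recoverable near-miss. The threshold is the safety valve: too low and the edit lands on the wrong code, too high and the fallback never fires.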

The crash was a feature, not a bug. It forced us to harden the infrastructure so that future runs could complete reliably. And they did: janus_test_008 ran all 10 generations without a single crash. The next challenge is making those generations matter.


Part of the Research Log from Arachnida Apps. Read about the full pipeline architecture or the project's founding vision.