91% With a Mid-Tier Model: janus_test_009 and the Late-Run Surge
The Question
After janus_test_008 plateaued at 70% validation accuracy, we were left with an open question: was the ceiling a fundamental limitation of the architecture, or simply a symptom of insufficient evolutionary runway? Ten generations gave the meta-agent one meaningful edit. What would a hundred give it?
There was a more provocative framing lurking underneath. Meta's published DGM-H result on the same paper review benchmark — the one that made headlines — used Claude and GPT-4o, models at the absolute frontier of capability. It achieved 71% test accuracy. Could a mid-tier model, given the right architecture and enough time, match that?
We ran janus_test_009 to find out.
The Configuration
This was our longest and most ambitious run to date:
- 101 generations (~20 hours wall-clock time)
- Zero crashes — the infrastructure hardening from tests 006–008 paid off
- Model: GLM 5.1 — a mid-tier model, not a frontier model
- Meta-agent ablated: it could only edit `task_agent.py`, not its own `meta_agent.py` code (no metacognitive self-modification)
- 100-sample evaluation per split (up from 50 in test 008)
- `score_child_prop` parent selection with staged evaluation
The domain remained paper review: binary accept/reject classification of academic papers.
The ablation is critical context. In Meta's DGM-H experiments, the meta-agent was allowed to modify its own improvement logic — its prompts, its evaluation strategies, even its editing patterns. Our meta-agent had no such autonomy. It was a surgeon that could operate on the patient but not on its own hands.
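The staged evaluation in the configuration above can be sketched as a two-stage screen: a cheap split filters out clearly regressed children before the full 100-sample evaluation is spent on them. This is an illustrative reconstruction, not the run's actual code; the function names, the 20-sample screen, and the rejection margin are all assumptions.

```python
def accuracy(agent, split):
    """Fraction of (paper, label) pairs the agent classifies correctly."""
    return sum(agent(paper) == label for paper, label in split) / len(split)

def staged_evaluate(agent, quick_split, full_split, parent_score, margin=0.05):
    """Two-stage evaluation: cheap screen first, full eval only if promising.

    Returns (score, promoted): children that fall well below their parent's
    score on the small split are discarded without the expensive full pass.
    """
    quick = accuracy(agent, quick_split)          # e.g. a 20-sample screen
    if quick < parent_score - margin:
        return quick, False                        # rejected at stage 1
    return accuracy(agent, full_split), True       # full 100-sample evaluation
```

The payoff is compute: with an 80% empty-diff rate, most children never deserve the full evaluation, and the screen reclaims that budget.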
The Numbers at a Glance
| Metric | Value |
|---|---|
| Champion generation | gen_90 |
| Train accuracy | 78% |
| Validation accuracy | 91% |
| Total real code diffs | 21 / 101 generations (20.8% edit rate) |
| Empty diffs | 80 |
| "Output context exceeded" errors | 348 |
| Meta DGM-H benchmark comparison | 71% test accuracy (Claude/GPT-4o) |
Twenty-one real diffs out of 101 attempts. Three hundred and forty-eight context overflow errors. The meta-agent spent most of its compute budget thinking in circles. And yet.
The Accuracy Curve
The trajectory tells the story better than the final number:
| Generation | Train | Val | Delta from previous champion |
|---|---|---|---|
| initial | — | — | baseline |
| gen_2 | 73% | 78% | first measured champion |
| gen_18 | 75% | 83% | +5% |
| gen_52 | — | 85% | +2% |
| gen_73 | 78% | 84% | — |
| gen_81 | 79% | 87% | +3% |
| gen_83 | — | 88% | +1% |
| gen_85 | — | 88% | — |
| gen_90 | 78% | 91% | +3% |
Two things stand out immediately.
First, the early generations are slow. It takes until generation 18 to climb from 78% to 83%, and another 34 generations to reach 85%. The system is accumulating mutations in the archive — most of them useless — but occasionally one sticks.
Second, and more surprising: the rate of improvement accelerates after generation 70. From gen_73 to gen_90, just 17 generations, the system gains 7 percentage points. From gen_2 to gen_73, 71 generations, it gains 6. The system got better at improving as it accumulated more mutations in the archive.
We did not expect this. We expected diminishing returns — the classic optimization curve flattening out as the easy gains are captured. Instead, the opposite happened.
Two Lineages Converge
Before gen_90 broke through, generations 83 and 85 both independently reached 88% — from different ancestors. The archive wasn't just climbing one hill; it was exploring multiple ridges of the fitness landscape, and two of them converged on the same local peak before gen_90 found a path beyond it.
This is exactly the kind of behavior you want from an evolutionary system: parallel exploration, convergent validation, and then a breakthrough that neither lineage could have reached alone.
gen_90: The Champion's Insight
Generation 90 modified the scoring classifier in task_agent.py. The diff is a masterclass in pragmatic engineering — no architectural overhaul, just a recalibration of the decision boundaries:
```python
# Before (original thresholds)
accept_threshold = 3.5
reject_threshold = 2.5
weakness_rejection_count = 3   # reject if 3+ low scores

# After (gen_90 thresholds)
accept_threshold = 4.0         # raised: require stronger signal for accept
reject_threshold = 2.0         # lowered: reject only with clearer negative signal
weakness_rejection_count = 2   # reject if 2+ low scores (stricter)

# NEW: critical flaw rejection
# if soundness <= 1 or novelty <= 1 -> reject regardless of other scores
```
The logic is elegant in its conservatism:
- Wider threshold bands: Accept requires ≥4.0 (up from 3.5), reject requires <2.0 (down from 2.5). The ambiguous middle zone is wider, and the meta-agent routes those uncertain cases to the LLM for a second opinion rather than relying on the scoring heuristic.
- Stricter weakness rejection: Two or more low scores now trigger rejection (down from three). The meta-agent recognized that the previous threshold was too permissive — papers with multiple weak dimensions were slipping through.
- Critical flaw rejection: Soundness ≤1 or novelty ≤1 is an automatic reject, regardless of other scores. This is a kill switch for papers with fundamental flaws — the meta-agent learned that no amount of methodological elegance can rescue a paper that is unsound or unoriginal.
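Put together, the gen_90 policy reads as a short decision function. This is a reconstruction from the description above, not the actual `task_agent.py` code: the dimension names, the "low score" cutoff of ≤2, and the use of a mean score against the thresholds are assumptions.

```python
def decide(scores, llm_fallback):
    """Sketch of the gen_90 decision policy over 1-5 review scores.

    scores: dict mapping review dimensions to ratings.
    llm_fallback: called for papers in the ambiguous middle band.
    """
    # Critical flaw kill switch: unsound or unoriginal papers are rejected outright.
    if scores["soundness"] <= 1 or scores["novelty"] <= 1:
        return "reject"
    # Stricter weakness rule: two or more low dimensions (assumed <= 2) reject.
    if sum(1 for s in scores.values() if s <= 2) >= 2:
        return "reject"
    avg = sum(scores.values()) / len(scores)
    if avg >= 4.0:                 # raised accept threshold
        return "accept"
    if avg < 2.0:                  # lowered reject threshold
        return "reject"
    return llm_fallback(scores)    # ambiguous middle band: second opinion
```

Note the ordering: the categorical rules fire before the threshold comparison, so a paper with a fatal flaw is rejected even if its other scores average above 4.0.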
What's remarkable is that this isn't a prompt rewrite or a new extraction strategy — it's a scoring policy adjustment. The meta-agent moved up the abstraction ladder from "how do I parse the LLM's response?" to "how should I make the final decision?" This is the kind of meta-level improvement that only becomes possible after the lower-level plumbing is reliable.
The Champion's Lineage
gen_90 didn't emerge from nowhere. It is the culmination of a 9-ancestor lineage — and that lineage is not the clean upward path you might imagine:
```
initial
└→ gen_2 (73%/78%)
   └→ gen_18 (75%/83%)
      ├→ gen_25 (64%/66%)   ← REGRESSION (-17%)
      └→ gen_44 (80%/82%)
         └→ gen_53 (76%/81%)
            └→ gen_73 (78%/84%)
               ├→ gen_79 (73%/78%)   ← REGRESSION (-6%)
               └→ gen_81 (79%/87%)
                  └→ gen_90 (78%/91%)   ← CHAMPION
```
Two regressions. gen_25 drops 17 points from its parent gen_18. gen_79 drops 6 points from gen_73. These are not minor fluctuations — they represent the meta-agent making changes that looked promising but hurt performance.
And the archive correctly routes around both of them. gen_44 doesn't descend from gen_25's degraded state; it's selected from an earlier ancestor and builds forward. gen_81 doesn't inherit gen_79's regression; it backtracks to gen_73's stronger position.
This is the archive's parent selection mechanism doing exactly what it should: treating regressions as data, not dead ends. The evolutionary search is robust precisely because it doesn't require monotonic improvement.
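A minimal sketch of score-proportional selection shows why regressions are data rather than dead ends: every archived generation stays selectable, weighted by its validation accuracy, so a regressed node like gen_25 becomes unlikely but its stronger ancestors remain available. This is only an illustration of the idea; the run's actual `score_child_prop` logic may differ.

```python
import random

def select_parent(archive, rng=random):
    """Pick the next parent from the whole archive, weighted by fitness.

    archive: dict mapping generation name -> validation accuracy.
    A regression lowers a node's selection weight without removing it,
    so the search can route around it via earlier ancestors.
    """
    gens = list(archive)
    weights = [archive[g] for g in gens]   # validation accuracy as fitness
    return rng.choices(gens, weights=weights, k=1)[0]
```

Because selection never requires monotonic improvement, a lineage can pass through a bad edit (gen_25, gen_79) and still produce the champion from a sibling branch.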
The Regression Pattern
The two largest regressions share a common cause. gen_49 dropped 19% from parent gen_44, and gen_25 dropped 17% from parent gen_18. Both were the product of the largest diffs in the run — ambitious, multi-part edits that touched several aspects of the scoring logic simultaneously.
The lesson is clear: the meta-agent's most ambitious edits are also its most dangerous. Small, targeted changes (like gen_90's threshold adjustments) are far more likely to improve performance than sweeping rewrites. This is consistent with what we saw in janus_test_006, where the single most effective edit was a focused extraction fix, not a prompt overhaul.
The Caveats
We want to be precise about what 91% means — and what it doesn't.
Validation ≠ test. Our 91% is measured on a validation split. Meta's 71% is measured on a held-out test split. These are not directly comparable. There may be overfitting to the validation distribution that inflates our number.
But the gap is wide. Even with significant overfitting, a 20-point gap over Meta's result is hard to explain away entirely. And we achieved it with:
- A weaker model (GLM 5.1 vs Claude/GPT-4o)
- An ablated configuration (no metacognitive self-modification)
- 80% of compute wasted (see below)
The comparison is nuanced, and we're not claiming a decisive victory. But the signal is strong enough to be taken seriously.
The Wastage
For all the excitement about the 91% result, the efficiency numbers are sobering:
- 80 out of 101 generations produced empty diffs — the meta-agent explored the codebase but failed to land an edit
- 348 "Output context exceeded" errors across the run — GLM 5.1's chain-of-thought consumed the 65K output token budget before producing tool calls
The meta-agent was thinking too much and doing too little. It would read files, analyze the evaluation report, formulate a sophisticated diagnosis... and then run out of tokens before it could write the str_replace call to implement its fix.
This is a model-level limitation, not an architecture limitation. A model with better token budget management — or a system that interrupts chain-of-thought to force tool calls — would likely produce a much higher edit rate.
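One shape such an interrupt could take, sketched against an entirely hypothetical model interface (`model.stream` yielding text chunks, `model.force_tool_call` requesting a concrete edit): cut off free-form reasoning while a reserve of the output budget remains, then spend that reserve on a tool call. Character counts stand in for real token accounting.

```python
def generate_with_forced_tools(model, prompt, budget=65_000, reserve=2_000):
    """Interrupt chain-of-thought before the output budget is exhausted.

    Hypothetical interface: `model.stream` yields text chunks and
    `model.force_tool_call` asks for a tool call given the reasoning so far.
    """
    used, thought = 0, []
    for chunk in model.stream(prompt):
        used += len(chunk)           # crude stand-in for a tokenizer
        thought.append(chunk)
        if used >= budget - reserve:
            break                    # stop reasoning while tokens remain
    # Spend the reserved budget on a concrete edit instead of more analysis.
    return model.force_tool_call(prompt, context="".join(thought))
```

Under this scheme the 348 overflow errors would instead become 348 (possibly mediocre) edit attempts — which, given a 20.8% hit rate on the edits that did land, is a far better use of the budget.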
Zero Emergent Infrastructure
One of the most interesting findings from Meta's DGM-H paper was that their meta-agent spontaneously developed performance trackers, persistent memory systems, and self-monitoring infrastructure. Our meta-agent did none of this.
Every one of the 21 real diffs was a tactical edit to task_agent.py — adjusting thresholds, modifying prompts, tweaking extraction logic. The meta-agent never stepped back and thought "I should build a tool to help me keep track of what I've tried." It just kept making local edits and hoping.
This is almost certainly a consequence of two factors:
- The ablation: the meta-agent couldn't modify its own code, so it couldn't build itself new tools even if it wanted to.
- The model: GLM 5.1 may lack the planning horizon to conceive of infrastructure-level improvements. Frontier models might spontaneously develop these capabilities given the same setup.
The absence of emergent infrastructure is both a limitation and an opportunity. It means the improvements we're seeing are purely from local code edits — no "cheating" via sophisticated self-monitoring. It also means there's a massive untapped reserve of improvement waiting to be unlocked when the meta-agent can build its own scaffolding.
What This Means
Let's put the pieces together.
The architecture works. 91% validation accuracy, zero crashes, a coherent champion lineage — this is a working self-improvement pipeline, not a proof of concept. The HyperAgents framework, even with our modifications and ablations, can drive sustained multi-generational improvement.
Late-run acceleration is real. This is the finding that surprised us most. The system didn't plateau; it accelerated. More accumulated mutations in the archive gave the parent selection mechanism more raw material to work with, and the meta-agent's edits got better — not because the model changed, but because the evolutionary context enriched around it.
Most compute is wasted. 80% empty diffs, 348 context overflows. The bottleneck has shifted from infrastructure (tests 006–008) to model capability and token management. This is progress — we've solved the easy problems and are now hitting the interesting ones.
The ablation reveals the floor. This is what an ablated system achieves with a mid-tier model. No metacognitive self-modification. No emergent infrastructure. No frontier-model intelligence. And it still produced a champion that may outperform Meta's fully-configured DGM-H on the same benchmark.
What's Next
The next run will remove the ablation.
If gen_90's threshold tuning — a tactical, local edit — can reach 91%, what happens when the meta-agent can modify its own improvement logic? When it can edit meta_agent.py, adjust its own prompt, change its editing strategy, build itself tools?
The late-run acceleration suggests that the system becomes more effective as it accumulates evolutionary history. If the meta-agent could also accumulate cognitive history — learning from its own failed edits, remembering which strategies work — the acceleration could compound.
janus_test_010 will be configured with full metacognitive self-modification enabled. The meta-agent will be free to edit any file in the codebase, including itself.
If 91% is the floor, we're very interested to see the ceiling.
This is part of the Research Log from Arachnida Apps. Read about the pipeline architecture, the first successful edit, or the project's founding vision.