Paper: Meta-Harness: End-to-End Optimization of Model Harnesses.

The performance of large language model (LLM) applications is rarely bottlenecked by the model weights alone. Often, the difference between a failing system and a state-of-the-art agent lies in the harness—the code responsible for orchestrating prompts, retrieving context, managing memory, and updating state. A good harness can yield a 6× performance gap on the same benchmark. However, harness engineering is currently a tedious, manual process of trial and error.

Meta-Harness is an outer-loop system that treats the harness itself as an executable program to be optimized.


1. The Mathematical Formulation of Harness Optimization

At its core, a harness $H$ is a stateful program that wraps a fixed LLM $M$. For a given task instance $x$ drawn from a task distribution $X$, executing the harness generates a rollout trajectory $\tau \sim p_M(H, x)$. This trajectory includes the prompts constructed by the harness, the LLM’s responses, and the corresponding state updates.

Given a task-specific reward function $r(\tau, x)$, the goal of harness optimization is to find the optimal harness $H^*$ that maximizes the expected final reward:

\[H^* = \arg \max_H \mathbb{E}_{x \sim X, \tau \sim p_M(H,x)} r(\tau, x)\]

Because we care about multiple objectives in real-world deployments (e.g., maximizing accuracy while minimizing token context cost), Meta-Harness evaluates candidate harnesses under Pareto dominance to output a Pareto frontier of optimal harnesses.


2. The Core Innovation: Uncompressed Feedback via Filesystem Access

Prior text optimization methods (like OPRO, TextGrad, or OpenEvolve) compress feedback heavily. They rely on scalar scores, short templates, or LLM-generated summaries, typically exposing the optimizer to only 100 to 30,000 tokens of context. This is fundamentally mismatched for harness engineering because early context-management choices cascade into failures many steps later, and compressing the feedback destroys the necessary diagnostic footprint.

Meta-Harness solves this by using a coding agent (Claude Code powered by Opus-4.6) as the proposer, granting it unrestricted access to a filesystem containing the source code, evaluation scores, and full execution traces of all prior candidates.

  • Scale of Feedback: A single evaluation can produce up to 10,000,000 tokens of diagnostic information, which is 1,000× larger than prior optimization budgets.
  • Agentic Navigation: Because the context is too large to stuff into a prompt, the proposer actively navigates the filesystem using standard terminal tools like grep and cat. During search, the proposer reads a median of 82 files per iteration, split roughly evenly between prior harness source code (41%) and execution traces (40%).

3. Technical Discoveries: What Do Optimized Harnesses Look Like?

By operating in code space, the proposer discovers domain-specific, highly structured algorithms rather than brittle, hard-coded rules.

A. Online Text Classification (Label-Primed Query)

Instead of just passing nearest-neighbor examples, the best discovered harness generates a highly optimized single-prompt structure.

  • Mechanics: It uses TF-IDF retrieval with a query-anchored pairing rule. It builds a prompt consisting of a Label Primer (listing all valid outputs), a Coverage Block (one query-relevant example per class to expose the full label space), and a Contrastive Block (highly similar examples with different labels to sharpen local decision boundaries).
  • Results: This harness reached 48.6% accuracy, outperforming the state-of-the-art Agentic Context Engineering (ACE) baseline by 7.7 points while using 4× fewer context tokens (11.4K vs 50.8K).

B. Mathematical Reasoning (Four-Route Lexical Router)

Retrieving examples for math problems often hurts performance because naive dense retrieval pulls mathematically irrelevant examples. Meta-Harness evolved a compact four-route BM25 program.

  • Mechanics: A lexical router assigns problems to Combinatorics, Geometry, Number Theory, or Default based on regex and keyword cues. Each route executes a bespoke policy. For example, Combinatorics retrieves 20 candidates, deduplicates to 8, reranks by lexical score and difficulty, and keeps 3. Geometry skips difficulty reranking entirely, returning 1 hard fixed reference and 2 raw BM25 neighbors.
  • Results: This discovered harness improved accuracy on 200 held-out IMO-level problems by an average of 4.7 points across five unseen models (including GPT-5.4 variants and Gemini-3 models).

C. Agentic Coding (TerminalBench-2)

For long-horizon OS agent tasks, the system inherited the Terminus-KIRA baseline but discovered a critical “environment bootstrap” addition.

  • Mechanics: Before the agent loop begins, it runs a compound shell command with a 15-second timeout to capture a snapshot of the OS, working directory, installed package managers (pip, apt), and languages.
  • Results: By injecting this into the initial prompt, the harness eliminates 2–4 wasted early exploration turns, pushing the Claude Opus 4.6 pass rate to 76.4% (ranking #2 globally) and achieving state-of-the-art on Haiku 4.5 (37.6%).

4. Expert Insights: Causal Reasoning in Code Space

Perhaps the most fascinating technical insight is how the proposer traverses the search space. Because it has access to raw execution traces, the agent performs explicit causal reasoning over its own prior failures.

In the TerminalBench-2 logs, the proposer initially attempted to bundle prompt template rewrites with structural state-machine bugfixes, resulting in performance regressions. By reading the execution traces of these failures, the agent explicitly hypothesized that the prompt rewrites were a confounding variable causing the agent to delete necessary state. It isolated the structural fix, tested it, and eventually pivoted to a purely additive strategy (the environment bootstrap) to avoid the fragile completion logic altogether.

This demonstrates that giving optimization models access to raw programmatic history unlocks diagnostic capabilities that compressed scalar rewards completely obscure. The future of LLM optimization may look less like gradient descent on weights, and more like agentic credit assignment on executable orchestrations.