// LLM knowledge base for the hill-climb loop — harness engineering (AutoAgent) × external-world configuration (autoresearch)
Each node below carries an #id, a type badge, a one-paragraph description, and chips linking to what depends on it. Both systems implement the same closed loop, so the operator algebra (Reflect · Select · Improve · Evaluate · Commit) has a single canonical node set that both frameworks reference. Load this page as context when an LLM needs to reason about a hill-climb implementation instead of scanning the prose guide.
┌──────────────── DIRECTIVE ────────────────┐
│ program.md (human-authored, persistent)   │
└──────────────────┬────────────────────────┘
                   ▼
┌───────── META-AGENT ──────────┐
│ (Claude Code · Codex · any)   │
└──────────────┬────────────────┘
               ▼
┌────────────────── THE LOOP (SEPL) ──────────────────┐
│  ρ  Reflect   →  H   failure hypotheses             │
│  σ  Select    →  D   modification primitives        │
│  ι  Improve   →  V'  candidate state                │
│  ε  Evaluate  →  S   scores + safety                │
│  κ  Commit    →  V   next state (or rollback)       │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌──────────────────────────┬──────────────────────────────┐
│ AUTOAGENT                │ AUTORESEARCH                 │
│ target = agent.py        │ target = external system     │
│ score  = Harbor total    │ score  = domain eval [0,1]   │
│ patch  = free Python     │ patch  = RFC 6902 JSON Patch │
│ store  = results.tsv     │ store  = Cloudflare KV       │
└──────────────────────────┴──────────────────────────────┘
Meta-agent harness engineering loop from kevinrgu/autoagent (thirdlayer.inc). Edit program.md; let a coding agent iterate on agent.py overnight against Harbor benchmarks. The harness under test is the mutation target; the Harbor total score is the hill-climb signal.
Domain-agnostic hill-climb on external-world configuration. A meta-agent watches a signal source, proposes reversible patches to a target system, scores the result against a domain-specific eval, keeps or rolls back. Reference deployment: Organized-AI/gtm-autoresearch against GTM / sGTM container state.
Formal two-layer protocol for self-evolving agents (arXiv 2604.15034, Wentao Zhang, 2026-04-16). Layer 1 RSPL names the evolvable substrate; Layer 2 SEPL defines the five-operator closed loop. AutoAgent + autoresearch are practical instantiations of the same pattern.
The paper's concrete instantiation of AGP. Agent Bus architecture — Orchestrator produces plan.md as a versioned RSPL resource, sub-agents run concurrently via the bus, agent-as-tool composition supported. Self-evolution triggers when traces signal correctable failures.
Benchmark runner from the Laude Institute. Harbor tasks are directories with a setup, an agent entry point, and a test suite. AutoAgent uses the aggregate Harbor score as its hill-climb signal. Parallel execution (-n 100) is standard.
Signature ρ: Z × V_evo → ℘(H). Semantic-gradient approximator. Maps execution traces to causal failure hypotheses in the variable space. Default implementation prompts the backbone LLM to produce natural-language diagnoses.
Signature σ: V_evo × ℘(H) → ℘(D). Generative policy. Translates hypotheses into concrete modification proposals, sampling candidates designed to minimise the identified error signal subject to structural constraints.
Signature ι: V_evo × ℘(D) → V'_evo. Mutation operator. Applies updates via standardised RSPL interfaces, yielding a provisional candidate state. In AutoAgent: Python edits to agent.py. In autoresearch: RFC 6902 patch application.
Signature ε: V'_evo × G → S. Objective function. Maps candidate state + goal spec to evaluation space (quantitative scores + strict safety invariants). Harbor total score for AutoAgent; per-domain eval suite for autoresearch.
Signature κ: V'_evo × S → V_evo. Conditional gating mechanism. Accepts the candidate only when success criteria + safety invariants hold; rolls back otherwise. This is what makes the trajectory monotonically improving by construction, not a random walk.
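Composed, the five operators form the closed loop of Algorithm 1. A minimal Python sketch of one round, with the operator implementations passed in as plain callables — names and signatures here are illustrative, not the paper's API:

```python
from typing import Any, Callable


def sepl_round(
    state: Any,            # V_evo: current evolvable state
    traces: Any,           # Z: execution traces from the last run
    goal: Any,             # G: goal spec (success bar + safety invariants)
    reflect: Callable,     # ρ: (traces, state) -> hypotheses
    select: Callable,      # σ: (state, hypotheses) -> modifications
    improve: Callable,     # ι: (state, modifications) -> candidate state
    evaluate: Callable,    # ε: (candidate, goal) -> (score, safety_ok)
    best_score: float,     # best committed score so far
):
    hypotheses = reflect(traces, state)           # ρ: diagnose failures
    mods = select(state, hypotheses)              # σ: propose modifications
    candidate = improve(state, mods)              # ι: apply, get V'_evo
    score, safe = evaluate(candidate, goal)       # ε: score + check invariants
    if safe and score > best_score:               # κ: conditional commit
        return candidate, score
    return state, best_score                      # rollback: keep prior state
```

κ is the only place the committed state changes hands, which is what makes the trajectory monotonically improving by construction rather than a random walk.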
Self-Evolution Protocol Layer. Control-theoretic formalism over RSPL. Operator algebra (ρσιεκ) plus Algorithm 1 (the closed loop). Variable lifting projects heterogeneous RSPL resources onto the unified evolvable variable space V_evo.
Resource Substrate Protocol Layer. Defines the evolvable substrate as protocol-registered resources with explicit state, lifecycle, and version lineage. Resources are passive — they encapsulate no optimisation logic; state transitions happen only through interface-mediated operations from SEPL.
Instructions. System + task-specific text. Versioned. Learnability mask g ∈ {0,1} decides whether the optimiser may edit it.
Decision policies — the reasoning/acting loop itself. Can be replaced or mutated like any other resource once protocol-registered.
Actuation interfaces — native scripts, MCP tools, agent skills. Tools are first-class evolvable resources, not hard-coded internal components.
Task / world dynamics. The context the agent perceives. Swappable — same agent paired with different environments for transfer.
Persistent state across turns and sessions. Externalised from the agent so the same agent policy can run with different memory layouts.
Tuple (n, d, φ, g, m): name, description, input-to-output mapping, trainable marker g ∈ {0,1}, metadata. The learnability bit is what makes AutoAgent's "fixed adapter section, editable harness section" split formal.
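The tuple can be sketched as a dataclass — field names are illustrative, since the paper specifies only the abstract tuple:

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Variable:
    """RSPL evolvable variable: the tuple (n, d, φ, g, m)."""
    name: str                             # n: resource name
    description: str                      # d: human-readable description
    phi: Callable[[Any], Any]             # φ: input-to-output mapping
    learnable: bool                       # g ∈ {0,1}: may the optimiser edit it?
    meta: dict = field(default_factory=dict)  # m: metadata


def editable(variables: list[Variable]) -> list[Variable]:
    # The optimiser only ever mutates variables whose learnability bit is set;
    # everything else (e.g. a fixed Harbor adapter section) is frozen.
    return [v for v in variables if v.learnable]
```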
Default optimiser in AutoAgent + the AGP paper's primary instantiation. ρ prompts the backbone LLM for natural-language failure hypotheses; σ translates them into concrete code / prompt edits; ι applies via RSPL set-variables interface.
Natural-language feedback treated as a "textual gradient". σ generates gradient-informed proposals; ι applies string-level edits. Reuses standard ε / κ.
RL-style optimiser. ρ samples multiple candidate trajectories; σ ranks by reward; ι policy-gradient updates (prompt weights, LoRA adapters); κ commits if the policy exceeds a baseline return threshold.
Group-relative policy-gradient sampling. Same RL shape as Reinforce++; baselined against group mean reward for variance reduction.
Model-tier escalation policy used by both autoresearch and the Hermes model stack. Claude Sonnet drives rounds while score < 0.92; the first round crossing ≥ 0.92 promotes to Opus 4.6 and stays there. One-way by design: a later regression below the threshold does not demote back to Sonnet, so cheap rounds never have to re-earn the promotion.
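The one-way promotion is just a latch. A minimal sketch — class name and model identifiers are illustrative:

```python
class EscalationGate:
    """One-way model-tier promotion: once any round scores >= threshold,
    all subsequent rounds use the expensive model, even after regressions."""

    def __init__(self, cheap: str, expensive: str, threshold: float = 0.92):
        self.cheap = cheap
        self.expensive = expensive
        self.threshold = threshold
        self.promoted = False

    def model_for(self, score: float) -> str:
        if score >= self.threshold:
            self.promoted = True          # latch: never reset, never demote
        return self.expensive if self.promoted else self.cheap
```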
Search strategy: propose a local change, measure whether the score improved, revert if not. The simplest strategy that works when you can only measure after the fact. Both AutoAgent and autoresearch hill-climb on their respective scores.
Full record of an agent's decisions, tool calls, and reasoning during a benchmark / round. The meta-agent diffs trajectories between rounds to identify failure clusters so mutations are targeted, not blind.
Deterministic reverse of a rejected mutation. RFC 6902 op reverses per-op (add↔remove, replace-with-prior-value). AutoAgent: discard the Python edit; autoresearch: apply the inverse patch. Without rollback the loop degrades to a random walk.
IETF spec for describing JSON document changes as an operation list (add / replace / remove / move / copy / test). Autoresearch's default mutation vocabulary because every op has a canonical reverse.
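Per-op inversion can be sketched directly from the spec. `remove` and `replace` need the value that was there before the patch applied, so this hypothetical helper (not part of RFC 6902 itself) takes the pre-patch document; it handles object and array paths but skips the `-` append token for brevity:

```python
def get_at(doc, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against doc."""
    node = doc
    if pointer == "":
        return node
    for token in pointer.strip("/").split("/"):
        token = token.replace("~1", "/").replace("~0", "~")  # unescape per RFC 6901
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def invert_op(op: dict, prior_doc) -> dict:
    """Canonical inverse of one RFC 6902 op, given the pre-patch document."""
    if op["op"] == "add":
        return {"op": "remove", "path": op["path"]}
    if op["op"] == "remove":
        return {"op": "add", "path": op["path"],
                "value": get_at(prior_doc, op["path"])}
    if op["op"] == "replace":
        return {"op": "replace", "path": op["path"],
                "value": get_at(prior_doc, op["path"])}
    if op["op"] == "move":
        return {"op": "move", "from": op["path"], "path": op["from"]}
    raise ValueError(f"no inverse sketched for {op['op']!r}")
```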
Default version store for autoresearch. Schema: client/<slug>/anchor (baseline snapshot) · client/<slug>/round/<n> (patch + score + meta) · client/<slug>/head · client/<slug>/eval/<n>. Replay from anchor reconstructs any round.
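Replay is then a fold of the stored per-round patches over the anchor snapshot. A sketch under the assumption that each round's KV entry carries a plain RFC 6902 operation list; the minimal `apply_patch` here handles `add` / `replace` / `remove` on object paths only:

```python
import copy


def apply_patch(doc: dict, patch: list[dict]) -> dict:
    """Minimal RFC 6902 apply (add/replace/remove, object paths only)."""
    doc = copy.deepcopy(doc)          # never mutate the stored anchor
    for op in patch:
        *parents, leaf = op["path"].strip("/").split("/")
        node = doc
        for key in parents:
            node = node[key]
        if op["op"] in ("add", "replace"):
            node[leaf] = op["value"]
        elif op["op"] == "remove":
            del node[leaf]
    return doc


def replay(anchor: dict, round_patches: list[list[dict]], n: int) -> dict:
    """Reconstruct state after round n: anchor mirrors client/<slug>/anchor,
    round_patches[i] mirrors the patch stored at round i+1."""
    state = anchor
    for patch in round_patches[:n]:
        state = apply_patch(state, patch)
    return state
```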
Domain-specific objective function in [0, 1]. Must be cheap enough to run on every candidate before commit. Example: gtm-autoresearch's client-eval-generator skill materialises the eval suite from Meta Ads insights + GTM exports + a client profile.
AGS's coordination backbone. Agents communicate only through standardised bus messages — loose coupling, transparent observability, concurrent sub-agent execution. Orchestrator plans but does not execute; sub-agents execute, write to shared memory, report back.
AGP paper empirical finding: GPT-4.1 gains +71% on AIME24 under evolve-prompt+solution; gemini-3-flash-preview already at 83–88% gains only 2–12%. Rationale for the 0.92 escalation gate — below it, iteration speed wins; above it, reasoning depth wins.
Human-authored directive. Goal, constraints, success bar, stop conditions. The only file a human edits regularly. Meta-agent re-reads every round.
Single-file harness under test in AutoAgent. Editable section (prompt · registries · tools · routing) plus an explicitly-fixed Harbor adapter section. The meta-agent's primary mutation target.
Harbor-format benchmark tasks — each a directory with setup, entry point, test suite. Typically added in benchmark-specific branches so the baseline branch stays clean.
Optional workspace artefacts — reusable instructions, notes, prompts, skill files the meta-agent can draw on between runs.
AutoAgent's experiment log. Gitignored. Written by the meta-agent — one row per round with the mutation, score, and kept/discarded decision.
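A minimal append sketch for such a log; the exact column set is illustrative (the entry above specifies only mutation, score, and the kept/discarded decision):

```python
import csv
import pathlib


def log_round(path: str, round_n: int, mutation: str, score: float, kept: bool):
    """Append one round's outcome to results.tsv, writing the header once."""
    p = pathlib.Path(path)
    is_new = not p.exists()
    with p.open("a", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        if is_new:
            w.writerow(["round", "mutation", "score", "kept"])
        w.writerow([round_n, mutation, f"{score:.3f}",
                    "kept" if kept else "discarded"])
```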
Latest Harbor run output. uv run harbor run ... > run.log 2>&1. The meta-agent tails this to diagnose failures before proposing the next mutation.
AGS-specific. Versioned RSPL resource written by the Orchestrator. Human-readable flowchart + ordered subtask list + assignments to sub-agents. Coordination structure is itself inspectable and evolvable.
Autoresearch's append-only run manifest. One entry per round with patch, score, KV key. Pairs with the per-round KV entry for full auditability.
198 graduate STEM MCQs. Closed-book; Google-proof. Measures deep scientific reasoning beyond factual recall. Used by the AGP paper to show evolve-prompt+solution gains.
Competition math (American Invitational Mathematics Examination). 30 problems each. Measures long-horizon symbolic reasoning + arithmetic precision. GPT-4.1 went from 23.3% → 40.0% on AIME24 under evolve-prompt+solution (+71.4%).
300 real-world multi-step tasks. Planning + reliable tool use (browsing, documents, files). Used for AGS tool-chaining evaluation.
200 train / 100 test. Multi-language (Python · C++ · Java · Go). Reduced data-contamination cut of recent problems. Metrics: acceptance · pass rate · runtime.
Install AutoAgent's Python deps. Requires uv (astral.sh/uv). Base image built via docker build -f Dockerfile.base -t autoagent-base ..
Execute one or all Harbor tasks against agent.py. -n 100 for parallel; output lands in jobs/. Meta-agent reads score + trajectories from there.
The exact prompt AutoAgent uses to start a meta-agent run. Point any coding agent at the repo and paste this — the agent does the rest overnight.
Autoresearch's per-round runner (reference deployment). Full pipeline via scripts/run-all.sh. All scripts idempotent — re-running never duplicates.
Drives autoresearch's version store. Wrapped by scripts/lib/kv-store.ts (idempotent writes) in the reference deployment. Depends on a logged-in local wrangler.
The prose guide this wiki distills. 11 tabs — Overview · The Loop · AutoAgent · Autogenesis (AGP) · Autoresearch · Harbor · JSON Patch + KV · Model Escalation · Hermes Runtime · Quick Start · Glossary + FAQ.
Runtime story. Hermes is the production host for graduated AutoAgent harnesses — launchd-supervised Pi harnesses on claws-mac-mini.
Sibling session orchestrator. Multi-provider executor registry (Claude · Codex · OpenCode) mirrors the same adapter-shape AGP encourages for Tool resources.
Concrete autoresearch deployment — GTM container mutation against per-client Meta Ads evals. Anchor for the patterns described here.