// LLM knowledge base for the hill-climb loop — harness engineering (AutoAgent) × external-world configuration (autoresearch)
Each node below carries an #id, a type badge, a one-paragraph description, and chips linking to what depends on it. Both systems implement the same closed loop, so the operator algebra (Reflect · Select · Improve · Evaluate · Commit) has a single canonical node set that both frameworks reference. Load this page as context when an LLM needs to reason about a hill-climb implementation instead of scanning the prose guide.
┌──────────────── DIRECTIVE ────────────────┐
│ program.md (human-authored, persistent)   │
└──────────────────┬────────────────────────┘
                   ▼
┌───────── META-AGENT ──────────┐
│ (Claude Code · Codex · any)   │
└──────────────┬────────────────┘
               ▼
┌────────────────── THE LOOP (SEPL) ──────────────────┐
│  ρ  Reflect   →  H   failure hypotheses             │
│  σ  Select    →  D   modification primitives        │
│  ι  Improve   →  V'  candidate state                │
│  ε  Evaluate  →  S   scores + safety                │
│  κ  Commit    →  V   next state (or rollback)       │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌──────────────────────────┬──────────────────────────────┐
│ AUTOAGENT                │ AUTORESEARCH                 │
│ target = agent.py        │ target = external system     │
│ score  = Harbor total    │ score  = domain eval [0,1]   │
│ patch  = free Python     │ patch  = RFC 6902 JSON Patch │
│ store  = results.tsv     │ store  = Cloudflare KV       │
└──────────────────────────┴──────────────────────────────┘
Meta-agent harness engineering loop from kevinrgu/autoagent (thirdlayer.inc). Edit program.md; let a coding agent iterate on agent.py overnight against Harbor benchmarks. The harness under test is the mutation target; the Harbor total score is the hill-climb signal.
Domain-agnostic hill-climb on external-world configuration. A meta-agent watches a signal source, proposes reversible patches to a target system, scores the result against a domain-specific eval, keeps or rolls back. Reference deployment: Organized-AI/gtm-autoresearch against GTM / sGTM container state.
Formal two-layer protocol for self-evolving agents (arXiv 2604.15034, Wentao Zhang, 2026-04-16). Layer 1 RSPL names the evolvable substrate; Layer 2 SEPL defines the five-operator closed loop. AutoAgent + autoresearch are practical instantiations of the same pattern.
The paper's concrete instantiation of AGP. Agent Bus architecture — Orchestrator produces plan.md as a versioned RSPL resource, sub-agents run concurrently via the bus, agent-as-tool composition supported. Self-evolution triggers when traces signal correctable failures.
Benchmark runner from the Laude Institute. Harbor tasks are directories with a setup, an agent entry point, and a test suite. AutoAgent uses the aggregate Harbor score as its hill-climb signal. Parallel execution (-n 100) is standard.
Signature ρ: Z × V_evo → ℘(H). Semantic-gradient approximator. Maps execution traces to causal failure hypotheses in the variable space. Default implementation prompts the backbone LLM to produce natural-language diagnoses.
Signature σ: V_evo × ℘(H) → ℘(D). Generative policy. Translates hypotheses into concrete modification proposals, sampling candidates designed to minimise the identified error signal subject to structural constraints.
Signature ι: V_evo × ℘(D) → V'_evo. Mutation operator. Applies updates via standardised RSPL interfaces, yielding a provisional candidate state. In AutoAgent: Python edits to agent.py. In autoresearch: RFC 6902 patch application.
Signature ε: V'_evo × G → S. Objective function. Maps candidate state + goal spec to evaluation space (quantitative scores + strict safety invariants). Harbor total score for AutoAgent; per-domain eval suite for autoresearch.
Signature κ: V'_evo × S → V_evo. Conditional gating mechanism. Accepts the candidate only when success criteria + safety invariants hold; rolls back otherwise. This is what makes the trajectory monotonically improving by construction, not a random walk.
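Composed, the five operators form the closed loop of Algorithm 1. A minimal Python sketch of one round, with the operator implementations passed in as plain callables — names and signatures here are illustrative, not the paper's API:

```python
from typing import Any, Callable


def sepl_round(
    state: Any,            # V_evo: current evolvable state
    traces: Any,           # Z: execution traces from the last run
    goal: Any,             # G: goal spec (success bar + safety invariants)
    reflect: Callable,     # ρ: (traces, state) -> hypotheses
    select: Callable,      # σ: (state, hypotheses) -> modifications
    improve: Callable,     # ι: (state, modifications) -> candidate state
    evaluate: Callable,    # ε: (candidate, goal) -> (score, safety_ok)
    best_score: float,     # best committed score so far
):
    hypotheses = reflect(traces, state)           # ρ: diagnose failures
    mods = select(state, hypotheses)              # σ: propose modifications
    candidate = improve(state, mods)              # ι: apply, get V'_evo
    score, safe = evaluate(candidate, goal)       # ε: score + check invariants
    if safe and score > best_score:               # κ: conditional commit
        return candidate, score
    return state, best_score                      # rollback: keep prior state
```

κ is the only place the committed state changes hands, which is what makes the trajectory monotonically improving by construction rather than a random walk.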
Self-Evolution Protocol Layer. Control-theoretic formalism over RSPL. Operator algebra (ρσιεκ) plus Algorithm 1 (the closed loop). Variable lifting projects heterogeneous RSPL resources onto the unified evolvable variable space V_evo.
Resource Substrate Protocol Layer. Defines the evolvable substrate as protocol-registered resources with explicit state, lifecycle, and version lineage. Resources are passive — they encapsulate no optimisation logic; state transitions happen only through interface-mediated operations from SEPL.
Instructions. System + task-specific text. Versioned. Learnability mask g ∈ {0,1} decides whether the optimiser may edit it.
Decision policies — the reasoning/acting loop itself. Can be replaced or mutated like any other resource once protocol-registered.
Actuation interfaces — native scripts, MCP tools, agent skills. Tools are first-class evolvable resources, not hard-coded internal components.
Task / world dynamics. The context the agent perceives. Swappable — same agent paired with different environments for transfer.
Persistent state across turns and sessions. Externalised from the agent so the same agent policy can run with different memory layouts.
Tuple (n, d, φ, g, m): name, description, input-to-output mapping, trainable marker g ∈ {0,1}, metadata. The learnability bit is what makes AutoAgent's "fixed adapter section, editable harness section" split formal.
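The tuple can be sketched as a dataclass — field names are illustrative, since the paper specifies only the abstract tuple:

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Variable:
    """RSPL evolvable variable: the tuple (n, d, φ, g, m)."""
    name: str                             # n: resource name
    description: str                      # d: human-readable description
    phi: Callable[[Any], Any]             # φ: input-to-output mapping
    learnable: bool                       # g ∈ {0,1}: may the optimiser edit it?
    meta: dict = field(default_factory=dict)  # m: metadata


def editable(variables: list[Variable]) -> list[Variable]:
    # The optimiser only ever mutates variables whose learnability bit is set;
    # everything else (e.g. a fixed Harbor adapter section) is frozen.
    return [v for v in variables if v.learnable]
```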
Default optimiser in AutoAgent + the AGP paper's primary instantiation. ρ prompts the backbone LLM for natural-language failure hypotheses; σ translates them into concrete code / prompt edits; ι applies via RSPL set-variables interface.
Natural-language feedback treated as a "textual gradient". σ generates gradient-informed proposals; ι applies string-level edits. Reuses standard ε / κ.
RL-style optimiser. ρ samples multiple candidate trajectories; σ ranks by reward; ι policy-gradient updates (prompt weights, LoRA adapters); κ commits if the policy exceeds a baseline return threshold.
Group-relative policy-gradient sampling. Same RL shape as Reinforce++; baselined against group mean reward for variance reduction.
Model-tier escalation policy used by both autoresearch and the Hermes model stack. Claude Sonnet drives rounds while score < 0.92; the first round crossing ≥ 0.92 promotes to Opus 4.6 and stays there. One-way by design: a later regression below the threshold does not demote back to Sonnet, so cheap rounds never have to re-earn the promotion.
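The one-way promotion is just a latch. A minimal sketch — class name and model identifiers are illustrative:

```python
class EscalationGate:
    """One-way model-tier promotion: once any round scores >= threshold,
    all subsequent rounds use the expensive model, even after regressions."""

    def __init__(self, cheap: str, expensive: str, threshold: float = 0.92):
        self.cheap = cheap
        self.expensive = expensive
        self.threshold = threshold
        self.promoted = False

    def model_for(self, score: float) -> str:
        if score >= self.threshold:
            self.promoted = True          # latch: never reset, never demote
        return self.expensive if self.promoted else self.cheap
```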
Search strategy: propose a local change, measure whether the score improved, revert if not. The simplest strategy that works when you can only measure after the fact. Both AutoAgent and autoresearch hill-climb on their respective scores.
Full record of an agent's decisions, tool calls, and reasoning during a benchmark / round. The meta-agent diffs trajectories between rounds to identify failure clusters so mutations are targeted, not blind.
Deterministic reverse of a rejected mutation. RFC 6902 op reverses per-op (add↔remove, replace-with-prior-value). AutoAgent: discard the Python edit; autoresearch: apply the inverse patch. Without rollback the loop degrades to a random walk.
IETF spec for describing JSON document changes as an operation list (add / replace / remove / move / copy / test). Autoresearch's default mutation vocabulary because every op has a canonical reverse.
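Per-op inversion can be sketched directly from the spec. `remove` and `replace` need the value that was there before the patch applied, so this hypothetical helper (not part of RFC 6902 itself) takes the pre-patch document; it handles object and array paths but skips the `-` append token for brevity:

```python
def get_at(doc, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against doc."""
    node = doc
    if pointer == "":
        return node
    for token in pointer.strip("/").split("/"):
        token = token.replace("~1", "/").replace("~0", "~")  # unescape per RFC 6901
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def invert_op(op: dict, prior_doc) -> dict:
    """Canonical inverse of one RFC 6902 op, given the pre-patch document."""
    if op["op"] == "add":
        return {"op": "remove", "path": op["path"]}
    if op["op"] == "remove":
        return {"op": "add", "path": op["path"],
                "value": get_at(prior_doc, op["path"])}
    if op["op"] == "replace":
        return {"op": "replace", "path": op["path"],
                "value": get_at(prior_doc, op["path"])}
    if op["op"] == "move":
        return {"op": "move", "from": op["path"], "path": op["from"]}
    raise ValueError(f"no inverse sketched for {op['op']!r}")
```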
Default version store for autoresearch. Schema: client/<slug>/anchor (baseline snapshot) · client/<slug>/round/<n> (patch + score + meta) · client/<slug>/head · client/<slug>/eval/<n>. Replay from anchor reconstructs any round.
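Replay is then a fold of the stored per-round patches over the anchor snapshot. A sketch under the assumption that each round's KV entry carries a plain RFC 6902 operation list; the minimal `apply_patch` here handles `add` / `replace` / `remove` on object paths only:

```python
import copy


def apply_patch(doc: dict, patch: list[dict]) -> dict:
    """Minimal RFC 6902 apply (add/replace/remove, object paths only)."""
    doc = copy.deepcopy(doc)          # never mutate the stored anchor
    for op in patch:
        *parents, leaf = op["path"].strip("/").split("/")
        node = doc
        for key in parents:
            node = node[key]
        if op["op"] in ("add", "replace"):
            node[leaf] = op["value"]
        elif op["op"] == "remove":
            del node[leaf]
    return doc


def replay(anchor: dict, round_patches: list[list[dict]], n: int) -> dict:
    """Reconstruct state after round n: anchor mirrors client/<slug>/anchor,
    round_patches[i] mirrors the patch stored at round i+1."""
    state = anchor
    for patch in round_patches[:n]:
        state = apply_patch(state, patch)
    return state
```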
Domain-specific objective function in [0, 1]. Must be cheap enough to run on every candidate before commit. Example: gtm-autoresearch's client-eval-generator skill materialises the eval suite from Meta Ads insights + GTM exports + a client profile.
AGS's coordination backbone. Agents communicate only through standardised bus messages — loose coupling, transparent observability, concurrent sub-agent execution. Orchestrator plans but does not execute; sub-agents execute, write to shared memory, report back.
AGP paper empirical finding: GPT-4.1 gains +71% on AIME24 under evolve-prompt+solution; gemini-3-flash-preview already at 83–88% gains only 2–12%. Rationale for the 0.92 escalation gate — below it, iteration speed wins; above it, reasoning depth wins.
Human-authored directive. Goal, constraints, success bar, stop conditions. The only file a human edits regularly. Meta-agent re-reads every round.
Single-file harness under test in AutoAgent. Editable section (prompt · registries · tools · routing) plus an explicitly-fixed Harbor adapter section. The meta-agent's primary mutation target.
Harbor-format benchmark tasks — each a directory with setup, entry point, test suite. Typically added in benchmark-specific branches so the baseline branch stays clean.
Optional workspace artefacts — reusable instructions, notes, prompts, skill files the meta-agent can draw on between runs.
AutoAgent's experiment log. Gitignored. Written by the meta-agent — one row per round with the mutation, score, and kept/discarded decision.
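A minimal append sketch for such a log; the exact column set is illustrative (the entry above specifies only mutation, score, and the kept/discarded decision):

```python
import csv
import pathlib


def log_round(path: str, round_n: int, mutation: str, score: float, kept: bool):
    """Append one round's outcome to results.tsv, writing the header once."""
    p = pathlib.Path(path)
    is_new = not p.exists()
    with p.open("a", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        if is_new:
            w.writerow(["round", "mutation", "score", "kept"])
        w.writerow([round_n, mutation, f"{score:.3f}",
                    "kept" if kept else "discarded"])
```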
Latest Harbor run output. uv run harbor run ... > run.log 2>&1. The meta-agent tails this to diagnose failures before proposing the next mutation.
AGS-specific. Versioned RSPL resource written by the Orchestrator. Human-readable flowchart + ordered subtask list + assignments to sub-agents. Coordination structure is itself inspectable and evolvable.
Autoresearch's append-only run manifest. One entry per round with patch, score, KV key. Pairs with the per-round KV entry for full auditability.
198 graduate STEM MCQs. Closed-book; Google-proof. Measures deep scientific reasoning beyond factual recall. Used by the AGP paper to show evolve-prompt+solution gains.
Competition math (American Invitational Mathematics Examination). 30 problems each. Measures long-horizon symbolic reasoning + arithmetic precision. GPT-4.1 went from 23.3% → 40.0% on AIME24 under evolve-prompt+solution (+71.4%).
300 real-world multi-step tasks. Planning + reliable tool use (browsing, documents, files). Used for AGS tool-chaining evaluation.
200 train / 100 test. Multi-language (Python · C++ · Java · Go). Reduced data-contamination cut of recent problems. Metrics: acceptance · pass rate · runtime.
Install AutoAgent's Python deps. Requires uv (astral.sh/uv). Base image built via docker build -f Dockerfile.base -t autoagent-base ..
Execute one or all Harbor tasks against agent.py. -n 100 for parallel; output lands in jobs/. Meta-agent reads score + trajectories from there.
The exact prompt AutoAgent uses to start a meta-agent run. Point any coding agent at the repo and paste this — the agent does the rest overnight.
Autoresearch's per-round runner (reference deployment). Full pipeline via scripts/run-all.sh. All scripts idempotent — re-running never duplicates.
Drives autoresearch's version store. Wrapped by scripts/lib/kv-store.ts (idempotent writes) in the reference deployment. Depends on a logged-in local wrangler.
The prose guide this wiki distills. 11 tabs — Overview · The Loop · AutoAgent · Autogenesis (AGP) · Autoresearch · Harbor · JSON Patch + KV · Model Escalation · Hermes Runtime · Quick Start · Glossary + FAQ.
Runtime story. Hermes is the production host for graduated AutoAgent harnesses — launchd-supervised Pi harnesses on claws-mac-mini.
Sibling session orchestrator. Multi-provider executor registry (Claude · Codex · OpenCode) mirrors the same adapter-shape AGP encourages for Tool resources.
Concrete autoresearch deployment — GTM container mutation against per-client Meta Ads evals. Anchor for the patterns described here.