Quickstart

goalseek works best when each research project is its own audit boundary. The manifest, baseline files, logs, run artifacts, and git history all live inside the project directory.

Prerequisites

  • Python 3.11 or newer
  • git
  • one provider CLI available on PATH, such as codex, claude, gemini, or opencode

Clean git state matters

The loop creates candidate commits and may revert them. Start from a clean working tree before you run research iterations.
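
A clean working tree can be confirmed before starting the loop. The following is a minimal sketch, not part of goalseek: it relies only on the standard `git status --porcelain` command, which prints nothing when there are no uncommitted or untracked changes. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def porcelain_is_clean(porcelain_output: str) -> bool:
    """Return True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(project_dir: str) -> bool:
    """Ask git whether the project has uncommitted or untracked changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=Path(project_dir),
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails (for example, not a repository)
    )
    return porcelain_is_clean(result.stdout)
```

Calling `working_tree_is_clean("./demo")` before `goalseek run` gives an early warning instead of a failure partway through an iteration.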

Install the package

Create a pyproject.toml in the workspace where you want to run goalseek, and point uv at the latest wheel published in the repository's dist/ folder:

[project]
name = "goalseek-runner"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "goalseek @ https://github.com/shambhu112/goalseek/raw/main/dist/goalseek-0.1.1-py3-none-any.whl",
]

Then create the environment and sync it:

uv venv .venv
uv sync

Verify the CLI:

uv run goalseek --help

Create a project

Choose a provider and model, then scaffold a new project.

uv run goalseek project init demo --provider codex --model gpt-5.4-mini

The scaffold includes:

demo/
  .git/
  manifest.yaml
  program.md
  setup.py
  validate_results.py
  experiment.py
  config/project.yaml
  context/
  data/
  hidden/
  logs/
  runs/

Check the manifest

Open demo/manifest.yaml and confirm the core files have the right modes:

files:
  - path: manifest.yaml
    mode: read_only
  - path: program.md
    mode: read_only
  - path: setup.py
    mode: read_only
  - path: validate_results.py
    mode: hidden
  - path: experiment.py
    mode: writable
  - path: hidden/**
    mode: hidden
  - path: config/**
    mode: read_only
  - path: runs/**
    mode: generated
  - path: logs/**
    mode: generated
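
To reason about which mode applies to a given file, the manifest entries can be treated as an ordered list of glob patterns. The sketch below is an illustration only: the first-match-wins rule and the use of Python's `fnmatch` semantics are assumptions, and goalseek's own path matching may differ. `MANIFEST_FILES` and `mode_for` are hypothetical names.

```python
from fnmatch import fnmatchcase

# (pattern, mode) pairs copied from the manifest excerpt above.
MANIFEST_FILES = [
    ("manifest.yaml", "read_only"),
    ("program.md", "read_only"),
    ("setup.py", "read_only"),
    ("validate_results.py", "hidden"),
    ("experiment.py", "writable"),
    ("hidden/**", "hidden"),
    ("config/**", "read_only"),
    ("runs/**", "generated"),
    ("logs/**", "generated"),
]


def mode_for(path, entries=MANIFEST_FILES):
    """Return the mode of the first matching pattern, or None if unlisted.

    Assumption: first match wins. In fnmatch, `*` also matches `/`,
    so `hidden/**` covers nested files under hidden/.
    """
    for pattern, mode in entries:
        if fnmatchcase(path, pattern):
            return mode
    return None
```

For example, `mode_for("experiment.py")` returns `"writable"`, while a path that no entry covers returns `None`.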

Validate the manifest:

uv run goalseek manifest validate ./demo

Then inspect demo/config/project.yaml and confirm the hypothesis provider, implementation provider, model names, timeouts, and logging settings match the run you want.

Prepare baseline files

Before you run the baseline, make sure these three files represent a runnable first version of the project:

File | Purpose
experiment.py | First implementation the verifier can train or run. It is writable and may be changed by later iterations.
program.md | Reusable read-only instructions for the planning and implementation providers. It is visible during READ_CONTEXT, PLAN, and APPLY_CHANGE.
validate_results.py | Hidden verification harness. It is not provider context. It runs during VERIFY when referenced by manifest verification commands.
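
As a rough illustration of what a hidden verification harness might look like, here is a minimal sketch of a validate_results.py. It is not the scaffolded file: the `--evaluate` and `--output` flags and the top-level `metric` key are taken from the default manifest on this page, while `compute_score` and `write_results` are hypothetical placeholders you would replace with a real evaluation.

```python
import argparse
import json
from pathlib import Path


def write_results(output_path: str, score: float) -> None:
    """Write the score under a top-level "metric" key as JSON."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"metric": score}, indent=2))


def compute_score() -> float:
    """Placeholder (assumption): replace with evaluation against hidden data."""
    return 0.0


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Hypothetical verifier sketch")
    parser.add_argument("--evaluate", action="store_true")
    parser.add_argument("--output", default="runs/latest/results.json")
    args = parser.parse_args(argv)
    if args.evaluate:
        write_results(args.output, compute_score())


if __name__ == "__main__":
    main()
```

Keeping the write step in its own function makes it easy to check that the output file has the shape the metric extractor expects.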

The default manifest usually runs:

verification:
  commands:
    - name: train
      run: python3 experiment.py
      cwd: .
      timeout_sec: 1200
    - name: evaluate
      run: python3 validate_results.py --evaluate --output runs/latest/results.json
      cwd: .
      timeout_sec: 800
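
To make the meaning of `run`, `cwd`, and `timeout_sec` concrete, the sketch below executes a list of such commands in order with Python's `subprocess` module. It is an illustration under assumptions, not goalseek's verifier: stopping at the first failure or timeout is assumed behavior, and `run_verification` is a hypothetical name.

```python
import subprocess
from pathlib import Path


def run_verification(commands, project_dir="."):
    """Run each command in order; return a list of (name, returncode).

    A timed-out command is recorded with returncode None.
    Assumption: the sequence stops at the first failure or timeout.
    """
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd["run"],
                shell=True,  # the manifest stores each command as one string
                cwd=Path(project_dir) / cmd.get("cwd", "."),
                timeout=cmd["timeout_sec"],
                capture_output=True,
                text=True,
            )
        except subprocess.TimeoutExpired:
            results.append((cmd["name"], None))
            break
        results.append((cmd["name"], proc.returncode))
        if proc.returncode != 0:
            break
    return results
```

With the default manifest, a failing `train` step would mean `evaluate` never runs, so the metric file from an earlier run should not be mistaken for a fresh result.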

Make sure the metric extractor points at the verifier output:

metric:
  name: score
  direction: maximize
  extractor:
    type: json_file
    path: runs/latest/results.json
    json_pointer: /metric
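
The `json_pointer` value follows JSON Pointer syntax (RFC 6901), where `/metric` selects the top-level `metric` key. The sketch below shows how such a pointer resolves against a parsed results file. It is a small standalone resolver for illustration, not goalseek's extractor, and `resolve_pointer` is a hypothetical name.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON pointer: {pointer!r}")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape "~1" before "~0", as the RFC requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


results = json.loads('{"metric": 0.91, "details": {"folds": [0.9, 0.92]}}')
score = resolve_pointer(results, "/metric")
```

If the verifier writes the score under a different key, for example nested inside `details`, the pointer in the manifest has to change to match.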

Optional: import sample research assets

This repo includes a small Kaggle-style demo package.

./move-testpackage.sh --overwrite ./demo

After importing, re-check manifest.yaml, program.md, experiment.py, and validate_results.py.

Run the lifecycle

Prepare the project:

uv run goalseek setup ./demo

Commit local scaffold and setup changes before agent-driven edits:

uv run goalseek gittreeclean --message "clean repo" ./demo

Capture the baseline metric. Baseline runs verification on the current project without asking a provider to edit code.

uv run goalseek baseline ./demo

Run a few full iterations:

uv run goalseek run ./demo --iterations 3

Inspect status and summary:

uv run goalseek status ./demo
uv run goalseek summary ./demo

What to expect on disk

  • runs/0000_baseline/ stores baseline artifacts.
  • runs/0001/, runs/0002/, and later directories store iteration-specific prompts, plans, logs, and result records.
  • logs/state.json stores resumable loop state.
  • logs/results.jsonl stores append-only result summaries.

Good first inspection points

Open runs/0001/prompt.md, runs/0001/provider_output.md, and runs/0001/result.json after your first iteration. Those files make the system much easier to reason about.
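
Because logs/results.jsonl is append-only with one JSON document per line, it can be loaded with a few lines of Python. This is a minimal sketch: the fields inside each record are not specified on this page, so the code treats every record as an opaque dictionary. `read_results` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_results(path):
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records
```

For example, `read_results("demo/logs/results.jsonl")` returns the summaries in the order they were appended.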

What happens during a run

Baseline:

  1. discovers the project root
  2. validates manifest.yaml
  3. loads effective config
  4. runs verification commands
  5. extracts the metric
  6. writes runs/0000_baseline/ and initializes logs/state.json

Each later iteration:

READ_CONTEXT -> PLAN -> APPLY_CHANGE -> COMMIT -> VERIFY -> DECIDE -> LOG

The loop resumes from logs/state.json if a previous run stopped between phases.
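
The resume behavior can be pictured as picking up the phase sequence after the last phase that finished. The sketch below is purely illustrative: the phase names come from this page, but the schema of logs/state.json is not documented here, so `last_completed` is a hypothetical stand-in for whatever the state file actually records.

```python
PHASES = [
    "READ_CONTEXT",
    "PLAN",
    "APPLY_CHANGE",
    "COMMIT",
    "VERIFY",
    "DECIDE",
    "LOG",
]


def remaining_phases(last_completed=None):
    """Return the phases still to run in the current iteration.

    `last_completed` is a hypothetical field: None means the iteration
    has not started, otherwise it names the last phase that finished.
    """
    if last_completed is None:
        return list(PHASES)
    return PHASES[PHASES.index(last_completed) + 1:]
```

Under this model, a run interrupted after COMMIT would continue with VERIFY, DECIDE, and LOG rather than planning a new change.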

Common failure modes

  • Missing provider CLI executable: install the matching provider tool and make sure it is available on PATH.
  • Dirty working tree: commit or restore local changes before you run iterations.
  • Manifest issues: re-run goalseek manifest validate and check path scopes plus metric extraction rules.
  • Verification failures: inspect runs/<iteration>/verifier.log.
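
The first failure mode above can be caught before any run with a quick PATH check. This is a small preflight sketch, not a goalseek feature: it uses the standard library's `shutil.which`, and the candidate names are the provider CLIs listed in the prerequisites. `available_providers` is a hypothetical helper name.

```python
import shutil

# Provider CLI names taken from the prerequisites list on this page.
PROVIDER_CLIS = ("codex", "claude", "gemini", "opencode")


def available_providers(candidates=PROVIDER_CLIS, which=shutil.which):
    """Return the candidate executables that can be found on PATH."""
    return [name for name in candidates if which(name)]
```

An empty list from `available_providers()` means no provider tool is installed or PATH does not include it.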