# pcq Agent Guide pcq is the contract for agent-run ML experiments. The contract specification (cq.yaml format, JSON contracts, MCP tool surface, strictness, conformance, schema versioning) lives under https://github.com/playidea-lab/pcq/tree/main/spec — that directory is the single source of truth. This guide describes the contract from the agent's perspective. The Apache-2.0 Python package distributed via PyPI (`uv add pcq`) is the reference implementation. The CQ Go service worker is a second implementation targeting the same contract; future Go/JS clients are welcome. pcq is not a training framework, model zoo, framework adapter matrix, or experiment tracking SaaS — it is the contract layer that makes arbitrary ML code inspectable, reproducible, verifiable, comparable, and repeatable through standard files and JSON/MCP surfaces. ## Identity - Package: pcq - Import: `import pcq` - CLI: `pcq` - License: Apache-2.0 - Repository: https://github.com/playidea-lab/pcq - PyPI: https://pypi.org/project/pcq/ - Website: https://playidea-lab.github.io/pcq/ Core sentence: ```text pcq does not operate the model. pcq operates the experiment boundary. ``` Runtime contract names: - `cq.yaml` - `CQ_CONFIG_JSON` - `cq://` These names do not mean pcq is usable only with the managed CQ service. ## What pcq Standardizes An experiment project has a `cq.yaml` file: ```yaml name: mnist-mlp cmd: uv run python train.py configs: seed: 42 epochs: 3 output_dir: output monitor: eval_acc mode: max metrics: - epoch - eval_acc artifacts: - output/ ``` The training code can use any ML stack: ```python import pcq cfg = pcq.config() out = pcq.output_dir() # Run any training code here. score = 0.82 pcq.log(epoch=0, eval_acc=score) pcq.save_all( history=[{"epoch": 0, "eval_acc": score}], status="completed", artifacts={"model": "model.pt"}, ) ``` ## Read Path For Agents Use these commands before editing or running: ```bash pcq resolve --json pcq inspect . --json pcq validate . --strictness 2 --json ``` The agent should identify: - project root - selected `cq.yaml` - command to run - declared metrics - output directory - existing artifacts - previous run records - validation warnings or blocking failures ## Run Path For Agents For a final result object only: ```bash pcq run --path . --json ``` For live events: ```bash pcq run --path . --jsonl ``` For a final JSON object plus event evidence in a file: ```bash pcq run --path . --events output/events.jsonl --json ``` JSONL events are newline-delimited JSON objects. Each event includes at least: - `schema_version` - `seq` - `time` - `event` Important event types: - `run.started` - `stdout` - `stderr` - `metric` - `run.completed` - `run.failed` - `run.error` Metric events are derived from `pcq.log(...)` stdout lines in `@key=value` format. ## Post-Run Path For Agents After the process exits: ```bash pcq validate-run output --strictness 3 --json pcq describe-run output --json ``` For comparing two iterations: ```bash pcq compare-runs old_output new_output --json pcq lineage new_output --json ``` `describe-run` and `compare-runs` expose decision facts. They intentionally do not decide whether to continue, rollback, or accept a run. ## Standard Artifacts A valid run should produce: - `config.json` - `metrics.json` - `manifest.json` - `run_summary.json` - `run_record.json` - `validation_report.json` `run_record.json` is the canonical completion record. It should contain source, environment, input, metric, artifact, validation, lineage, and summary evidence. 
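A minimal sketch of consuming this evidence after a run such as `pcq run --path . --events output/events.jsonl --json`. It relies only on the artifact names listed above and the four guaranteed event fields; deeper field access should follow the schemas under `spec/schemas` rather than this sketch:

```python
import json
from collections import Counter
from pathlib import Path

out = Path("output")

# Each line of the events file is one JSON object carrying at least
# schema_version, seq, time, and event.
events = [
    json.loads(line)
    for line in (out / "events.jsonl").read_text().splitlines()
    if line.strip()
]
print(Counter(e["event"] for e in events))
print(len([e for e in events if e["event"] == "metric"]), "metric events")

# run_record.json is the canonical completion record; list its top-level
# evidence groups instead of assuming any nested field names.
record = json.loads((out / "run_record.json").read_text())
print(sorted(record.keys()))
```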
## Agent Behavior Do: - Prefer JSON/JSONL surfaces over scraping terminal prose. - Keep project-specific model, dataset, loss, optimizer, scheduler, and framework code in the user's project. - Declare metrics in `cq.yaml` before emitting them with `pcq.log(...)`. - Use `pcq.output_dir()` rather than hard-coded `output/` paths. - Treat failed runs as evidence when partial artifacts can be preserved. Do not: - Treat process exit code alone as experiment success. - Assume pcq owns the training loop. - Assume CQ service is required. - Add framework adapters when direct contract code is enough. - Edit pcq internals for one project-specific experiment. ## MCP Integration (v4.1.0) pcq ships an optional Model Context Protocol server so agent runtimes (Claude Code, Codex, custom LLM clients) can call pcq with structured JSON instead of shelling out and parsing stdout. Install: ```bash uv add 'pcq[mcp]' ``` Wire the project: ```bash pcq init-experiment --output ./my-exp --agent claude pcq agent install --target claude --path ./my-exp --mcp ``` The `--mcp` flag merges the following into `.mcp.json` (existing entries preserved): ```json { "mcpServers": { "pcq": { "command": "pcq", "args": ["mcp", "serve"] } } } ``` Serve: ```bash pcq mcp serve # stdio (default) pcq mcp serve --transport sse --host 127.0.0.1 --port 8765 ``` 14 MCP tools (read-only tools never mkdir, never mutate cq.yaml, never spawn subprocesses): | Tool | Read-only | Maps to | |------|-----------|---------| | `resolve_project` | yes | `pcq resolve` | | `inspect_project` | yes | `pcq inspect` | | `validate_project` | yes | `pcq validate` | | `validate_run` | yes | `pcq validate-run` | | `describe_run` | yes | `pcq describe-run` | | `compare_runs` | yes | `pcq compare-runs` | | `lineage_chain` | yes | `pcq lineage` | | `apply_plan` | no | `pcq apply-plan` | | `apply_planset` | no | `pcq apply-planset` | | `init_experiment` | no | `pcq init-experiment` | | `finalize_run` | no | `pcq finalize` | | `agent_install` | no | `pcq agent install` | | `agent_status` | yes | `pcq agent status` | | `run_experiment` | no | `pcq run` | Every tool's input/output is anchored in the `pcq.agent.json_contracts.JSON_CONTRACTS` registry frozen since v2.13. Long multi-hour GPU training should not block the in-process `run_experiment` tool. For that workload, prefer the CQ service queue which consumes the same contract. Embed the registry directly without the MCP server wrapper: ```python from pcq.mcp.tools import build_tools import asyncio tools = build_tools() resolve = next(t for t in tools if t.name == "resolve_project") result = asyncio.run(resolve.handler({"path": "."})) ``` ## Evidence Metadata Fields Three agent-fillable APIs enrich the run record with machine-environment, identity, and data-fingerprint evidence. All three are additive — omitting them does not invalidate a run at any strictness level, but including them enables richer comparison, audit, and reproducibility checks. ### pcq.attribution — who ran this experiment ```python pcq.attribution( operator="claude-sonnet-4-6", # agent or service that launched the run author="alice@example.com", # human who owns the experiment (optional) committer="bob@example.com", # git committer identity (optional) ) ``` All three parameters accept free-form strings; the contract normalises them but does not validate format. Call once, typically before `pcq.run()` or at the start of the training script. 
Environment variable equivalents (env overrides constructor args): | Variable | Maps to | Notes | |---|---|---| | `CQ_ATTRIBUTION_OPERATOR` | operator | agent id, bot username, service name | | `CQ_ATTRIBUTION_AUTHOR` | author | human email or display name | | `CQ_ATTRIBUTION_COMMITTER` | committer | git committer identity | | `CQ_ATTRIBUTION_WIKI_SOURCE` | — | URL when content originates from a wiki page | | `CQ_ATTRIBUTION_BOT_MODEL` | — | exact model id for bot/LLM-generated content | | `CQ_ATTRIBUTION_BOT_VERSION` | — | version string of the bot runner | | `CQ_ATTRIBUTION_CONTEXT_URL` | — | permalink to the conversation or task context | | `CQ_ATTRIBUTION_TEAM_ID` | — | team or org identifier for multi-tenant setups | **When to set**: always when an agent or CI bot runs an experiment, so audit trails can distinguish human-initiated from bot-initiated runs. For Wiki or automated content, set `CQ_ATTRIBUTION_WIKI_SOURCE` and `CQ_ATTRIBUTION_BOT_MODEL` for full model-context provenance. PII policy: operator/author/committer values are stored verbatim and appear in `run_record.json`. Do not place secrets or sensitive PII in these fields. Governed by **R10** (attribution field retention) and **R14** (consent for personally-identifying operator strings; use a pseudonym or service account where consent is not established). ### pcq.worker_spec — machine environment fingerprint ```python spec = pcq.worker_spec() # Returns and stores: cpu model, core count, RAM, GPU list, OS, Python version. ``` `worker_spec()` requires no arguments; it auto-detects the current host. The result is attached to `run_record.json` under `environment.worker_spec`. Environment variable overrides (useful in containers or CI where detection is unreliable): | Variable | Meaning | |---|---| | `CQ_WORKER_CPU_MODEL` | CPU model string | | `CQ_WORKER_CPU_CORES` | logical core count | | `CQ_WORKER_RAM_GB` | total RAM in gigabytes | | `CQ_WORKER_GPU_COUNT` | number of GPUs | | `CQ_WORKER_GPU_0_NAME` | name of primary GPU | | `CQ_WORKER_GPU_0_VRAM_GB` | VRAM of primary GPU (GB) | | `CQ_WORKER_GPU_0_DRIVER` | driver version string | | `CQ_WORKER_OS` | OS identifier (e.g. `linux`, `darwin`, `windows`) | | `CQ_WORKER_OS_VERSION` | OS version string | | `CQ_WORKER_PYTHON_VERSION` | Python version string | | `CQ_WORKER_HOSTNAME` | hostname override | | `CQ_WORKER_CONTAINER_ID` | container / pod id | | `CQ_WORKER_REGION` | cloud region or datacenter label | **When to set**: set env vars when the auto-detected values are wrong (cloud VMs, WSL2, Docker). Do not set `CQ_WORKER_HOSTNAME` to a real hostname if that hostname is considered PII in your deployment. PII policy: `worker_spec` fields are environment-infrastructure metadata, not personal data. Governed by **R5** (machine-env retention; purge on infrastructure decommission) and **R5b** (cloud-instance identifiers; replace with region+type labels before external sharing). ### pcq.fingerprint — PII-safe data statistics ```python pcq.fingerprint( X, # array-like: features y=None, # array-like: labels (optional) modality="tabular", # "tabular" | "image" | "text" | "audio" | "video" | "other" task_kind="classification", # "classification" | "regression" | "generation" | ... domain="general", # "general" | "medical" | "financial" | "legal" | ... 
) ``` `fingerprint()` computes and stores in `run_record.json`: - shape, dtype, and content hash (SHA-256 of sorted row hashes — row order independent, satisfies **R15** byte-identical reproducibility gate) - per-column distribution stats (mean, std, min, max, null rate) for tabular - class balance for classification targets - modality-appropriate structural stats for image/text/audio No raw data values are stored. The hash is one-way; the contract cannot reconstruct the dataset from the fingerprint. Environment variable equivalents: | Variable | Meaning | |---|---| | `CQ_FINGERPRINT_MODALITY` | modality string override | | `CQ_FINGERPRINT_TASK_KIND` | task_kind string override | | `CQ_FINGERPRINT_DOMAIN` | domain string override | | `CQ_FINGERPRINT_DISABLE` | set to `1` to skip fingerprinting entirely | | `CQ_FINGERPRINT_HASH_ONLY` | set to `1` to record hash only (no dist stats) | **Domain gate**: when `domain` is `"medical"` or `"financial"`, fingerprint recording is restricted to **declared-only mode** — only fields explicitly listed in `cq.yaml` under `fingerprint.declared_fields` are included in the stats output. This prevents accidental leakage of sensitive column names or distribution shapes. Other regulated domains (e.g. `"legal"`) should also use declared-only mode as a precaution. **When to set**: call `pcq.fingerprint()` immediately after loading the dataset, before any train/test split. This records the canonical dataset identity for the run. Agents that skip fingerprinting cannot benefit from the R15 byte-identical reproducibility gate in downstream validation. PII policy: fingerprint stores only statistical summaries and a content hash. Governed by **R10** (fingerprint field retention in run_record), **R5b** (suppress column names for sensitive domains unless declared), **R14** (domain declaration required for medical/financial data). ## Inference Metrics (v4.7, recommended) 11 recommended keys for benchmarking, profiling, and runtime observability. Recommendation only — no validation gate, no schema change. `metrics.json` remains free-key; these keys provide a stable vocabulary so comparisons and dashboards work across projects. 
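Percentile and peak-memory values are typically computed in the benchmark loop itself. A minimal measurement sketch, assuming `numpy` and `psutil` are installed and `run_one_request` is a hypothetical callable that issues a single inference request; it produces values in the shape the emit snippet below expects:

```python
import time
import numpy as np
import psutil

def measure_inference(run_one_request, n_requests: int = 100) -> dict:
    # Time each request with a monotonic clock; keep latencies in milliseconds.
    latencies_ms = []
    for _ in range(n_requests):
        t0 = time.perf_counter()
        run_one_request()
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)

    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])

    # RSS sampled after the loop: a cheap stand-in for the true peak.
    memory_peak_mb = psutil.Process().memory_info().rss / 1e6

    vram_peak_mb = 0.0
    try:
        import torch
        if torch.cuda.is_available():
            vram_peak_mb = torch.cuda.max_memory_allocated() / 1e6
    except ImportError:
        pass

    total_s = sum(latencies_ms) / 1000.0
    return {
        "latency_p50_ms": float(p50),
        "latency_p95_ms": float(p95),
        "latency_p99_ms": float(p99),
        "latency_mean_ms": float(np.mean(latencies_ms)),
        "throughput_qps": n_requests / total_s if total_s else 0.0,
        "memory_peak_mb": memory_peak_mb,
        "vram_peak_mb": vram_peak_mb,
    }
```

The returned dict can be passed straight to `pcq.log(**metrics)` or printed as `@key=value` lines.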
Emit via `pcq.log(...)` or `@key=value` stdout lines:

```python
# Inside the training loop: @key=value pattern
print(f"@latency_p50_ms={p50:.2f} @throughput_qps={qps:.1f}")
print(f"@memory_peak_mb={mem_mb:.0f} @vram_peak_mb={vram_mb:.0f}")

# Or call pcq.log() directly
pcq.log(
    batch_size=32,
    sequence_length=512,
    latency_mean_ms=18.4,
    tokens_per_sec=2048.0,
    time_to_first_token_ms=5.1,
)
```

Or via environment / cq.yaml metrics declaration:

```yaml
# cq.yaml
metrics:
  - latency_p50_ms
  - latency_p95_ms
  - latency_p99_ms
  - latency_mean_ms
  - throughput_qps
  - tokens_per_sec
  - time_to_first_token_ms
  - memory_peak_mb
  - vram_peak_mb
  - batch_size
  - sequence_length
```

Key reference:

| Key | Unit | Meaning | Usage pattern |
|-----|------|---------|---------------|
| `latency_p50_ms` | ms | median request latency | benchmark loop |
| `latency_p95_ms` | ms | 95th-percentile latency | SLA check |
| `latency_p99_ms` | ms | 99th-percentile latency | tail-latency audit |
| `latency_mean_ms` | ms | mean request latency | average perf trend |
| `throughput_qps` | req/s | queries per second | capacity planning |
| `tokens_per_sec` | tok/s | token generation throughput | LLM inference |
| `time_to_first_token_ms` | ms | TTFT — first token latency | streaming UX |
| `memory_peak_mb` | MB | peak CPU / system RAM usage | OOM prevention |
| `vram_peak_mb` | MB | peak GPU VRAM usage | GPU fit check |
| `batch_size` | — | inference batch size | tuning axis |
| `sequence_length` | — | input sequence length (tokens) | tuning axis |

Notes:

- `latency_*` keys should reflect wall-clock time from request receipt to response completion (excluding network I/O unless stated otherwise).
- `memory_peak_mb` and `vram_peak_mb` are peak snapshots during the run, not averages. Use `psutil.Process().memory_info().rss / 1e6` and `torch.cuda.max_memory_allocated() / 1e6` respectively.
- `throughput_qps` and `tokens_per_sec` are complementary: QPS counts requests, tokens/s counts generated tokens. Both can be logged per epoch or per benchmark window.
- All 11 keys appear in `compare-runs` output when present in both runs, enabling automated regression detection.

## Failure Categories (v4.8)

`failure.category` in `run_summary.json` is set by a regex-based heuristic on `failure.message`. Free string values are valid when no pattern matches.
| Category | Meaning | Example trigger | Retry / Abort hint |
|---|---|---|---|
| `config_error` | `cq.yaml` or runtime config invalid | `"invalid cq.yaml: unknown key 'batch'"` | Abort — fix `cq.yaml` before retry |
| `missing_dependency` | Required Python package absent | `"ModuleNotFoundError: No module named 'timm'"` | Abort — `uv add <package>` then retry |
| `dataset_missing` | Dataset file or URI not found | `"FileNotFoundError: data/train.csv not found"` | Abort — verify inputs then retry |
| `dataset_shape` | Tensor / array dimension mismatch | `"Expected shape (N,3,H,W) got (N,1,H,W)"` | Abort — fix tensor dims in code |
| `label_contract` | Label range or dtype violation | `"Label 5 out of range [0,4]"` | Abort — check label range / dtype |
| `loss_contract` | Loss function received incompatible inputs | `"loss() got unexpected keyword argument 'reduction'"` | Abort — check loss signature |
| `metric_contract` | Undeclared metric emitted at strictness ≥ 3 | `"Undeclared metric 'f1_score'"` | Abort — declare in `cq.yaml.metrics` |
| `oom` | CUDA or host memory exhausted | `"CUDA out of memory at batch 17"` | Retry with smaller `batch_size` |
| `nan_loss` | Loss became NaN or Inf | `"Loss is NaN at epoch 2, step 140"` | Retry with lower `lr` or gradient clipping |
| `timeout` | Run exceeded time budget | `"Run timed out after 3600s"` | Retry with larger `time_budget` |
| `distributed_write_race` | Concurrent writers collided on artifact path | `"OSError: [Errno 17] File exists: output/run_record.json"` | Retry with fewer concurrent writers |
| `accuracy_below_threshold` | Validation metric below acceptance threshold | `"eval_acc 0.42 < required 0.80"` | Retune (smaller `lr`, longer training) |
| `user_interrupted` | Explicit user / operator signal | `"KeyboardInterrupt"`, `"SIGTERM received"` | Respect — do not auto-retry |
| `disk_full` | Output directory out of disk space | `"OSError: [Errno 28] No space left on device"` | Abort — free space then retry |
| `model_load_failed` | Checkpoint / weights file could not be loaded | `"RuntimeError: PytorchStreamReader failed reading zip archive"` | Re-download or check integrity, then retry |
| `unknown_exception` | Unclassified exception | any unrecognised traceback | Manual investigation before retry |

The heuristic classifier lives at `spec/agent/failure_classifier.py`.
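These hints map directly onto agent control flow. A minimal sketch, assuming only that a failed run's `run_summary.json` carries a `failure` object with the `category` field described above; the retry sets mirror the table, not any built-in pcq policy:

```python
import json
from pathlib import Path

# Categories the table marks as retry-after-adjustment vs. respect-and-stop.
RETRY_AFTER_ADJUSTMENT = {
    "oom",                       # smaller batch_size
    "nan_loss",                  # lower lr or gradient clipping
    "timeout",                   # larger time_budget
    "distributed_write_race",    # fewer concurrent writers
    "accuracy_below_threshold",  # retune hyperparameters
}
NEVER_RETRY = {"user_interrupted"}

def next_action(output_dir: str = "output") -> str:
    summary = json.loads((Path(output_dir) / "run_summary.json").read_text())
    failure = summary.get("failure")
    if not failure:
        return "accept"  # no failure recorded in the summary
    category = failure.get("category", "unknown_exception")
    if category in NEVER_RETRY:
        return "stop"
    if category in RETRY_AFTER_ADJUSTMENT:
        return "retry"
    return "abort"  # config, dependency, dataset, and contract errors need a fix first
```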
## Examples ### sklearn — RandomForest on Iris ```yaml # cq.yaml name: sklearn-baseline cmd: uv run python train.py configs: output_dir: output seed: 42 n_estimators: 100 monitor: eval_acc mode: max metrics: - epoch - eval_acc artifacts: - output/ ``` ```python # train.py import pcq, joblib from sklearn.datasets import load_iris from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split cfg = pcq.config() out = pcq.output_dir() pcq.seed_everything(cfg.get("seed", 42)) X, y = load_iris(return_X_y=True) X_tr, X_te, y_tr, y_te = train_test_split( X, y, test_size=0.2, random_state=cfg["seed"] ) model = RandomForestClassifier(n_estimators=cfg.get("n_estimators", 100)) model.fit(X_tr, y_tr) acc = float(model.score(X_te, y_te)) pcq.log(epoch=0, eval_acc=acc) joblib.dump(model, out / "model.pkl") pcq.save_all(history=[{"epoch": 0, "eval_acc": acc}], artifacts={"model": "model.pkl"}) ``` ### PyTorch — agnostic training loop ```python import pcq, torch from torch import nn cfg = pcq.config() out = pcq.output_dir() pcq.seed_everything(cfg.get("seed", 42)) model = nn.Linear(cfg["in_dim"], cfg["out_dim"]) opt = torch.optim.Adam(model.parameters(), lr=cfg["lr"]) history = [] for epoch in range(cfg["epochs"]): train_loss = train_one_epoch(model, opt) # user code val_acc = evaluate(model) # user code pcq.log(epoch=epoch, train_loss=train_loss, val_acc=val_acc) history.append({"epoch": epoch, "train_loss": train_loss, "val_acc": val_acc}) torch.save(model.state_dict(), out / "model.pt") pcq.save_all(history=history, artifacts={"model": "model.pt"}) ``` ## Tool Response Samples Real captured responses for four canonical MCP tools are inlined at https://playidea-lab.github.io/pcq/#tools-catalog. Each sample comes from a real run of `examples/contract_sklearn`; only volatile fields (timestamps, git_sha, sha256 hashes, absolute paths) are elided as `"..."`. Anchors: - `#tools-catalog-describe_run` — compact RunRecord summary - `#tools-catalog-compare_runs` — diff between two RunRecords - `#tools-catalog-validate_run` — strictness=3 pass/warn/fail report - `#tools-catalog-lineage_chain` — parent chain walk `agent-manifest.json` mirrors these under the `tool_response_samples` array (`name` + `command` + `mcp_tool` + `sample_anchor` per entry), so agents that prefer JSON over HTML can discover the same evidence machine-readably. ## Case Studies Four production dogfoods (pcq's own validation cycle on real ML workloads) are documented in `docs/case-studies/` and surfaced on the site at `#case-studies`: - **MNIST Dogfood (2026-05-08)** — pcq v2.11, MNIST digits, 9 fresh agent generations, eval_acc 0.9583 → 1.0. First end-to-end dogfood; drove the v2.12 round of fixes. - **Tabular Dogfood (2026-05-09)** — pcq 3.0.1, breast-cancer dataset, TabPFN/PyCaret/FLAML/XGBoost/sklearn diversity. First post-PyPI install path validation. - **MCP Dogfood (2026-05-10)** — pcq[mcp] 4.1.0, Claude Code MCP. First v4.1.0 MCP loop end-to-end via `mcp__pcq__*` tools (no subprocess CLI). 3 sequential generations. - **CQ Worker Dogfood (2026-05-10)** — pcq[mcp] 4.2.0, CQ Go service worker on RTX 5080. First production CQ Go worker dispatch end-to-end; verified `cq.yaml` + `CQ_CONFIG_JSON` + 6-artifact protocol. ## Spec The contract surface (cq.yaml format, JSON contracts, MCP tools, strictness, schema versioning, conformance) lives at the repository root under `spec/`, separately from the Python implementation in `src/pcq/`. 
Other languages and runtimes can target the contract without depending on Python. - Index: https://github.com/playidea-lab/pcq/blob/main/spec/INDEX.md - JSON Schemas (auto-exported from `pcq.agent.json_contracts.JSON_CONTRACTS` via `scripts/export_schemas.py`): https://github.com/playidea-lab/pcq/tree/main/spec/schemas - Versioning policy: https://github.com/playidea-lab/pcq/blob/main/spec/VERSIONING.md - Conformance suite: https://github.com/playidea-lab/pcq/blob/main/spec/CONFORMANCE.md (golden input/output pairs at `tests/conformance///`) CI guards drift between the registry and the on-disk schemas via `uv run python scripts/export_schemas.py --check`. ## Roadmap The thesis stays the same: pcq does not compete with the means of training. The work ahead strengthens the framework-neutral evidence and control layer — broader real-world contract coverage, deeper validation/lineage facts, more machine-readable surfaces for agent runtimes, tighter integration with the CQ managed consumer. Built-in models, losses, datasets, and per-framework adapter matrices remain deliberately out of scope. - v4 Direction: https://github.com/playidea-lab/pcq/blob/main/docs/V4_DIRECTION.md - Completion Roadmap: https://github.com/playidea-lab/pcq/blob/main/docs/PCQ_COMPLETION_ROADMAP.md - Releases: https://github.com/playidea-lab/pcq/releases ## Related Docs - v4 direction: https://github.com/playidea-lab/pcq/blob/main/docs/V4_DIRECTION.md - Introduction: https://github.com/playidea-lab/pcq/blob/main/docs/INTRODUCTION.md - JSON contracts: https://github.com/playidea-lab/pcq/blob/main/docs/JSON_CONTRACTS.md - Agent guide: https://github.com/playidea-lab/pcq/blob/main/docs/AGENT_OPERATING_GUIDE.md - Strictness: https://github.com/playidea-lab/pcq/blob/main/docs/STRICTNESS.md - Runtime contract: https://github.com/playidea-lab/pcq/blob/main/docs/CQ_YAML_RUNTIME_CONTRACT.md - MCP integration: https://github.com/playidea-lab/pcq/blob/main/docs/MCP_INTEGRATION.md