# pcq Agent Guide pcq is the contract for agent-run ML experiments. The contract specification (cq.yaml format, JSON contracts, MCP tool surface, strictness, conformance, schema versioning) lives under https://github.com/playidea-lab/pcq/tree/main/spec — that directory is the single source of truth. This guide describes the contract from the agent's perspective. The Apache-2.0 Python package distributed via PyPI (`uv add pcq`) is the reference implementation. The CQ Go service worker is a second implementation targeting the same contract; future Go/JS clients are welcome. pcq is not a training framework, model zoo, framework adapter matrix, or experiment tracking SaaS — it is the contract layer that makes arbitrary ML code inspectable, reproducible, verifiable, comparable, and repeatable through standard files and JSON/MCP surfaces. ## Identity - Package: pcq - Import: `import pcq` - CLI: `pcq` - License: Apache-2.0 - Repository: https://github.com/playidea-lab/pcq - PyPI: https://pypi.org/project/pcq/ - Website: https://playidea-lab.github.io/pcq/ Core sentence: ```text pcq does not operate the model. pcq operates the experiment boundary. ``` Runtime contract names: - `cq.yaml` - `CQ_CONFIG_JSON` - `cq://` These names do not mean pcq is usable only with the managed CQ service. ## What pcq Standardizes An experiment project has a `cq.yaml` file: ```yaml name: mnist-mlp cmd: uv run python train.py configs: seed: 42 epochs: 3 output_dir: output monitor: eval_acc mode: max metrics: - epoch - eval_acc artifacts: - output/ ``` The training code can use any ML stack: ```python import pcq cfg = pcq.config() out = pcq.output_dir() # Run any training code here. score = 0.82 pcq.log(epoch=0, eval_acc=score) pcq.save_all( history=[{"epoch": 0, "eval_acc": score}], status="completed", artifacts={"model": "model.pt"}, ) ``` ## Read Path For Agents Use these commands before editing or running: ```bash pcq resolve --json pcq inspect . --json pcq validate . --strictness 2 --json ``` The agent should identify: - project root - selected `cq.yaml` - command to run - declared metrics - output directory - existing artifacts - previous run records - validation warnings or blocking failures ## Run Path For Agents For a final result object only: ```bash pcq run --path . --json ``` For live events: ```bash pcq run --path . --jsonl ``` For a final JSON object plus event evidence in a file: ```bash pcq run --path . --events output/events.jsonl --json ``` JSONL events are newline-delimited JSON objects. Each event includes at least: - `schema_version` - `seq` - `time` - `event` Important event types: - `run.started` - `stdout` - `stderr` - `metric` - `run.completed` - `run.failed` - `run.error` Metric events are derived from `pcq.log(...)` stdout lines in `@key=value` format. ## Post-Run Path For Agents After the process exits: ```bash pcq validate-run output --strictness 3 --json pcq describe-run output --json ``` For comparing two iterations: ```bash pcq compare-runs old_output new_output --json pcq lineage new_output --json ``` `describe-run` and `compare-runs` expose decision facts. They intentionally do not decide whether to continue, rollback, or accept a run. ## Standard Artifacts A valid run should produce: - `config.json` - `metrics.json` - `manifest.json` - `run_summary.json` - `run_record.json` - `validation_report.json` `run_record.json` is the canonical completion record. It should contain source, environment, input, metric, artifact, validation, lineage, and summary evidence. 
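A minimal sketch of consuming this evidence after a run such as `pcq run --path . --events output/events.jsonl --json`. It relies only on the artifact names listed above and the four guaranteed event fields; deeper field access should follow the schemas under `spec/schemas` rather than this sketch:

```python
import json
from collections import Counter
from pathlib import Path

out = Path("output")

# Each line of the events file is one JSON object carrying at least
# schema_version, seq, time, and event.
events = [
    json.loads(line)
    for line in (out / "events.jsonl").read_text().splitlines()
    if line.strip()
]
print(Counter(e["event"] for e in events))
print(len([e for e in events if e["event"] == "metric"]), "metric events")

# run_record.json is the canonical completion record; list its top-level
# evidence groups instead of assuming any nested field names.
record = json.loads((out / "run_record.json").read_text())
print(sorted(record.keys()))
```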
## Agent Behavior Do: - Prefer JSON/JSONL surfaces over scraping terminal prose. - Keep project-specific model, dataset, loss, optimizer, scheduler, and framework code in the user's project. - Declare metrics in `cq.yaml` before emitting them with `pcq.log(...)`. - Use `pcq.output_dir()` rather than hard-coded `output/` paths. - Treat failed runs as evidence when partial artifacts can be preserved. Do not: - Treat process exit code alone as experiment success. - Assume pcq owns the training loop. - Assume CQ service is required. - Add framework adapters when direct contract code is enough. - Edit pcq internals for one project-specific experiment. ## MCP Integration (v4.1.0) pcq ships an optional Model Context Protocol server so agent runtimes (Claude Code, Codex, custom LLM clients) can call pcq with structured JSON instead of shelling out and parsing stdout. Install: ```bash uv add 'pcq[mcp]' ``` Wire the project: ```bash pcq init-experiment --output ./my-exp --agent claude pcq agent install --target claude --path ./my-exp --mcp ``` The `--mcp` flag merges the following into `.mcp.json` (existing entries preserved): ```json { "mcpServers": { "pcq": { "command": "pcq", "args": ["mcp", "serve"] } } } ``` Serve: ```bash pcq mcp serve # stdio (default) pcq mcp serve --transport sse --host 127.0.0.1 --port 8765 ``` 14 MCP tools (read-only tools never mkdir, never mutate cq.yaml, never spawn subprocesses): | Tool | Read-only | Maps to | |------|-----------|---------| | `resolve_project` | yes | `pcq resolve` | | `inspect_project` | yes | `pcq inspect` | | `validate_project` | yes | `pcq validate` | | `validate_run` | yes | `pcq validate-run` | | `describe_run` | yes | `pcq describe-run` | | `compare_runs` | yes | `pcq compare-runs` | | `lineage_chain` | yes | `pcq lineage` | | `apply_plan` | no | `pcq apply-plan` | | `apply_planset` | no | `pcq apply-planset` | | `init_experiment` | no | `pcq init-experiment` | | `finalize_run` | no | `pcq finalize` | | `agent_install` | no | `pcq agent install` | | `agent_status` | yes | `pcq agent status` | | `run_experiment` | no | `pcq run` | Every tool's input/output is anchored in the `pcq.agent.json_contracts.JSON_CONTRACTS` registry frozen since v2.13. Long multi-hour GPU training should not block the in-process `run_experiment` tool. For that workload, prefer the CQ service queue which consumes the same contract. Embed the registry directly without the MCP server wrapper: ```python from pcq.mcp.tools import build_tools import asyncio tools = build_tools() resolve = next(t for t in tools if t.name == "resolve_project") result = asyncio.run(resolve.handler({"path": "."})) ``` ## Evidence Metadata Fields Three agent-fillable APIs enrich the run record with machine-environment, identity, and data-fingerprint evidence. All three are additive — omitting them does not invalidate a run at any strictness level, but including them enables richer comparison, audit, and reproducibility checks. ### pcq.attribution — who ran this experiment ```python pcq.attribution( operator="claude-sonnet-4-6", # agent or service that launched the run author="alice@example.com", # human who owns the experiment (optional) committer="bob@example.com", # git committer identity (optional) ) ``` All three parameters accept free-form strings; the contract normalises them but does not validate format. Call once, typically before `pcq.run()` or at the start of the training script. 
Environment variable equivalents (env overrides constructor args): | Variable | Maps to | Notes | |---|---|---| | `CQ_ATTRIBUTION_OPERATOR` | operator | agent id, bot username, service name | | `CQ_ATTRIBUTION_AUTHOR` | author | human email or display name | | `CQ_ATTRIBUTION_COMMITTER` | committer | git committer identity | | `CQ_ATTRIBUTION_WIKI_SOURCE` | — | URL when content originates from a wiki page | | `CQ_ATTRIBUTION_BOT_MODEL` | — | exact model id for bot/LLM-generated content | | `CQ_ATTRIBUTION_BOT_VERSION` | — | version string of the bot runner | | `CQ_ATTRIBUTION_CONTEXT_URL` | — | permalink to the conversation or task context | | `CQ_ATTRIBUTION_TEAM_ID` | — | team or org identifier for multi-tenant setups | **When to set**: always when an agent or CI bot runs an experiment, so audit trails can distinguish human-initiated from bot-initiated runs. For Wiki or automated content, set `CQ_ATTRIBUTION_WIKI_SOURCE` and `CQ_ATTRIBUTION_BOT_MODEL` for full model-context provenance. PII policy: operator/author/committer values are stored verbatim and appear in `run_record.json`. Do not place secrets or sensitive PII in these fields. Governed by **R10** (attribution field retention) and **R14** (consent for personally-identifying operator strings; use a pseudonym or service account where consent is not established). ### pcq.worker_spec — machine environment fingerprint ```python spec = pcq.worker_spec() # Returns and stores: cpu model, core count, RAM, GPU list, OS, Python version. ``` `worker_spec()` requires no arguments; it auto-detects the current host. The result is attached to `run_record.json` under `environment.worker_spec`. Environment variable overrides (useful in containers or CI where detection is unreliable): | Variable | Meaning | |---|---| | `CQ_WORKER_CPU_MODEL` | CPU model string | | `CQ_WORKER_CPU_CORES` | logical core count | | `CQ_WORKER_RAM_GB` | total RAM in gigabytes | | `CQ_WORKER_GPU_COUNT` | number of GPUs | | `CQ_WORKER_GPU_0_NAME` | name of primary GPU | | `CQ_WORKER_GPU_0_VRAM_GB` | VRAM of primary GPU (GB) | | `CQ_WORKER_GPU_0_DRIVER` | driver version string | | `CQ_WORKER_OS` | OS identifier (e.g. `linux`, `darwin`, `windows`) | | `CQ_WORKER_OS_VERSION` | OS version string | | `CQ_WORKER_PYTHON_VERSION` | Python version string | | `CQ_WORKER_HOSTNAME` | hostname override | | `CQ_WORKER_CONTAINER_ID` | container / pod id | | `CQ_WORKER_REGION` | cloud region or datacenter label | **When to set**: set env vars when the auto-detected values are wrong (cloud VMs, WSL2, Docker). Do not set `CQ_WORKER_HOSTNAME` to a real hostname if that hostname is considered PII in your deployment. PII policy: `worker_spec` fields are environment-infrastructure metadata, not personal data. Governed by **R5** (machine-env retention; purge on infrastructure decommission) and **R5b** (cloud-instance identifiers; replace with region+type labels before external sharing). ### pcq.fingerprint — PII-safe data statistics ```python pcq.fingerprint( X, # array-like: features y=None, # array-like: labels (optional) modality="tabular", # "tabular" | "image" | "text" | "audio" | "video" | "other" task_kind="classification", # "classification" | "regression" | "generation" | ... domain="general", # "general" | "medical" | "financial" | "legal" | ... 
) ``` `fingerprint()` computes and stores in `run_record.json`: - shape, dtype, and content hash (SHA-256 of sorted row hashes — row order independent, satisfies **R15** byte-identical reproducibility gate) - per-column distribution stats (mean, std, min, max, null rate) for tabular - class balance for classification targets - modality-appropriate structural stats for image/text/audio No raw data values are stored. The hash is one-way; the contract cannot reconstruct the dataset from the fingerprint. Environment variable equivalents: | Variable | Meaning | |---|---| | `CQ_FINGERPRINT_MODALITY` | modality string override | | `CQ_FINGERPRINT_TASK_KIND` | task_kind string override | | `CQ_FINGERPRINT_DOMAIN` | domain string override | | `CQ_FINGERPRINT_DISABLE` | set to `1` to skip fingerprinting entirely | | `CQ_FINGERPRINT_HASH_ONLY` | set to `1` to record hash only (no dist stats) | **Domain gate**: when `domain` is `"medical"` or `"financial"`, fingerprint recording is restricted to **declared-only mode** — only fields explicitly listed in `cq.yaml` under `fingerprint.declared_fields` are included in the stats output. This prevents accidental leakage of sensitive column names or distribution shapes. Other regulated domains (e.g. `"legal"`) should also use declared-only mode as a precaution. **When to set**: call `pcq.fingerprint()` immediately after loading the dataset, before any train/test split. This records the canonical dataset identity for the run. Agents that skip fingerprinting cannot benefit from the R15 byte-identical reproducibility gate in downstream validation. PII policy: fingerprint stores only statistical summaries and a content hash. Governed by **R10** (fingerprint field retention in run_record), **R5b** (suppress column names for sensitive domains unless declared), **R14** (domain declaration required for medical/financial data). ## Inference Metrics (v4.7, recommended) 11 recommended keys for benchmarking, profiling, and runtime observability. Recommendation only — no validation gate, no schema change. `metrics.json` remains free-key; these keys provide a stable vocabulary so comparisons and dashboards work across projects. 
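Percentile and peak-memory values are typically computed in the benchmark loop itself. A minimal measurement sketch, assuming `numpy` and `psutil` are installed and `run_one_request` is a hypothetical callable that issues a single inference request; it produces values in the shape the emit snippet below expects:

```python
import time
import numpy as np
import psutil

def measure_inference(run_one_request, n_requests: int = 100) -> dict:
    # Time each request with a monotonic clock; keep latencies in milliseconds.
    latencies_ms = []
    for _ in range(n_requests):
        t0 = time.perf_counter()
        run_one_request()
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)

    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])

    # RSS sampled after the loop: a cheap stand-in for the true peak.
    memory_peak_mb = psutil.Process().memory_info().rss / 1e6

    vram_peak_mb = 0.0
    try:
        import torch
        if torch.cuda.is_available():
            vram_peak_mb = torch.cuda.max_memory_allocated() / 1e6
    except ImportError:
        pass

    total_s = sum(latencies_ms) / 1000.0
    return {
        "latency_p50_ms": float(p50),
        "latency_p95_ms": float(p95),
        "latency_p99_ms": float(p99),
        "latency_mean_ms": float(np.mean(latencies_ms)),
        "throughput_qps": n_requests / total_s if total_s else 0.0,
        "memory_peak_mb": memory_peak_mb,
        "vram_peak_mb": vram_peak_mb,
    }
```

The returned dict can be passed straight to `pcq.log(**metrics)` or printed as `@key=value` lines.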
Emit via `pcq.log(...)` or `@key=value` stdout lines:

```python
# Inside the training loop: @key=value pattern
print(f"@latency_p50_ms={p50:.2f} @throughput_qps={qps:.1f}")
print(f"@memory_peak_mb={mem_mb:.0f} @vram_peak_mb={vram_mb:.0f}")

# Or call pcq.log() directly
pcq.log(
    batch_size=32,
    sequence_length=512,
    latency_mean_ms=18.4,
    tokens_per_sec=2048.0,
    time_to_first_token_ms=5.1,
)
```

Or via environment / cq.yaml metrics declaration:

```yaml
# cq.yaml
metrics:
  - latency_p50_ms
  - latency_p95_ms
  - latency_p99_ms
  - latency_mean_ms
  - throughput_qps
  - tokens_per_sec
  - time_to_first_token_ms
  - memory_peak_mb
  - vram_peak_mb
  - batch_size
  - sequence_length
```

Key reference:

| Key | Unit | Meaning | Usage pattern |
|-----|------|---------|---------------|
| `latency_p50_ms` | ms | median request latency | benchmark loop |
| `latency_p95_ms` | ms | 95th-percentile latency | SLA check |
| `latency_p99_ms` | ms | 99th-percentile latency | tail-latency audit |
| `latency_mean_ms` | ms | mean request latency | average perf trend |
| `throughput_qps` | req/s | queries per second | capacity planning |
| `tokens_per_sec` | tok/s | token generation throughput | LLM inference |
| `time_to_first_token_ms` | ms | TTFT — first token latency | streaming UX |
| `memory_peak_mb` | MB | peak CPU / system RAM usage | OOM prevention |
| `vram_peak_mb` | MB | peak GPU VRAM usage | GPU fit check |
| `batch_size` | — | inference batch size | tuning axis |
| `sequence_length` | — | input sequence length (tokens) | tuning axis |

Notes:

- `latency_*` keys should reflect wall-clock time from request receipt to response completion (excluding network I/O unless stated otherwise).
- `memory_peak_mb` and `vram_peak_mb` are peak snapshots during the run, not averages. Use `psutil.Process().memory_info().rss / 1e6` and `torch.cuda.max_memory_allocated() / 1e6` respectively.
- `throughput_qps` and `tokens_per_sec` are complementary: QPS counts requests, tokens/s counts generated tokens. Both can be logged per epoch or per benchmark window.
- All 11 keys appear in `compare-runs` output when present in both runs, enabling automated regression detection.

## Failure Categories (v4.8)

`failure.category` in `run_summary.json` is set by a regex-based heuristic on `failure.message`. Free string values are valid when no pattern matches.
| Category | Meaning | Example trigger | Retry / Abort hint |
|---|---|---|---|
| `config_error` | `cq.yaml` or runtime config invalid | `"invalid cq.yaml: unknown key 'batch'"` | Abort — fix `cq.yaml` before retry |
| `missing_dependency` | Required Python package absent | `"ModuleNotFoundError: No module named 'timm'"` | Abort — `uv add <package>` then retry |
| `dataset_missing` | Dataset file or URI not found | `"FileNotFoundError: data/train.csv not found"` | Abort — verify inputs then retry |
| `dataset_shape` | Tensor / array dimension mismatch | `"Expected shape (N,3,H,W) got (N,1,H,W)"` | Abort — fix tensor dims in code |
| `label_contract` | Label range or dtype violation | `"Label 5 out of range [0,4]"` | Abort — check label range / dtype |
| `loss_contract` | Loss function received incompatible inputs | `"loss() got unexpected keyword argument 'reduction'"` | Abort — check loss signature |
| `metric_contract` | Undeclared metric emitted at strictness ≥ 3 | `"Undeclared metric 'f1_score'"` | Abort — declare in `cq.yaml.metrics` |
| `oom` | CUDA or host memory exhausted | `"CUDA out of memory at batch 17"` | Retry with smaller `batch_size` |
| `nan_loss` | Loss became NaN or Inf | `"Loss is NaN at epoch 2, step 140"` | Retry with lower `lr` or gradient clipping |
| `timeout` | Run exceeded time budget | `"Run timed out after 3600s"` | Retry with larger `time_budget` |
| `distributed_write_race` | Concurrent writers collided on artifact path | `"OSError: [Errno 17] File exists: output/run_record.json"` | Retry with fewer concurrent writers |
| `accuracy_below_threshold` | Validation metric below acceptance threshold | `"eval_acc 0.42 < required 0.80"` | Retune (smaller `lr`, longer training) |
| `user_interrupted` | Explicit user / operator signal | `"KeyboardInterrupt"`, `"SIGTERM received"` | Respect — do not auto-retry |
| `disk_full` | Output directory out of disk space | `"OSError: [Errno 28] No space left on device"` | Abort — free space then retry |
| `model_load_failed` | Checkpoint / weights file could not be loaded | `"RuntimeError: PytorchStreamReader failed reading zip archive"` | Re-download or check integrity, then retry |
| `unknown_exception` | Unclassified exception | any unrecognised traceback | Manual investigation before retry |

The heuristic classifier lives at `spec/agent/failure_classifier.py`.
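These hints map directly onto agent control flow. A minimal sketch, assuming only that a failed run's `run_summary.json` carries a `failure` object with the `category` field described above; the retry sets mirror the table, not any built-in pcq policy:

```python
import json
from pathlib import Path

# Categories the table marks as retry-after-adjustment vs. respect-and-stop.
RETRY_AFTER_ADJUSTMENT = {
    "oom",                       # smaller batch_size
    "nan_loss",                  # lower lr or gradient clipping
    "timeout",                   # larger time_budget
    "distributed_write_race",    # fewer concurrent writers
    "accuracy_below_threshold",  # retune hyperparameters
}
NEVER_RETRY = {"user_interrupted"}

def next_action(output_dir: str = "output") -> str:
    summary = json.loads((Path(output_dir) / "run_summary.json").read_text())
    failure = summary.get("failure")
    if not failure:
        return "accept"  # no failure recorded in the summary
    category = failure.get("category", "unknown_exception")
    if category in NEVER_RETRY:
        return "stop"
    if category in RETRY_AFTER_ADJUSTMENT:
        return "retry"
    return "abort"  # config, dependency, dataset, and contract errors need a fix first
```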
## Examples ### sklearn — RandomForest on Iris ```yaml # cq.yaml name: sklearn-baseline cmd: uv run python train.py configs: output_dir: output seed: 42 n_estimators: 100 monitor: eval_acc mode: max metrics: - epoch - eval_acc artifacts: - output/ ``` ```python # train.py import pcq, joblib from sklearn.datasets import load_iris from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split cfg = pcq.config() out = pcq.output_dir() pcq.seed_everything(cfg.get("seed", 42)) X, y = load_iris(return_X_y=True) X_tr, X_te, y_tr, y_te = train_test_split( X, y, test_size=0.2, random_state=cfg["seed"] ) model = RandomForestClassifier(n_estimators=cfg.get("n_estimators", 100)) model.fit(X_tr, y_tr) acc = float(model.score(X_te, y_te)) pcq.log(epoch=0, eval_acc=acc) joblib.dump(model, out / "model.pkl") pcq.save_all(history=[{"epoch": 0, "eval_acc": acc}], artifacts={"model": "model.pkl"}) ``` ### PyTorch — agnostic training loop ```python import pcq, torch from torch import nn cfg = pcq.config() out = pcq.output_dir() pcq.seed_everything(cfg.get("seed", 42)) model = nn.Linear(cfg["in_dim"], cfg["out_dim"]) opt = torch.optim.Adam(model.parameters(), lr=cfg["lr"]) history = [] for epoch in range(cfg["epochs"]): train_loss = train_one_epoch(model, opt) # user code val_acc = evaluate(model) # user code pcq.log(epoch=epoch, train_loss=train_loss, val_acc=val_acc) history.append({"epoch": epoch, "train_loss": train_loss, "val_acc": val_acc}) torch.save(model.state_dict(), out / "model.pt") pcq.save_all(history=history, artifacts={"model": "model.pt"}) ``` ## Tool Response Samples Real captured responses for four canonical MCP tools are inlined at https://playidea-lab.github.io/pcq/#tools-catalog. Each sample comes from a real run of `examples/contract_sklearn`; only volatile fields (timestamps, git_sha, sha256 hashes, absolute paths) are elided as `"..."`. Anchors: - `#tools-catalog-describe_run` — compact RunRecord summary - `#tools-catalog-compare_runs` — diff between two RunRecords - `#tools-catalog-validate_run` — strictness=3 pass/warn/fail report - `#tools-catalog-lineage_chain` — parent chain walk `agent-manifest.json` mirrors these under the `tool_response_samples` array (`name` + `command` + `mcp_tool` + `sample_anchor` per entry), so agents that prefer JSON over HTML can discover the same evidence machine-readably. ## Case Studies Four production dogfoods (pcq's own validation cycle on real ML workloads) are documented in `docs/case-studies/` and surfaced on the site at `#case-studies`: - **MNIST Dogfood (2026-05-08)** — pcq v2.11, MNIST digits, 9 fresh agent generations, eval_acc 0.9583 → 1.0. First end-to-end dogfood; drove the v2.12 round of fixes. - **Tabular Dogfood (2026-05-09)** — pcq 3.0.1, breast-cancer dataset, TabPFN/PyCaret/FLAML/XGBoost/sklearn diversity. First post-PyPI install path validation. - **MCP Dogfood (2026-05-10)** — pcq[mcp] 4.1.0, Claude Code MCP. First v4.1.0 MCP loop end-to-end via `mcp__pcq__*` tools (no subprocess CLI). 3 sequential generations. - **CQ Worker Dogfood (2026-05-10)** — pcq[mcp] 4.2.0, CQ Go service worker on RTX 5080. First production CQ Go worker dispatch end-to-end; verified `cq.yaml` + `CQ_CONFIG_JSON` + 6-artifact protocol. ## Spec The contract surface (cq.yaml format, JSON contracts, MCP tools, strictness, schema versioning, conformance) lives at the repository root under `spec/`, separately from the Python implementation in `src/pcq/`. 
Other languages and runtimes can target the contract without depending on Python. - Index: https://github.com/playidea-lab/pcq/blob/main/spec/INDEX.md - JSON Schemas (auto-exported from `pcq.agent.json_contracts.JSON_CONTRACTS` via `scripts/export_schemas.py`): https://github.com/playidea-lab/pcq/tree/main/spec/schemas - Versioning policy: https://github.com/playidea-lab/pcq/blob/main/spec/VERSIONING.md - Conformance suite: https://github.com/playidea-lab/pcq/blob/main/spec/CONFORMANCE.md (golden input/output pairs at `tests/conformance///`) CI guards drift between the registry and the on-disk schemas via `uv run python scripts/export_schemas.py --check`. ## Roadmap The thesis stays the same: pcq does not compete with the means of training. The work ahead strengthens the framework-neutral evidence and control layer — broader real-world contract coverage, deeper validation/lineage facts, more machine-readable surfaces for agent runtimes, tighter integration with the CQ managed consumer. Built-in models, losses, datasets, and per-framework adapter matrices remain deliberately out of scope. - v4 Direction: https://github.com/playidea-lab/pcq/blob/main/docs/V4_DIRECTION.md - Completion Roadmap: https://github.com/playidea-lab/pcq/blob/main/docs/PCQ_COMPLETION_ROADMAP.md - Releases: https://github.com/playidea-lab/pcq/releases ## Related Docs - v4 direction: https://github.com/playidea-lab/pcq/blob/main/docs/V4_DIRECTION.md - Introduction: https://github.com/playidea-lab/pcq/blob/main/docs/INTRODUCTION.md - JSON contracts: https://github.com/playidea-lab/pcq/blob/main/docs/JSON_CONTRACTS.md - Agent guide: https://github.com/playidea-lab/pcq/blob/main/docs/AGENT_OPERATING_GUIDE.md - Strictness: https://github.com/playidea-lab/pcq/blob/main/docs/STRICTNESS.md - Runtime contract: https://github.com/playidea-lab/pcq/blob/main/docs/CQ_YAML_RUNTIME_CONTRACT.md - MCP integration: https://github.com/playidea-lab/pcq/blob/main/docs/MCP_INTEGRATION.md