Dataset Consumers

Published datasets are workflow-projected rows on the Hugging Face Hub. Use this path for evaluation jobs, teacher/student training, SFT/RL pipelines, dashboards, and public or private dataset sharing.

Load Rows

from datasets import load_dataset

ds = load_dataset("owner/team-traces", split="train")
print(ds[0])

For streaming:

from datasets import load_dataset

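# streaming=True returns iterable splits and avoids downloading every shard up front
ds = load_dataset("owner/team-traces", streaming=True)
for row in ds["train"]:
    # each row is a plain dict keyed by the workflow's column names
    print(row)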

Rows are workflow-specific. A command-trajectory eval dataset and a PR intent summary dataset will not have the same row schema, even if they came from the same bucket traces.
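Because the schema depends on the workflow, it can help to compare row fields before wiring a dataset into a pipeline. The following is a minimal sketch, assuming rows arrive as plain dictionaries (as they do when iterating a loaded dataset). The helper names and the field names in the sample rows are hypothetical and are not taken from any real opentraces workflow.

```python
def row_schema(row):
    """Map each field name to the name of its Python type."""
    return {key: type(value).__name__ for key, value in row.items()}


def schema_diff(row_a, row_b):
    """Return (fields only in a, fields only in b, shared fields whose types differ)."""
    a, b = row_schema(row_a), row_schema(row_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


# Hypothetical rows from two workflows projected over the same bucket traces.
eval_row = {"trace_id": "t-1", "commands": ["ls"], "passed": True}
summary_row = {"trace_id": "t-1", "summary": "Fix flaky test"}

print(schema_diff(eval_row, summary_row))
```

Comparing a single sample row is only a quick sanity check; a loaded dataset also exposes its declared column types through its features attribute.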

A dataset is a growing, reviewed seal (ADR-0008). Each row was appended under review gates and carries its own provenance record, which points back to the workflow digest, the bucket state digest, and any recorded judgment answers that produced it (see Row Provenance: The Contract Triple). Consumers who rely on a dataset for grading or replay can run opentraces dataset verify <name> upstream of publication to confirm that the published rows still reproduce from the recorded inputs.
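A consumer may want to reject rows that lack a provenance record before using them for grading. The sketch below assumes the record is a nested dictionary under a provenance key with workflow_digest and bucket_state_digest entries. Those key names are hypothetical; the actual layout is whatever the publishing workflow defines. Judgment answers are treated as optional, since a row only carries them when they were recorded.

```python
# Hypothetical key names; check the dataset's own documentation for the real ones.
REQUIRED_DIGESTS = ("workflow_digest", "bucket_state_digest")


def missing_provenance(row, key="provenance"):
    """Return the required digest names that are absent or empty in a row."""
    record = row.get(key) or {}
    return [name for name in REQUIRED_DIGESTS if not record.get(name)]


# Hypothetical rows: one complete, one missing its bucket state digest.
complete = {"provenance": {"workflow_digest": "sha256:aa", "bucket_state_digest": "sha256:bb"}}
partial = {"provenance": {"workflow_digest": "sha256:aa"}}

print(missing_provenance(complete))
print(missing_provenance(partial))
```

This only checks that the fields are present. It does not confirm that the rows reproduce from the recorded inputs.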

File-Oriented Access

For published Hugging Face datasets, the third-party hf-mount tool can expose shards as files:

hf-mount start repo datasets/your-org/agent-traces /mnt/traces
ls /mnt/traces/data/
head -n 1 /mnt/traces/data/*.jsonl
hf-mount stop /mnt/traces

hf-mount is an external tool, not part of opentraces; install it separately and check its own documentation for platform support. For private or gated datasets, authenticate with Hugging Face first.

Resolve Back To Evidence

Rows should carry enough trace references for a consumer to retrieve the underlying bucket evidence when access is allowed:

opentraces trace get <trace-id> --remote owner/private-bucket --json
opentraces trail track <trace-id> --json
opentraces ctx <trace-id>:<step-index> --json

That lookup is separate from dataset loading. A public dataset can reference a private bucket without exposing the bucket itself.
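To go from loaded rows to evidence lookups, a consumer can collect the distinct trace references and build one command per trace. This is a minimal sketch that only constructs the trace get command lines shown above and does not run them. The trace_id field name and the evidence_commands helper are hypothetical, so substitute whatever reference field the dataset actually carries.

```python
import shlex


def evidence_commands(rows, remote, field="trace_id"):
    """Build one `opentraces trace get` argv per distinct trace reference."""
    seen = set()
    commands = []
    for row in rows:
        trace_id = row.get(field)
        if not trace_id or trace_id in seen:
            continue  # skip rows without a reference and repeated traces
        seen.add(trace_id)
        commands.append(
            ["opentraces", "trace", "get", trace_id, "--remote", remote, "--json"]
        )
    return commands


# Hypothetical rows; two of them point at the same trace.
rows = [{"trace_id": "t-1"}, {"trace_id": "t-2"}, {"trace_id": "t-1"}]
for argv in evidence_commands(rows, "owner/private-bucket"):
    print(shlex.join(argv))
```

Keeping each command as an argument list means it can be passed to subprocess.run without shell quoting concerns, provided the caller has access to the bucket.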
