Dataset Consumers

Published datasets are workflow-projected rows hosted on the Hugging Face Hub. Use this path for evaluation jobs, teacher/student training, SFT/RL pipelines, dashboards, and sharing datasets publicly or privately.

Load Rows

from datasets import load_dataset

# Download the published dataset and select its "train" split.
ds = load_dataset("owner/team-traces", split="train")
print(ds[0])  # the first row, returned as a dict

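To stream rows without downloading the full dataset first:

from datasets import load_dataset

# With streaming=True and no split argument, the result is keyed by split name.
ds = load_dataset("owner/team-traces", streaming=True)
for row in ds["train"]:
    print(row)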

Rows are workflow-specific. A command-trajectory eval dataset and a PR intent summary dataset will not share a row schema, even if both were derived from the same bucket traces.
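Because the schema depends on the workflow, a consumer may want to check for the columns it needs before processing rows. The following is a minimal sketch, not part of any published API: the helper `missing_columns` and the column names (`trace_id`, `commands`, `exit_codes`, `pr_intent`) are hypothetical and stand in for whatever a given workflow actually projects.

```python
from typing import Iterable, Mapping


def missing_columns(row: Mapping[str, object], required: Iterable[str]) -> list[str]:
    """Return the required column names that are absent from a row."""
    return [name for name in required if name not in row]


# Hypothetical rows from two workflows that project the same traces differently.
trajectory_row = {"trace_id": "t-1", "commands": ["ls", "pytest"], "exit_codes": [0, 1]}
summary_row = {"trace_id": "t-1", "pr_intent": "Fix flaky test"}

# Columns this (hypothetical) consumer depends on.
required = ["trace_id", "commands"]

print(missing_columns(trajectory_row, required))  # []
print(missing_columns(summary_row, required))     # ['commands']
```

With a split loaded non-streaming through `load_dataset`, `ds.column_names` and `ds.features` report the actual schema, so the same check can run once against the column list instead of against every row.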

File-Oriented Access

For published Hugging Face datasets, hf-mount can expose shards as files:

hf-mount start repo datasets/your-org/agent-traces /mnt/traces
ls /mnt/traces/data/
head -n 1 /mnt/traces/data/*.jsonl
hf-mount stop /mnt/traces

For private or gated datasets, authenticate with Hugging Face first.
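Once shards are visible as files, they can be read with ordinary file I/O. The sketch below assumes the layout shown in the mount example above (JSON Lines shards directly under a `data/` directory); the helper name `iter_jsonl_rows` is hypothetical, and a dataset published in another format would need a different reader.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_rows(data_dir: str) -> Iterator[dict]:
    """Yield one dict per non-empty line from every *.jsonl file in data_dir."""
    # Sort shard paths so rows come back in a stable order across runs.
    for shard in sorted(Path(data_dir).glob("*.jsonl")):
        with shard.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines, e.g. a trailing newline
                    yield json.loads(line)
```

For example, `next(iter_jsonl_rows("/mnt/traces/data"))` would return the first row of the first shard while the mount is active. Reading lazily, one line at a time, avoids loading a whole shard into memory.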

Resolve Back To Evidence

Rows should carry enough trace references for a consumer to retrieve the underlying bucket evidence, provided the consumer is allowed to access the bucket:

opentraces trace get <trace-id> --remote owner/private-bucket --json
opentraces trail track <trace-id> --json
opentraces ctx step <trace-id> <step-index> --json

That lookup is separate from dataset loading. A public dataset can reference a private bucket without exposing the bucket itself.
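A consumer script could drive this lookup from a dataset row. The sketch below only wraps the `opentraces trace get` command shown above; it assumes that `--json` prints a single JSON document to stdout and that the row stores its reference under a `trace_id` key. Both are assumptions for illustration, and the helper names are hypothetical.

```python
import json
import subprocess


def trace_get_command(trace_id: str, remote: str) -> list[str]:
    """Build the argv for `opentraces trace get <trace-id> --remote <remote> --json`."""
    return ["opentraces", "trace", "get", trace_id, "--remote", remote, "--json"]


def fetch_trace(trace_id: str, remote: str) -> dict:
    """Run the CLI and parse its output.

    Assumes the CLI is installed, the caller is authorized for the bucket,
    and stdout holds one JSON document (an assumption, not confirmed here).
    """
    result = subprocess.run(
        trace_get_command(trace_id, remote),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

A call such as `fetch_trace(row["trace_id"], "owner/private-bucket")` would then succeed only for consumers with access to the bucket, which keeps evidence retrieval independent of who can load the dataset.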