Dataset Rows

A dataset is a local HuggingFace-shaped row store produced by a workflow. It is not the raw trace bucket.

Create

opentraces workflow create my-workflow --template default
opentraces dataset new my-dataset --workflow ./workflows/my-workflow/

Ad-hoc row seeding is available when you already have JSONL:

opentraces dataset new my-import --rows-file rows.jsonl --schema schema.json

Run

opentraces dataset run my-dataset --dry-run --limit 5 --json
opentraces dataset run my-dataset
opentraces dataset run my-dataset --scope trace --trace <trace-id>
opentraces dataset run my-dataset --since-last-run

dataset run invokes the workflow and appends rows locally. It can read from Trace Index candidates, a project scope, the current working directory, or a specific trace.

Dataset Security Policy

Each dataset carries its own resolved security policy in the manifest (DatasetManifest.security). The policy is seeded from the security: contract in the source workflow's front matter when you run dataset new --workflow <path>, and it is pinned to that workflow's digest (source_workflow_digest). The resolved enabled_tools list starts as the contract's required tools plus its default_enabled_tools, in canonical registry order.
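
The seeding rule for enabled_tools can be modeled in a few lines of Python. This is only an illustrative sketch, not the opentraces implementation: REGISTRY_ORDER and resolve_enabled_tools are hypothetical names, and the registry order shown is assumed for the example (the tool names are the ones used on this page).

```python
from typing import Iterable, List, Sequence

# Hypothetical canonical registry order, assumed for illustration only.
REGISTRY_ORDER: Sequence[str] = ("regex", "path_anonymizer", "business_logic")


def resolve_enabled_tools(
    required_tools: Iterable[str],
    default_enabled_tools: Iterable[str],
    registry_order: Sequence[str] = REGISTRY_ORDER,
) -> List[str]:
    """Return required plus default-enabled tools, sorted by registry position."""
    wanted = set(required_tools) | set(default_enabled_tools)
    unknown = wanted - set(registry_order)
    if unknown:
        # Assumption: a tool that is not in the registry is treated as an error.
        raise ValueError(f"unknown tools: {sorted(unknown)}")
    # Iterating over the registry (not the inputs) is what fixes the output order.
    return [tool for tool in registry_order if tool in wanted]


enabled = resolve_enabled_tools(
    required_tools=["path_anonymizer", "regex"],
    default_enabled_tools=["regex"],
)
print(enabled)  # ['regex', 'path_anonymizer']
```

Because the result is built by walking the registry, the order in which the contract lists its tools does not affect the resolved list, and duplicates between the two inputs collapse naturally.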

The policy is per-dataset, not a global config toggle. Toggling a tool on one dataset never affects another dataset or the bucket egress policy.

opentraces dataset security my-dataset
opentraces dataset security my-dataset --json

--json emits the resolved policy under a security block: source, source_workflow_digest, required_tools, optional_tools, enabled_tools, disallowed_tools, overrides, scope (always dataset), required_satisfied, and missing_required_tools.
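
A script that consumes the --json output might cross-check the reported fields against each other. The sketch below assumes that missing_required_tools is simply the set of required_tools absent from enabled_tools, which this page does not state explicitly. The sample payload is made up, includes only a subset of the listed fields, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List

# Made-up sample using field names listed above; values are illustrative only.
SAMPLE = """
{
  "security": {
    "required_tools": ["regex"],
    "optional_tools": ["path_anonymizer", "business_logic"],
    "enabled_tools": ["path_anonymizer"],
    "scope": "dataset",
    "required_satisfied": false,
    "missing_required_tools": ["regex"]
  }
}
"""


def missing_required(security: Dict[str, Any]) -> List[str]:
    """Required tools that are not currently enabled (assumed definition)."""
    enabled = set(security["enabled_tools"])
    return [tool for tool in security["required_tools"] if tool not in enabled]


def is_consistent(security: Dict[str, Any]) -> bool:
    """Check that the reported summary fields agree with the tool lists."""
    missing = missing_required(security)
    return (
        sorted(security["missing_required_tools"]) == sorted(missing)
        and security["required_satisfied"] == (not missing)
    )


payload = json.loads(SAMPLE)
security = payload["security"]
print(missing_required(security), is_consistent(security))
```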

Toggle an optional tool on a single dataset:

opentraces dataset security my-dataset --tool business_logic --enable
opentraces dataset security my-dataset --tool path_anonymizer --disable

--tool is repeatable and requires exactly one of --enable or --disable. Only optional tools can be toggled this way. A required tool can be disabled only when the workflow contract sets allow_disable_required: true and you pass --unsafe-override (optionally with --reason "<text>"); the opt-out is recorded in the manifest as an override. If the contract forbids disabling the tool, the command exits with status 2.

opentraces dataset security my-dataset --tool regex --disable --unsafe-override --reason "rows are synthetic fixtures"
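
The toggle rules can be summarized as a small decision function. This is a sketch under stated assumptions, not the CLI's actual code: the Policy class, the toggle function, and the shape of the override record are hypothetical. The page only specifies exit status 2 for the case where the contract forbids the change; returning 2 for a missing --unsafe-override or an unknown tool is an assumption made here for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

EXIT_OK = 0
EXIT_REFUSED = 2  # documented for "contract forbids it"; reused for other refusals (assumption)


@dataclass
class Policy:
    required_tools: List[str]
    optional_tools: List[str]
    enabled_tools: List[str]
    allow_disable_required: bool = False
    overrides: List[Dict[str, Optional[str]]] = field(default_factory=list)


def toggle(policy: Policy, tool: str, enable: bool,
           unsafe_override: bool = False, reason: Optional[str] = None) -> int:
    """Apply one --tool toggle to a policy and return a process-style exit code."""
    is_required = tool in policy.required_tools
    if not is_required and tool not in policy.optional_tools:
        return EXIT_REFUSED  # assumption: unknown tools are rejected
    if is_required and not enable:
        # Disabling a required tool needs both the contract opt-in and the flag.
        if not (policy.allow_disable_required and unsafe_override):
            return EXIT_REFUSED
        # Hypothetical override record; the real manifest shape is not documented here.
        policy.overrides.append({"tool": tool, "reason": reason})
    if enable and tool not in policy.enabled_tools:
        policy.enabled_tools.append(tool)  # registry ordering is ignored in this sketch
    elif not enable and tool in policy.enabled_tools:
        policy.enabled_tools.remove(tool)
    return EXIT_OK


policy = Policy(required_tools=["regex"], optional_tools=["business_logic"],
                enabled_tools=["regex"])
print(toggle(policy, "business_logic", enable=True))                 # 0
print(toggle(policy, "regex", enable=False, unsafe_override=True))   # 2
```

The second call is refused because allow_disable_required defaults to False in this model, mirroring a contract that does not permit opting out of a required tool.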

This is distinct from opentraces bucket security, which governs the machine-wide bucket egress policy over global tool flags. Dataset security governs what a dataset's rows carry before dataset publication.

Review States

State      Meaning
inbox      Row needs review
approved   Row is publishable
published  Row was uploaded upstream
rejected   Row is kept local only
blocked    Row needs action before approval

opentraces dataset status my-dataset --json
opentraces dataset review my-dataset --json
opentraces dataset review approve my-dataset <row-id>
opentraces dataset review reject my-dataset <row-id>
opentraces dataset review reset my-dataset <row-id>
opentraces dataset review approve my-dataset --all
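
As a rough model of the review states, the five states can be written as an enum, with only approved rows selected for publication. The row shape used here (a dict with id and state keys) and the function name are hypothetical and chosen just for the example. The real row format is not described on this page.

```python
from enum import Enum
from typing import Dict, Iterable, List


class ReviewState(str, Enum):
    INBOX = "inbox"          # row needs review
    APPROVED = "approved"    # row is publishable
    PUBLISHED = "published"  # row was uploaded upstream
    REJECTED = "rejected"    # row is kept local only
    BLOCKED = "blocked"      # row needs action before approval


def publishable_row_ids(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return ids of rows whose state is 'approved' (hypothetical row shape)."""
    return [
        row["id"]
        for row in rows
        if ReviewState(row["state"]) is ReviewState.APPROVED
    ]


rows = [
    {"id": "r1", "state": "inbox"},
    {"id": "r2", "state": "approved"},
    {"id": "r3", "state": "blocked"},
    {"id": "r4", "state": "approved"},
]
print(publishable_row_ids(rows))  # ['r2', 'r4']
```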

Remotes

opentraces dataset remote create my-dataset owner/team-traces --private
opentraces dataset remote add my-dataset owner/existing-traces
opentraces dataset remote list my-dataset --verbose
opentraces dataset remote visibility my-dataset owner/team-traces --public
opentraces dataset remote remove my-dataset owner/team-traces

Dataset remotes are independent of bucket remotes. A private bucket remote can hold raw evidence while a dataset remote holds only approved projected rows.

Schedules

opentraces dataset schedule add my-dataset --every 1h --approve-new --publish-check-only
opentraces dataset schedule list
opentraces dataset schedule pause my-dataset
opentraces dataset schedule resume my-dataset
opentraces dataset schedule remove my-dataset

Schedules rerun workflows over retained evidence. They do not bypass review or publication gates unless you explicitly pass approval/publish flags.