
Dataset Publish

opentraces dataset publish <name> uploads approved workflow rows and contract files for a named dataset to its active HuggingFace remote. It never appends to an existing shard in place.

opentraces dataset review approve my-dataset --all
opentraces dataset remote create my-dataset owner/team-traces --private
opentraces dataset publish my-dataset --check-only
opentraces dataset publish my-dataset
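
The last two commands above form a check-then-publish pattern, which can be scripted. The following is a minimal sketch, not part of opentraces: the helper names are hypothetical, and it assumes that the CLI exits with a non-zero status when a gate blocks or the check fails, which this page does not state.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def check_then_publish(dataset: str, runner: Runner = run_cli) -> bool:
    """Run the --check-only pass first and upload only if it succeeds.

    Assumption: a non-zero exit code means the check did not pass.
    """
    base = ["opentraces", "dataset", "publish", dataset]
    if runner(base + ["--check-only"]) != 0:
        return False
    return runner(base) == 0
```

Passing the runner as a parameter keeps the sketch testable without the CLI installed; in real use, `check_then_publish("my-dataset")` would shell out to opentraces.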

Options

opentraces dataset publish my-dataset
opentraces dataset publish my-dataset --to owner/team-dataset
opentraces dataset publish my-dataset --check-only
opentraces dataset publish my-dataset --min-retention 0.5
opentraces dataset publish my-dataset --exclude-state lost --exclude-state never_committed
Flag                     Description
--to TEXT                Remote name or owner/name override
--check-only             Run gates and stage without upload
--resume TEXT            Resume a previous publication run id
--min-retention FLOAT    Drop rows whose mean patch retention is below the threshold
--exclude-state TEXT     Drop rows containing a patch with this survival state; repeatable
--json                   Emit structured JSON
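
The two row filters, --min-retention and --exclude-state, can be read as a single predicate over a row's patches. The sketch below illustrates that reading only; the patch keys `retention` and `state` are hypothetical, and keeping rows that have no patches is an assumption, not documented behavior.

```python
from statistics import mean
from typing import Iterable, Mapping, Optional, Sequence


def keep_row(
    patches: Sequence[Mapping[str, object]],
    min_retention: Optional[float] = None,
    exclude_states: Iterable[str] = (),
) -> bool:
    """Return True if a row survives both filters.

    Each patch is a mapping with hypothetical keys:
      'retention': float in [0, 1]
      'state':     survival state string, e.g. 'lost'
    """
    excluded = set(exclude_states)
    # --exclude-state: one matching patch is enough to drop the row.
    if any(p["state"] in excluded for p in patches):
        return False
    # --min-retention: drop only when the mean is strictly below the threshold.
    # Assumption: a row with no patches is not affected by this filter.
    if min_retention is not None and patches:
        if mean(float(p["retention"]) for p in patches) < min_retention:
            return False
    return True
```

Under this reading, a row whose mean retention equals the threshold exactly is kept, since the flag drops rows "below" it.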

Bucket Sync Is Separate

opentraces bucket remote push
opentraces bucket remote pull
opentraces bucket remote status

Bucket sync moves raw retained evidence. Dataset publish moves approved projected rows. A private bucket remote can exist even when no dataset has been published.

Security And Publication Gates

Publication gates operate on dataset rows. If a workflow requires sanitization or LLM review, it should run those steps before approving rows.

opentraces security tools list
opentraces security sanitize --tools regex,entropy
opentraces setup llm-review
opentraces dataset publish my-dataset --check-only

dataset publish --check-only also blocks any row that does not satisfy the dataset's required security tools (block reason required_security_tools_missing), alongside the review, security-version, and privacy gates. The check is keyed on per-row execution evidence: each row records the tools that actually ran over it (tools_applied, in row provenance), and the gate blocks the row if that set does not cover the required tools. A row appended while a required tool was disabled therefore stays blocked even if the tool is re-enabled afterward. The dataset's required tools come from its manifest policy; inspect or adjust them with opentraces dataset security <name>.
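
The coverage rule described above amounts to a set comparison between the tools recorded on the row and the tools the dataset requires. The following sketch shows that comparison; the function names and the dict shape of the provenance record are hypothetical, and only tools_applied and the block reason come from this page.

```python
from typing import Iterable, List, Mapping, Optional

BLOCK_REASON = "required_security_tools_missing"


def missing_required_tools(
    tools_applied: Iterable[str], required_tools: Iterable[str]
) -> List[str]:
    """Return the required tools that did not run over the row, sorted."""
    return sorted(set(required_tools) - set(tools_applied))


def gate_row(
    provenance: Mapping[str, object], required_tools: Iterable[str]
) -> Optional[str]:
    """Return the block reason if the row's evidence does not cover the
    required tools, otherwise None.

    Assumption: provenance is a mapping with a 'tools_applied' list.
    """
    applied = provenance.get("tools_applied", [])
    if missing_required_tools(applied, required_tools):
        return BLOCK_REASON
    return None
```

Because the check reads only what the row itself recorded, the current enabled or disabled status of a tool never enters the comparison. That is why re-enabling a tool later does not unblock rows appended while it was off.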

Rows without an approval state are filtered out. Gate failures surface in the CLI output and, in JSON mode, in the publication payload.

Upload Shape

Each publish creates a new shard:

data/
  rows_20260521T142300Z_a1b2c3d4.jsonl
  rows_20260521T151500Z_e5f6a7b8.jsonl

The dataset card and schema contract files are regenerated from the local models and row manifest.
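
The shard names listed above appear to follow the pattern rows_<UTC timestamp>_<8 hex characters>.jsonl. The sketch below reproduces that pattern for illustration. How opentraces derives the 8-character suffix is not stated here, so the random suffix is an assumption, and `shard_name` is a hypothetical helper.

```python
import secrets
from datetime import datetime, timezone
from typing import Optional


def shard_name(now: Optional[datetime] = None, suffix: Optional[str] = None) -> str:
    """Build a shard path such as data/rows_20260521T142300Z_a1b2c3d4.jsonl.

    Assumption: the suffix is 8 random hex characters; the real tool may
    derive it differently (for example from a content hash or run id).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalize timezone-aware datetimes to UTC before formatting.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    if suffix is None:
        suffix = secrets.token_hex(4)  # 4 bytes -> 8 hex characters
    return f"data/rows_{stamp}_{suffix}.jsonl"
```

A timestamp plus a distinct suffix gives every publish its own file name, which fits the rule that an existing shard is never appended to in place.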