
Assess

opentraces assess scores committed traces against five consumer-facing rubrics. Run it after committing, before you push:

opentraces assess

Scores are printed to the terminal. Low-scoring traces show which checks failed so you can decide whether to fix or push anyway. Assessment only runs against committed traces — run opentraces commit first if your inbox isn't empty.

You can also score and push in one step with opentraces push --assess, which uploads and embeds the scorecard in the HuggingFace dataset card. See Push for details.

How scoring works

Assessment is deterministic by default: every check is a Python function over the TraceRecord fields. No LLM calls, no external requests, no randomness. The same trace always produces the same score.

Each trace is scored against all five personas. A persona's score is the weighted average of its individual checks (0-100%); the batch score is the average of the per-trace scores.
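The aggregation can be sketched as follows. This is an illustrative model, not the actual opentraces implementation; the function names and the idea of passing explicit weights are assumptions. Skipped checks are simply omitted from the input, so they neither help nor hurt the score (see Fidelity-aware scoring below):

```python
def weighted_score(checks):
    """Score one persona for one trace.

    checks: list of (passed: bool, weight: float) tuples.
    Skipped checks are omitted from the list entirely, so they are
    excluded from both the numerator and the denominator.
    """
    total_weight = sum(w for _, w in checks)
    if total_weight == 0:
        return 0.0
    earned = sum(w for passed, w in checks if passed)
    return 100.0 * earned / total_weight


def batch_score(per_trace_scores):
    """The batch score is the plain average of per-trace scores."""
    return sum(per_trace_scores) / len(per_trace_scores)


# Three equally weighted checks, one failing: 2.0 of 3.0 weight earned.
print(round(weighted_score([(True, 1.0), (True, 1.0), (False, 1.0)]), 1))  # 66.7
```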

The five personas

| Persona | What it checks | Who uses it |
|---|---|---|
| Conformance | Schema validity: trace IDs, content hash, timestamps, steps present, security scanned | Anyone ingesting opentraces data |
| Training | SFT readiness: alternating roles, tool_call/observation pairing, reasoning coverage | Model fine-tuners |
| RL | Outcome signals: committed flag or terminal_state, signal confidence, cost, model ID | RLHF / reward modeling |
| Analytics | Observability: cache hit rate, cost, duration, per-step timestamps | Infra / cost dashboards |
| Domain | Discoverability: language ecosystem, dependencies, task description, VCS info | Dataset search and filtering |

Conformance

Structural checks that apply to every trace regardless of agent type:

| Check | Description |
|---|---|
| C1: schema_version | Matches current schema version |
| C2: trace_id format | Valid UUID-like string (≥32 chars with dashes) |
| C3: content_hash | 64-character hex, present |
| C4: agent name | Non-empty agent identifier |
| C5: timestamps | Both timestamp_start and timestamp_end present |
| C6: steps present | At least one step recorded |
| C7: security scanned | security.scanned = True |
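Checks like C2 and C3 are plain predicates over trace fields, which is what keeps assessment deterministic. A minimal sketch, assuming traces are available as dicts; the function names and dict shape are illustrative, not the opentraces API:

```python
import re


def check_trace_id(trace: dict) -> bool:
    """C2: trace_id is a UUID-like string (>=32 chars, containing dashes)."""
    tid = trace.get("trace_id", "")
    return len(tid) >= 32 and "-" in tid


def check_content_hash(trace: dict) -> bool:
    """C3: content_hash is present and is 64 lowercase hex characters
    (the size of a SHA-256 digest)."""
    h = trace.get("content_hash", "")
    return bool(re.fullmatch(r"[0-9a-f]{64}", h))


trace = {
    "trace_id": "1f1e9f2a-3c4d-4e5f-8a6b-7c8d9e0f1a2b",
    "content_hash": "a" * 64,
}
```

Because the checks are pure functions of the record, the same trace always produces the same result.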

Training

Grounded in ADP (Agent Data Protocol) empirical requirements for SFT pipelines:

| Check | Description |
|---|---|
| T1: alternating roles | user/agent steps alternate ≥90% of transitions (≥50% for conversation-turn sources) |
| T2: tool_call pairing | Every tool_call_id has a matching observation |
| T3: reasoning coverage | reasoning_content present on agent steps |
| T4: data cleanliness | No redaction markers in step content |
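T1's fidelity-dependent threshold can be sketched like this. The function names are hypothetical and the role labels are simplified to strings; only the 90%/50% thresholds come from the table above:

```python
def alternating_ratio(roles: list[str]) -> float:
    """Fraction of adjacent step pairs whose roles differ.

    roles: e.g. ["user", "agent", "user", "agent"].
    Returns 1.0 for a perfectly alternating sequence.
    """
    if len(roles) < 2:
        return 1.0
    transitions = len(roles) - 1
    alternating = sum(1 for a, b in zip(roles, roles[1:]) if a != b)
    return alternating / transitions


def check_alternating(roles: list[str], fidelity: str = "individual_api_call") -> bool:
    """T1: conversation-turn sources get the relaxed 50% threshold."""
    threshold = 0.5 if fidelity == "conversation_turn" else 0.9
    return alternating_ratio(roles) >= threshold
```

A sequence like user, agent, agent, user alternates on 2 of 3 transitions (~67%), so it fails at devtime fidelity but passes for a conversation-turn source.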

RL

Checks the reward proxy signal appropriate to the agent's execution context:

| Check | Description |
|---|---|
| RL1: outcome signal | committed=True for devtime agents; terminal_state or reward for runtime agents |
| RL2: signal confidence | signal_confidence is derived or annotated (not default) |
| RL3: cost signal | estimated_cost_usd > 0 (differentiates traces for cost-aware RL) |
| RL4: model identified | agent.model populated (needed for per-model policy training) |
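RL1 branches on the execution context. A minimal sketch, assuming a flat dict with a `context` field of `"devtime"` or `"runtime"`; the field layout is an assumption for illustration:

```python
def check_outcome_signal(trace: dict) -> bool:
    """RL1: the reward proxy depends on execution context.

    Devtime agents prove their outcome by the committed flag; runtime
    agents need a terminal_state or an explicit reward instead.
    """
    if trace.get("context") == "devtime":
        return trace.get("committed") is True
    return trace.get("terminal_state") is not None or trace.get("reward") is not None
```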

Analytics

Observability checks that differentiate opentraces from trace-level-only sources. Checks that require per-step data are automatically skipped for conversation_turn fidelity sources (e.g. Hermes imports), which only have session-level timestamps:

| Check | Description |
|---|---|
| A1: cache_hit_rate | Computed and in [0.0, 1.0] (skipped for runtime) |
| A2: estimated_cost | estimated_cost_usd > 0 |
| A3: total_duration | total_duration_s > 0 (skipped for runtime) |
| A4: step timestamps | Timestamps on >80% of steps (skipped for conversation_turn) |
| A5: token breakdown | Per-step input_tokens and output_tokens present |
| A6: token consistency | Step-sum ≈ session total (within 10%) |
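A6's "within 10%" tolerance can be sketched as a relative-error comparison. The function name and argument shape are illustrative; only the 10% tolerance comes from the table:

```python
def check_token_consistency(step_tokens: list[int], session_total: int,
                            tolerance: float = 0.10) -> bool:
    """A6: sum of per-step token counts must fall within 10% of the
    session-level total."""
    if session_total <= 0:
        return False
    step_sum = sum(step_tokens)
    return abs(step_sum - session_total) / session_total <= tolerance
```

For example, steps summing to 195 tokens against a session total of 200 is a 2.5% deviation and passes; 150 against 200 is 25% and fails.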

Domain

Checks that enable HuggingFace dataset discovery and filtering:

| Check | Description |
|---|---|
| D1: language_ecosystem | Populated (skipped for runtime with no code-writing tool calls) |
| D2: dependencies | At least one dependency when language detected |
| D3: task description | Meaningful task description (>10 chars) |
| D4: VCS info | environment.vcs.base_commit present (skipped for runtime) |
| D5: code snippets | At least one snippet captured (skipped for runtime) |
| D6: attribution | Attribution data present |
| D7: agent identity | Agent name + version OR name alone for runtime sources |

Fidelity-aware scoring

Some sources (like Hermes imports) provide conversation turns rather than individual API calls. Checks that require call-level data are automatically marked skipped for these sources and excluded from the weighted average. This prevents penalizing community datasets for structural limitations of the source format.

The step_fidelity field on each trace records this: "individual_api_call" (devtime) vs "conversation_turn" (Hermes, other community imports).
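The skip mechanism amounts to filtering the check set before scoring, so a skipped check drops out of both sides of the weighted average. A sketch under assumed names; which checks count as call-level is configuration, not something this example asserts:

```python
def applicable_checks(checks: dict, step_fidelity: str,
                      call_level_ids: set) -> dict:
    """Return only the checks that apply at this fidelity level.

    checks: mapping of check ID -> check function.
    Skipped checks are removed entirely, so they are excluded from the
    weighted average rather than counted as failures.
    """
    if step_fidelity == "individual_api_call":
        return dict(checks)
    return {cid: fn for cid, fn in checks.items() if cid not in call_level_ids}


# A conversation-turn source loses the call-level checks (here, a
# hypothetical set containing only A4) but keeps the rest.
remaining = applicable_checks({"A4": None, "A2": None},
                              "conversation_turn", {"A4"})
```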

Gate thresholds

The gate marks a batch FAILING when any persona falls below its threshold:

| Persona | Min (any trace) | Min (batch average) |
|---|---|---|
| Conformance | 70% | 80% |
| Training | 40% | 45% |
| RL | 40% | |
| Analytics | 60% | 70% |
| Domain | 45% | 55% |
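The gate logic can be sketched as a two-condition test per persona: the worst trace must clear the per-trace minimum and the batch average must clear the batch minimum. The function name and data shape are hypothetical; RL is left out of the sketch because the table lists only its per-trace minimum:

```python
THRESHOLDS = {
    # persona: (min for any single trace, min for the batch average)
    "conformance": (70, 80),
    "training": (40, 45),
    "analytics": (60, 70),
    "domain": (45, 55),
}


def gate_status(per_trace: dict) -> str:
    """per_trace maps persona name -> list of per-trace scores (0-100)."""
    for persona, (min_any, min_batch) in THRESHOLDS.items():
        scores = per_trace.get(persona, [])
        if not scores:
            continue
        if min(scores) < min_any or sum(scores) / len(scores) < min_batch:
            return "FAILING"
    return "PASSING"
```

Note that a batch can fail on the average alone: Training scores of 42 and 43 both clear the 40% per-trace floor, but the 42.5% average misses the 45% batch minimum.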

Gate FAILING does not block push by default. It's a signal, not a hard stop — you can push a failing batch and the gate status will be visible in the dataset card. Use --gate to enforce hard blocking (coming soon).

Dataset card integration

When you push with --assess, scores are embedded in the HuggingFace dataset card as badges and a scorecard table, and written to YAML frontmatter as searchable keys. See Push for details.