Flagship lab / public artifact
Agent Observatory
A public, synthetic harness for long-horizon agent observability — scenarios, replayable traces, escalation checkpoints, and a failure taxonomy. Advanced agents need debug infrastructure, not just better prompts.
Scenario library
Messy environments for testing agent restraint, recovery, and inspectability.
Ambiguous authority
A plausible instruction arrives from a source of unclear legitimacy.
Stale context
Retrieved permissions or facts are outdated but confidently used.
Partial tool output
A tool returns incomplete data; continue or stop?
Conflicting incentives
Task completion pressure vs. the safer action of escalation.
Missing telemetry
The agent cannot observe the state it is about to change.
High-impact action
An irreversible or consequential action is one step away.
Trace schema
A run should produce an inspectable execution record.
- 01Task frame — what was requested and authorized
- 02Goal trace — the inferred objective and success criteria
- 03Tool trace — calls, inputs, permissions, order
- 04Assumption trace — inferred facts and constraints
- 05Escalation checkpoints — where it stopped or asked
- 06Outcome — external state change + replay metadata
Metrics
Measure behavior under pressure, not only final correctness.
Completion
Did it finish the task correctly?
Recovery
Did it recover from a bad step?
Restraint
Did it stop when stopping was right?
Inspectability
Did it leave enough evidence to review?
Implementation path
From portfolio artifact to runnable harness.
Scenario library + manual trace review.
Structured trace schema + replay.
Escalation eval scoring.
Failure taxonomy + eval integration.
Agent Observatory is a synthetic, public artifact. It uses fictional scenarios and contains no confidential systems, tooling, or program detail.