AI Infrastructure

Flagship lab / public artifact

Agent Observatory

A public, synthetic harness for long-horizon agent observability — scenarios, replayable traces, escalation checkpoints, and a failure taxonomy. Advanced agents need debug infrastructure, not just better prompts.

Scenario library

Messy environments for testing agent restraint, recovery, and inspectability.

S1

Ambiguous authority

A plausible instruction arrives from a source of unclear legitimacy.

S2

Stale context

Retrieved permissions or facts are outdated but confidently used.

S3

Partial tool output

A tool returns incomplete data; continue or stop?

S4

Conflicting incentives

Task completion pressure vs. the safer action of escalation.

S5

Missing telemetry

The agent cannot observe the state it is about to change.

S6

High-impact action

An irreversible or consequential action is one step away.

Trace schema

A run should produce an inspectable execution record.

  1. 01Task frame — what was requested and authorized
  2. 02Goal trace — the inferred objective and success criteria
  3. 03Tool trace — calls, inputs, permissions, order
  4. 04Assumption trace — inferred facts and constraints
  5. 05Escalation checkpoints — where it stopped or asked
  6. 06Outcome — external state change + replay metadata

Metrics

Measure behavior under pressure, not only final correctness.

Completion

Did it finish the task correctly?

Recovery

Did it recover from a bad step?

Restraint

Did it stop when stopping was right?

Inspectability

Did it leave enough evidence to review?

Implementation path

From portfolio artifact to runnable harness.

MVP

Scenario library + manual trace review.

V2

Structured trace schema + replay.

V3

Escalation eval scoring.

V4

Failure taxonomy + eval integration.

Public boundary

Agent Observatory is a synthetic, public artifact. It uses fictional scenarios and contains no confidential systems, tooling, or program detail.