Compatibility infrastructure for AI agents

The tests didn't ship with the agents.

Dynobox is BrowserStack for agents. Run the same skill, MCP server, and workflow across every harness. Catch the break before your users do.

The problem

Skills ship to prod.
Their tests do not.

A model update ships. Your skill silently breaks. Nobody finds out until a user does.

Why now

Five harnesses. Weekly model updates. No shared test layer.

01 Skills are portable across Claude Code, Codex, Cursor, Gemini CLI & more.

02 Frontier models ship updates on a weekly cadence.

03 Each harness has different tools, permission models, and failure modes.

The wedge

Cross-harness skill testing.

Anthropic tests in Claude Code. OpenAI tests in Codex. Neither has any incentive to make the other's harness look good. Skills today, MCP servers and agent workflows next.

The product

A pass-rate matrix.

	Claude Code	Codex	Gemini
skills/commit	✓	✓	✕
skills/data-pipeline	✓	✕	✓
skills/refactor	✓	✓	✓

Skills run in disposable sandboxes across every harness. The compatibility matrix shows where your agent workflow breaks, and why.
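As a rough illustration of how a matrix like the one above could be derived, the sketch below rolls individual run results up into per-skill, per-harness pass rates. The `RunResult` shape and the `passRateMatrix` function are hypothetical names invented for this example; they are not Dynobox's report schema or API.

```typescript
// Hypothetical record of one scenario run; not Dynobox's actual report schema.
type RunResult = { skill: string; harness: string; passed: boolean };

// skill -> harness -> fraction of runs that passed (0..1).
type PassRateMatrix = Record<string, Record<string, number>>;

function passRateMatrix(results: RunResult[]): PassRateMatrix {
  // Tally passes and totals for every (skill, harness) cell.
  const tally: Record<string, Record<string, { passed: number; total: number }>> = {};
  for (const r of results) {
    const row = tally[r.skill] ?? (tally[r.skill] = {});
    const cell = row[r.harness] ?? (row[r.harness] = { passed: 0, total: 0 });
    cell.total += 1;
    if (r.passed) cell.passed += 1;
  }

  // Convert the tallies into rates.
  const matrix: PassRateMatrix = {};
  for (const skill of Object.keys(tally)) {
    const row = tally[skill] ?? {};
    const rates: Record<string, number> = {};
    for (const harness of Object.keys(row)) {
      const cell = row[harness];
      if (cell) rates[harness] = cell.passed / cell.total;
    }
    matrix[skill] = rates;
  }
  return matrix;
}
```

A cell at 1.0 would render as ✓ and anything lower as a break worth opening; where to draw that threshold is a design choice this sketch leaves open.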

The category

Evals test outputs.
We test execution.

Most AI testing scores what the model said. Dynobox verifies what actually happened: which skills loaded, which tools were called, which files changed, which APIs were hit. The same workflow, audited across every runtime.
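To make the distinction concrete, here is a minimal sketch of checking execution rather than output: assertions are evaluated against a recorded trace of what the agent did. The `TraceEvent` shape, the flattened `includes` field, and the rule that a skill counts as invoked once its `SKILL.md` is read are all assumptions made for illustration, not Dynobox's trace format or matching logic.

```typescript
// Hypothetical trace events; a real harness trace will look different.
type TraceEvent =
  | { type: "tool"; toolKind: string; input: string }
  | { type: "fileRead"; path: string };

// Assertion kinds borrowed from the names used in this deck; fields simplified.
type Assertion =
  | { kind: "skill.invoked"; skill: string }
  | { kind: "tool.called"; toolKind: string; includes: string }
  | { kind: "tool.notCalled"; toolKind: string; includes: string };

// True when some tool call of the given kind contains the substring.
function toolMatched(trace: TraceEvent[], toolKind: string, includes: string): boolean {
  return trace.some(
    (e) => e.type === "tool" && e.toolKind === toolKind && e.input.indexOf(includes) !== -1
  );
}

function checkAssertion(trace: TraceEvent[], a: Assertion): boolean {
  switch (a.kind) {
    case "skill.invoked": {
      // Assumption: reading skills/<name>/SKILL.md is the signal that a skill loaded.
      const suffix = `skills/${a.skill}/SKILL.md`;
      return trace.some(
        (e) => e.type === "fileRead" && e.path.slice(-suffix.length) === suffix
      );
    }
    case "tool.called":
      return toolMatched(trace, a.toolKind, a.includes);
    case "tool.notCalled":
      return !toolMatched(trace, a.toolKind, a.includes);
  }
}
```

Because the checks read the trace and never the model's final message, two runs with identical-looking answers can still produce different results when one of them skipped the skill or called a tool it should not have.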

How it works

One config.
Every harness.

define flow + assertions

↓

spin disposable sandboxes

↓

capture native traces

↓

pass-rate matrix + regression history
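The last step of the flow above, regression history, can be pictured as a diff between two matrix snapshots taken before and after a model release. The sketch below assumes a hypothetical `Matrix` snapshot shape and a `findRegressions` helper with an optional tolerance; none of these names come from Dynobox.

```typescript
// Hypothetical snapshot: skill -> harness -> pass rate (0..1).
type Matrix = Record<string, Record<string, number>>;

type Regression = { skill: string; harness: string; before: number; after: number };

// Report every cell whose pass rate dropped by more than `tolerance`.
function findRegressions(baseline: Matrix, current: Matrix, tolerance = 0): Regression[] {
  const out: Regression[] = [];
  for (const skill of Object.keys(baseline)) {
    const baseRow = baseline[skill] ?? {};
    const curRow = current[skill] ?? {};
    for (const harness of Object.keys(baseRow)) {
      const before = baseRow[harness];
      const after = curRow[harness];
      // Cells missing from the current run are skipped, not counted as regressions.
      if (before === undefined || after === undefined) continue;
      if (before - after > tolerance) out.push({ skill, harness, before, after });
    }
  }
  return out;
}
```

A non-zero tolerance is one way to absorb run-to-run noise from non-deterministic agents; whether a fixed threshold is enough is an open question this sketch does not settle.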

Current state · 0.1.0 shipped

Tests the skills you actually ship.

commit.dyno.yaml

# real test from .agents/skills/commit/
name: commit-skill
harnesses: [claude-code, codex]
scenarios:
  - name: safe commit workflow
    prompt: "use the commit skill to commit README. don't push."
    assertions:
      - kind: skill.invoked
        skill: commit
      - kind: tool.called
        toolKind: shell
        matcher: { includes: git commit }
      - kind: tool.notCalled
        toolKind: shell
        matcher: { includes: git push }

dynobox run

dynobox 0.1.0

✓ safe commit workflow   claude-code   14.2s
    ✓ 3 of 3 assertions passed
✗ safe commit workflow   codex   11.8s
    ✗ skill.invoked(commit)   no read of SKILL.md
    ✓ tool.called(shell, git commit)
    ✓ tool.notCalled(shell, git push)
─────────────────────────────
1 passed   1 failed   26.0s

Real test from the repo. Dynobox tests its own skills with this exact pattern.

Status

Shipping today · OSS

dynobox@0.1.0 on npm. Claude Code and Codex harnesses. TS and YAML authoring. Tool, artifact, transcript, sequence, skill, and HTTP assertions. JSON reporter and GitHub Actions recipe for CI.

What's coming

The hosted runner. Disposable cloud sandboxes, parallel matrix at scale, regression history across model releases, team dashboards. Each run compounds into a regression baseline customers can't recreate.

npm i -g dynobox
v0.1.0 · Apache-2.0 · 4 packages · claude-code · codex · ts + yaml

Pre-customer. Product shipped. Design partners next.

Model

Consumption, not seats.

You pay for runs: sandbox-minutes against the harnesses you test. Pricing tracks value and scales with the customer's own skill library, with no per-seat friction for the team that's already paying for models.

Why me

Founder-market fit

AWS · prior

Built matrix test infra for ML data labeling. Same playbook, now pointed at agents instead of annotators.

Stripe · today

Hands-on with the same harnesses Dynobox tests, inside a system where every action moves real money. I feel the gap daily.

Solo · last 12 mo

agentutils.dev, diagram.bhk.dev, Forecast. Shipped end to end, solo.

The test infra for the next way software gets built.

$ npm install dynobox@latest

dynobox.xyz · github.com/dynobox · dynobox@bhk.dev
