Bucephalus

Benchmarking
Evaluation
Infrastructure

Control-plane infrastructure for long-running jobs requiring per-task isolation, such as benchmarks, evals, regression tests, and general experiments

December 2025 GitHub

This is an example of an "experiment": a single trial of the SWE-bench benchmark. Bucephalus is running locally on my computer, while the trial runs remotely in Modal sandboxes.

Bucephalus architecture: a host-side runner and supervisor, a backend trial container running the agent app and grading script, and a shared workspace backed by R2. A single trial: ① the runner starts a supervisor, ② the supervisor launches the agent in a trial container, ③ the agent writes to the shared workspace (R2), ④ the patch is extracted from the workspace, ⑤ the grader executes with the patch, and ⑥ the results persist to SQLite.
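The six steps of a single trial can be sketched in a few lines of Rust. This is only an illustration of the flow in the figure, not Bucephalus's actual code: `Workspace`, `TrialResult`, `run_trial`, and the `patch.diff` filename are all hypothetical, an in-memory map stands in for R2, and a `Vec` stands in for SQLite.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the shared workspace (R2 in the figure).
#[derive(Default)]
pub struct Workspace {
    files: HashMap<String, String>,
}

impl Workspace {
    pub fn write(&mut self, path: &str, contents: &str) {
        self.files.insert(path.to_string(), contents.to_string());
    }

    pub fn read(&self, path: &str) -> Option<&String> {
        self.files.get(path)
    }
}

/// Hypothetical record of one graded trial.
#[derive(Debug)]
pub struct TrialResult {
    pub case_id: String,
    pub passed: bool,
}

/// Runs one trial: agent -> workspace -> patch -> grader -> results.
/// `results` is a stand-in for the SQLite store.
pub fn run_trial(
    case_id: &str,
    agent: impl Fn(&str, &mut Workspace),
    grader: impl Fn(&str) -> bool,
    results: &mut Vec<TrialResult>,
) -> Result<(), String> {
    // Steps 1-3: the agent handles the case and writes to the workspace.
    let mut workspace = Workspace::default();
    agent(case_id, &mut workspace);

    // Step 4: extract the patch; a missing patch is an error, not a failed grade.
    let patch = workspace
        .read("patch.diff") // assumed filename, for illustration only
        .ok_or_else(|| format!("{}: no patch found in workspace", case_id))?;

    // Step 5: execute the grader with the patch.
    let passed = grader(patch.as_str());

    // Step 6: persist the result.
    results.push(TrialResult {
        case_id: case_id.to_string(),
        passed,
    });
    Ok(())
}
```

Treating a missing patch as an error rather than a failing grade keeps "the agent got it wrong" separate from "the plumbing is broken" in the results.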

Why I built this

My reason for starting this project is not the reason I keep working on it. Initially, I simply wanted to run my filesystem agent, Nova, against popular benchmarks. Today, I continue to build on Bucephalus because of several related problems emerging with LLM-generated content. They share one root cause: LLMs generate code in volumes that were previously impossible.

When your agent writes five thousand lines of code, how can you even begin to review it? An agent writing the unit tests for agent-delivered code is the equivalent of an orangutan signing off on a B-2 jet. Not to mention, we need to assume the B-2 was assembled by another orangutan following a markdown file, written by a guy who has never built a jet, that describes how one should be built. The same issue applies directly to creating your own benchmarks: where does ground truth reside? Even if you use a proven benchmark, how can you trust the results? Many open-source benchmark questions have been used as training data! Agents have been caught going through git history to cheat on high-profile benchmarks that people still cite.

This leads to why I like the term "experiments". An experiment implies some level of rigor before you are allowed to trust your results. While the scientific method won't materialize a universal ground-truth oracle and save the day, making sure your pipettes are clean, your variables are isolated, and your lab rat has not been trained on your maze is a good foundation.

Decisions

Rust

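Experiments need to be able to run for hours, possibly even days. Memory safety is crucial at that timescale, because a memory bug that takes down the control plane mid-run wastes far more than it would in a short-lived job.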

Stages, Ephemerals, Externals

The intuition for these primitives lies in how you categorize resources as you declare them in the YAML. It follows a simple flow:

Is the lifecycle of this resource NOT owned within the experiment itself? It's an External. Think of a persistent database or a third-party API: we neither materialize these nor take them down.

Is Bucephalus responsible for transporting data into and out of it? It's a Stage. Think of your agent application or a grader script. The agent app needs to be passed a case in order to handle it, and the grader script needs the result of the agent application in order to run. If for some reason your agent application called a grader script running on a third-party server, the grader would not be a Stage.

Finally, if the lifecycle is owned by Bucephalus but the transport is NOT, it's an Ephemeral. Think of an MCP server, a sidecar, or a mocked temporary database.
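The flow above boils down to two yes/no questions, which a small Rust sketch makes explicit. The `ResourceKind` enum and the `classify` helper are hypothetical names chosen for illustration; they are not taken from the Bucephalus codebase.

```rust
/// The three resource categories described in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKind {
    /// Lifecycle is not owned by the experiment (persistent DB, third-party API).
    External,
    /// Lifecycle and data transport are both owned by Bucephalus (agent app, grader).
    Stage,
    /// Lifecycle is owned by Bucephalus, transport is not (MCP server, sidecar).
    Ephemeral,
}

/// Hypothetical helper that answers the two questions in order.
pub fn classify(lifecycle_owned: bool, transport_owned: bool) -> ResourceKind {
    match (lifecycle_owned, transport_owned) {
        // If the experiment does not own the lifecycle, transport does not matter.
        (false, _) => ResourceKind::External,
        (true, true) => ResourceKind::Stage,
        (true, false) => ResourceKind::Ephemeral,
    }
}
```

Under this reading, the grader running on a third-party server falls into the first arm: its lifecycle is not owned by the experiment, so `classify(false, _)` yields `External` no matter how data reaches it.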

Declaring experiments in YAML

I like this approach because it ties back to the "laboratory" metaphor. YAML works well because you declare exactly what you need. If you have an agent run an experiment for you, you can review the YAML and get a decent understanding of what that experiment was made up of, which is a good proxy for rigor.

Fail as early as possible

I once let an experiment run for four hours in which every single trial failed because the grader was not receiving the agent's work. Needless to say, waiting a few extra seconds for pre-flight checks and a smoke run seemed like a pretty good deal, so I implemented both.
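A minimal sketch of the fail-early idea, assuming a pre-flight check is just a named function that returns an error message on failure. `Check`, `preflight`, and `smoke_run` are hypothetical names; the real checks in Bucephalus may look quite different.

```rust
/// A named pre-flight check (hypothetical shape, for illustration).
pub struct Check {
    pub name: &'static str,
    pub run: fn() -> Result<(), String>,
}

/// Runs the checks in order and stops at the first failure, so a broken
/// setup is reported in seconds instead of hours into the experiment.
pub fn preflight(checks: &[Check]) -> Result<(), String> {
    for check in checks {
        (check.run)()
            .map_err(|e| format!("pre-flight '{}' failed: {}", check.name, e))?;
    }
    Ok(())
}

/// Smoke run: execute one trial end to end and confirm the grader actually
/// received some work. `run_trial` is assumed to return whatever the grader
/// was handed (for example, the extracted patch).
pub fn smoke_run(run_trial: impl Fn() -> Result<String, String>) -> Result<(), String> {
    let grader_input = run_trial()?;
    if grader_input.trim().is_empty() {
        return Err("smoke run: grader received no work from the agent".to_string());
    }
    Ok(())
}
```

The smoke run targets exactly the four-hour failure described above: an empty grader input aborts the experiment before the full set of trials is launched.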