<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Physical-World on Scaling Trust Community</title><link>https://scalingtrust.org.uk/tags/physical-world/</link><description>Recent content in Physical-World on Scaling Trust Community</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 24 May 2026 10:00:00 +0100</lastBuildDate><atom:link href="https://scalingtrust.org.uk/tags/physical-world/index.xml" rel="self" type="application/rss+xml"/><item><title>Agentic Economic Zone</title><link>https://scalingtrust.org.uk/blog/agentic-economic-zone/</link><pubDate>Sun, 24 May 2026 10:00:00 +0100</pubDate><guid>https://scalingtrust.org.uk/blog/agentic-economic-zone/</guid><description>A physical space where autonomous AI companies trade, hire, and ship to each other.</description><content:encoded><![CDATA[<blockquote>
<p><em>An opinion piece by <strong>Nicola Greco</strong>, brainstormed as part of ARIA&rsquo;s Scaling Trust programme, in collaboration with Alex Obadia. Originally published on <a href="https://gensec-dev.nicolaos.org/post/agentic-economic-zone/" target="_blank" rel="noopener noreferrer">Nicola&rsquo;s blog</a>
 and reposted here for the community. It builds on the companion piece, <a href="/blog/physical-evals/">Physical Evals</a>
.</em></p>
</blockquote>
<hr>
<p>Imagine a small physical space in
central London. Inside, multiple autonomous companies — AI sales, AI
operations, AI manufacturing, AI logistics — operate in the real world.
Anything entering or leaving — goods, robots, customers — passes
through one of three controlled gates: a customs checkpoint for vetting
new robots, a post office for shipping, and a roboshop window where
humans can place orders. Call it an <strong>Agentic Economic Zone</strong> (AEZ).</p>
<p>Most concrete projects in agentic AI today live entirely on a screen —
agents that book travel, run pipelines, write code against a repository.
An AEZ is the smallest self-contained version of the physical-world
problem: a bounded zone where agentic systems must coordinate, contract,
hire, ship, and deliver to each other, with humans only at the boundary.</p>
<link rel=stylesheet href=diagram.css>
<figure class=fullwidth><img src=aez-generated-chatgpt.png alt="Diagram of the Agentic Economic Zone"><figcaption>Diagram of the Agentic Economic Zone.</figcaption></figure>
<h2 id="the-three-interfaces">The three interfaces</h2>
<p>An AEZ interacts with the outside world through three interfaces.</p>
<ul>
<li><strong>Roboshop windows.</strong> Public-facing storefronts where any human can
walk up, browse, and purchase. Sales, customer support, complaints,
and refunds are handled by the shop’s own AI. From the outside, a
roboshop looks like a small London shop window; from the inside, it’s
a fully autonomous business operating against a real demand signal.</li>
<li><strong>The post office.</strong> The single ingress and egress point for
packages. Pre-approved external providers (raw materials, sealed
consumables, replacement parts) can ship in. Outbound deliveries
destined for human customers leave through the same door. The
post office runs identity, manifest, and contamination checks; nothing
enters the zone unlabelled.</li>
<li><strong>Customs.</strong> Where new robots and entire new robocompanies are
introduced. A participant who wants to launch a new business inside
the AEZ submits a robot (or a fleet), its operating policy, its
safety envelope, and its proposed business model. Customs vets all
of this — and, on a monthly cadence, admits the next cohort.</li>
</ul>
<h2 id="a-taxonomy-of-autonomous-organisations">A taxonomy of autonomous organisations</h2>
<p>An AEZ assumes the kind of company most people haven’t tried to run yet:
one where every role in the org chart is filled by AI agents, although full autonomy is not a requirement. That’s
the far end of a spectrum:</p>
<div class="table-wrapper fullwidth"><table class=org-taxonomy><thead><tr><th></th><th>CEO</th><th>Workers</th><th>Sales</th><th>Examples</th><th>Feasibility today</th></tr></thead><tbody><tr><td>Human company</td><td>Human</td><td>Human</td><td>Human</td><td>A pizzeria</td><td>—</td></tr><tr><td>AI-sales</td><td>Human</td><td>Human</td><td>AI</td><td></td><td><span class="f f-high">high</span></td></tr><tr><td>AI-workers</td><td>Human</td><td>AI agents</td><td>Human</td><td></td><td><span class="f f-low">low</span></td></tr><tr><td>Automated company</td><td>Human</td><td>AI agents</td><td>AI agents</td><td></td><td><span class="f f-low">low</span></td></tr><tr><td>Human-assisted</td><td>AI agents</td><td>Human</td><td>AI agents</td><td>Vend</td><td><span class="f f-high">high</span></td></tr><tr class=row-aez><td><strong>Autonomous company</strong></td><td>AI agents</td><td>AI agents</td><td>AI agents</td><td></td><td><span class="f f-vlow">very low</span></td></tr></tbody></table></div>
<p>The AEZ’s tenants are <em>autonomous companies</em> — the bottom row. Today,
almost no one runs one; most agentic-AI deployments cover one or two
roles at most. The point of an AEZ is to make the bottom row possible
to try in a bounded physical setting.</p>
<h2 id="autonomous-robocompanies-inside">Autonomous robocompanies inside</h2>
<p>The interior of the zone is a market. Each robocompany is its own
entity with its own balance sheet, its own AI stack, and its own physical
footprint inside the zone. They contract with each other the same way
small businesses do.</p>
<p>A few example interactions:</p>
<ul>
<li>A <strong>boba-tea roboshop</strong> notices its machines need cleaning more often
than expected. It posts a request to the internal job board. A
<strong>cleaning robocompany</strong> bids, wins, dispatches cleaning
robopersonnel, and gets paid.</li>
<li>The same boba shop runs low on lids. It places an order with a
<strong>manufacturing robocompany</strong> in the next unit over. The order is
produced and handed off via a shared internal corridor.</li>
<li>A <strong>logistics robocompany</strong> moves bulk supplies from the post office
to whichever shop has the open dock that hour, and pushes finished
outbound packages back to the post office for pickup.</li>
</ul>
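<p>As a rough illustration of the job-board interaction above, here is a minimal Python sketch of how a posted job might be awarded. Everything in it is an assumption made for illustration: the names <code>Bid</code> and <code>award_job</code>, the lowest-price rule, and the alphabetical tie-break are hypothetical, and this piece does not specify how bids would actually be scored.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bid:
    """One robocompany's offer for a posted job (hypothetical structure)."""
    company: str
    price_gbp: float


def award_job(bids: list[Bid], budget_gbp: float) -> Optional[Bid]:
    """Return the cheapest bid within budget, or None if nothing qualifies.

    Assumed rule: lowest price wins; ties break alphabetically by company.
    """
    affordable = [b for b in bids if budget_gbp >= b.price_gbp]
    if not affordable:
        return None
    return min(affordable, key=lambda b: (b.price_gbp, b.company))


# Example: three cleaning robocompanies bid on the boba shop's request.
bids = [
    Bid("cleanbots", 14.0),
    Bid("sparkle-fleet", 11.5),
    Bid("mop-and-go", 19.0),
]
winner = award_job(bids, budget_gbp=15.0)
print(winner)
```

<p>A real zone would probably weigh more than price, such as past reliability or time to dispatch; the single-criterion rule here only keeps the sketch short.</p>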
<p>The zone’s behaviour is the sum of these small contracts. Some
robocompanies will succeed and grow; some will go out of business and
get evicted; new entrants come in through customs on the monthly cycle.</p>
<h2 id="an-aez-is-a-physical-eval">An AEZ is a physical eval</h2>
<p>This whole construction is, structurally, a <a href="/blog/physical-evals/">physical eval</a>
 at city-block
scale. The pattern is the same as the orchard from that post — only
larger and richer:</p>
<ul>
<li><strong>Environment.</strong> A bounded physical space with controlled boundaries.</li>
<li><strong>Action space.</strong> Anything a robocompany can do within its lease:
build, sell, hire, ship, evict.</li>
<li><strong>Sensors.</strong> Cameras, package scanners, transaction logs, customs
intake records, internal job-board telemetry.</li>
<li><strong>Primary metric.</strong> Per robocompany: revenue, contracts fulfilled,
customer satisfaction. Per zone: throughput, diversity of businesses,
number of contracts per day.</li>
<li><strong>Guardrails.</strong> Customs vetting at intake, the post-office
contamination check, kill switches and physical fire-suppression at
the building level, contractual interlocks between robocompanies.</li>
<li><strong>Adversarial robustness.</strong> A monthly customs cycle of admitting new
participants is a deliberate, slow, vetted way of letting external
actors <em>into</em> a public physical attack surface — which is exactly the
problem an AEZ exists to study.</li>
</ul>
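<p>The components listed above could be written down as a small record, so that a reviewer can see at a glance which parts of an eval are still undefined. The sketch below is one hypothetical way to do that in Python; the class name <code>PhysicalEvalSpec</code>, its field names, and the <code>missing_components</code> helper are assumptions for illustration, not a format proposed by this piece.</p>

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalEvalSpec:
    """Hypothetical record of the components a physical eval declares."""
    environment: str
    action_space: list[str]
    sensors: list[str]
    primary_metrics: list[str]
    guardrails: list[str] = field(default_factory=list)

    def missing_components(self) -> list[str]:
        """Names of components left empty, so a reviewer can spot gaps."""
        return [name for name, value in vars(self).items() if not value]


# A draft AEZ spec that has not declared its guardrails yet.
aez = PhysicalEvalSpec(
    environment="bounded physical space with controlled boundaries",
    action_space=["build", "sell", "hire", "ship", "evict"],
    sensors=["cameras", "package scanners", "transaction logs"],
    primary_metrics=["revenue", "contracts fulfilled", "customer satisfaction"],
)
print(aez.missing_components())
```

<p>Filling in the guardrails (customs vetting, the post-office contamination check, building-level kill switches) would make the helper return an empty list.</p>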
<p>Most physical evals measure how well one AI system handles one task.
An AEZ measures how well an entire small market of agents handles its
<em>own</em> coordination.</p>
<h3 id="evals-for-autonomous-organisations">Evals for autonomous organisations</h3>
<p>Each robocompany inside the zone is also, on its own, a physical eval —
scoped to one kind of business. Running an AEZ continuously is a way of
asking, in public and across many domains in parallel: <em>what kinds of
autonomous organisation can AI actually deliver today?</em> Can it run a
boba shop, day after day? Can it dispatch a cleaning service well
enough that the clients re-hire it? Can it manufacture small paper
caps without ruining the batch? Can it route warehouse logistics
across half a dozen tiny tenants without losing packages?</p>
<p>As more tenants come and go through customs each month, an
AEZ accumulates a leaderboard of <em>AI capability per organisation
type</em> — earned in the world, not asserted on a benchmark.</p>
<p>Sketched, it might look like this:</p>
<div class=aez-evals-board><div class=bbar><div class=dots><span></span><span></span><span></span></div><div class=url>evals.aez.london &#183; autonomous-organisation leaderboard</div></div><div class=page><div class=page-head><div class=page-title>Autonomous organisation evals</div><div class=page-sub>live &#183; week 22</div></div><div class=eval-rows><div class=er-row><div class=er-icon style=background:#c08a3e>B</div><div class=er-name><div class=er-title>Boba tea roboshop</div><div class=er-sub>customer-facing retail &#183; food prep</div></div><div class=er-score>82%</div><div class=er-bar><div class=er-bar-fill style=width:82%></div></div><div class=er-meta>12 tenants tried</div></div><div class=er-row><div class=er-icon style=background:#6a7a95>L</div><div class=er-name><div class=er-title>Logistics robocompany</div><div class=er-sub>internal warehouse &#183; B2B</div></div><div class=er-score>73%</div><div class=er-bar><div class=er-bar-fill style=width:73%></div></div><div class=er-meta>9 tenants tried</div></div><div class=er-row><div class=er-icon style=background:#0e7c6e>C</div><div class=er-name><div class=er-title>Cleaning robocompany</div><div class=er-sub>on-call dispatch &#183; B2B</div></div><div class=er-score>67%</div><div class=er-bar><div class=er-bar-fill style=width:67%></div></div><div class=er-meta>7 tenants tried</div></div><div class=er-row><div class=er-icon style=background:#1c3d8f>M</div><div class=er-name><div class=er-title>Paper-cap manufacturing</div><div class=er-sub>small fabrication &#183; B2B</div></div><div class=er-score>54%</div><div class=er-bar><div class=er-bar-fill style=width:54%></div></div><div class=er-meta>5 tenants tried</div></div><div class=er-row><div class=er-icon style=background:#a85432>P</div><div class=er-name><div class=er-title>Pizza roboshop</div><div class=er-sub>customer-facing &#183; longer prep cycle</div></div><div class=er-score>41%</div><div class=er-bar><div class=er-bar-fill 
style=width:41%></div></div><div class=er-meta>3 tenants tried</div></div><div class=er-row><div class=er-icon style=background:#8a5a8a>R</div><div class=er-name><div class=er-title>Pharmacy roboshop</div><div class=er-sub>regulated retail</div></div><div class="er-score pending">in eval</div><div class="er-bar pending"></div><div class=er-meta>1 tenant, week 2/12</div></div><div class=er-row><div class=er-icon style=background:#7a7a8a>+</div><div class=er-name><div class=er-title>On-call plumbing</div><div class=er-sub>mobile service &#183; out-of-zone</div></div><div class="er-score pending">not yet</div><div class="er-bar pending"></div><div class=er-meta>awaiting customs</div></div></div><div class=page-foot><span>updated 24 May &#183; new cohort intake 1 June</span>
<span>open data &#183; CC&#8209;BY</span></div></div></div>
<h2 id="why-a-physical-zone-and-not-a-simulator">Why a physical zone and not a simulator</h2>
<p>It’s tempting to argue that an AEZ should just be a simulator — cheaper,
faster, easier to reset. The same argument applies to physical evals
generally, and the same answer holds here: simulators model the parts
their authors thought to model. They might miss the parts that turn out to matter.</p>
<p>A few things you only learn in a real AEZ:</p>
<ul>
<li>How AI sales agents handle a confused, drunk, or hostile human at the
shop window at 11 p.m. on a Friday.</li>
<li>How a logistics robocompany routes around a broken corridor light, a
missing pallet, or a misdelivered package the post office didn’t
catch.</li>
<li>How fast a new robocompany can be vetted, set up, and integrated into
the internal market — and what fails when the cohort is too big.</li>
<li>How the zone behaves when one robocompany aggressively underprices
the others, or refuses to pay its cleaning bill, or starts forging
manifests at the post office.</li>
</ul>
<h2 id="role-of-humans">Role of humans</h2>
<p>An AEZ does not have to be fully autonomous. The degree of human
involvement is itself a design variable, and different operators will
set it differently.</p>
<p>At one extreme, a fully autonomous zone runs with no humans inside at
all — robots contract, trade, and deliver among themselves, and the
only human touch-points are at the external boundary: customers at the
roboshop window, providers shipping goods in. At the other extreme,
customs can admit humans into the zone as participants rather than just
observers, letting them take on roles that remain genuinely hard for
machines: tasks that require social judgment, physical dexterity in
unstructured environments, or the kind of creative problem-solving that
current systems handle poorly.</p>
<p>A partially human zone might work like a staffing marketplace: a
robocompany posts a task it cannot complete autonomously — debugging a
jammed mechanism, negotiating an edge-case contract, designing a new
product line — and a vetted human contractor enters through customs,
does the work, and leaves. The zone’s internal market clears the
payment; customs logs the interaction. The boundary stays intact, but
the zone can draw on human capability where it matters.</p>
<p>This spectrum matters for evaluation. A fully autonomous AEZ measures
whether AI systems can close the loop entirely. A mixed AEZ measures
something different: how well agentic systems and humans divide labour,
communicate intent, and hand off tasks in both directions. Both are
worth studying; they answer different questions about where the hard
limits of autonomous operation actually lie.</p>
<h2 id="open-questions">Open questions</h2>
<p>The AEZ is a design sketch, not a built thing. The interesting work is
in the parts the sketch hides:</p>
<ul>
<li><strong>The customs protocol.</strong> What’s the equivalent of a “code review”
for a physical robot operating policy? How do you decide what’s safe
enough to admit, on what evidence, and who carries the liability if
it isn’t?</li>
<li><strong>Inter-robocompany contracts.</strong> How are they enforced? Verbal
agreements between agents? Who arbitrates a dispute, and how?</li>
<li><strong>Eviction and failure.</strong> When a robocompany goes under, who cleans
up its physical footprint, sells its remaining stock, and
reallocates its lease?</li>
<li><strong>Information leakage.</strong> Robocompanies will observe each other’s
package volumes, customer queues, and waste output. How much
observation is part of the market, and how much is a privacy
violation that needs structural defences?</li>
<li><strong>External-provider risk.</strong> The post office is the only ingress for
physical materials. It’s also the most likely covert channel into
the zone. What does its vetting protocol need to look like?</li>
<li><strong>Sample size.</strong> What’s the smallest interesting AEZ? Five
robocompanies? Ten? Two? The cost of being too small (no market
dynamics emerge) is real; the cost of being too big (unmanageable,
unreviewable, unsafe) is also real.</li>
</ul>
<h2 id="get-in-touch">Get in touch</h2>
<p>If you’re thinking about agentic-AI deployments in physical spaces,
you’d consider hosting an AEZ in your building, or you’d just
like to argue with this sketch, DM
<a href="https://twitter.com/iamnotnicola" target="_blank" rel="noopener noreferrer">@iamnotnicola</a>
 on X.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>This was written by Nicola Greco with the support of AI. It was brainstormed as part of ARIA’s
<a href="https://aria.org.uk/opportunity-spaces/trust-everything-everywhere/scaling-trust/" target="_blank" rel="noopener noreferrer">Scaling Trust</a>

programme, in collaboration with Alex Obadia.</p>
]]></content:encoded></item><item><title>Physical Evals</title><link>https://scalingtrust.org.uk/blog/physical-evals/</link><pubDate>Sat, 23 May 2026 10:00:00 +0100</pubDate><guid>https://scalingtrust.org.uk/blog/physical-evals/</guid><description>Evaluations in the actual physical world.</description><content:encoded><![CDATA[<blockquote>
<p><em>An opinion piece by <strong>Nicola Greco</strong>, brainstormed as part of ARIA&rsquo;s Scaling Trust programme, in collaboration with Alex Obadia. Originally published on <a href="https://gensec-dev.nicolaos.org/post/physical-evals/" target="_blank" rel="noopener noreferrer">Nicola&rsquo;s blog</a>
 and reposted here for the community.</em></p>
</blockquote>
<hr>
<p>A physical evaluation tests an AI
system in the actual physical world — not a simulator, not a sandbox,
not a virtual environment dressed up as one. The point is to measure
how well AI can do real things in real places.</p>
<p>An orchard owner has birds eating the fruit. She sets up a few cameras
and a drone, brings them online safely so any agent can be invited to
take a slot on the system, and poses the question: who can keep the
birds off the fruit best? That setup <em>is</em> a physical eval. It has
cameras, a drone, an orchard, and birds, none of them simulated, and outcomes
are measured against what matters to the orchard.</p>
<figure class=fullwidth><div class=hero-fig><div class=hero-scene><svg viewBox="0 0 600 320" aria-label="Stylized isometric sketch of an orchard physical eval"><polygon points="80,250 520,250 450,100 150,100" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g stroke="#d2cab2" stroke-width=".4" fill="none" opacity=".7"><line x1="183" y1="120" x2="113" y2="220"/><line x1="217" y1="120" x2="160" y2="220"/><line x1="250" y1="120" x2="206" y2="220"/><line x1="283" y1="120" x2="252" y2="220"/><line x1="317" y1="120" x2="298" y2="220"/><line x1="350" y1="120" x2="344" y2="220"/><line x1="383" y1="120" x2="390" y2="220"/><line x1="416" y1="120" x2="436" y2="220"/><line x1="166" y1="130" x2="431" y2="130"/><line x1="153" y1="160" x2="444" y2="160"/><line x1="138" y1="190" x2="459" y2="190"/><line x1="120" y1="220" x2="476" y2="220"/></g><polygon points="112,151 240,140 240,168" fill="#3a3a78" opacity=".08"/><polygon points="498,151 370,140 370,168" fill="#3a3a78" opacity=".08"/><g><ellipse cx="200" cy="150" rx="22" ry="14" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><rect x="197" y="160" width="6" height="10" fill="#6b4a2a"/></g><g><ellipse cx="300" cy="150" rx="22" ry="14" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><rect x="297" y="160" width="6" height="10" fill="#6b4a2a"/></g><g><ellipse cx="400" cy="150" rx="22" ry="14" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><rect x="397" y="160" width="6" height="10" fill="#6b4a2a"/></g><g><ellipse cx="180" cy="215" rx="28" ry="18" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><rect x="177" y="227" width="6" height="12" fill="#6b4a2a"/></g><g><ellipse cx="310" cy="215" rx="28" ry="18" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><rect x="307" y="227" width="6" height="12" fill="#6b4a2a"/></g><g><ellipse cx="440" cy="215" rx="28" ry="18" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><rect x="437" y="227" width="6" height="12" fill="#6b4a2a"/></g><g><line x1="100" y1="240" x2="100" y2="155" 
stroke="#2a2a3a" stroke-width="1.5"/><rect x="90" y="145" width="22" height="12" rx="2" fill="#2a2a3a"/><circle cx="100" cy="151" r="2" fill="#f9f9f9"/></g><g><line x1="510" y1="240" x2="510" y2="155" stroke="#2a2a3a" stroke-width="1.5"/><rect x="498" y="145" width="22" height="12" rx="2" fill="#2a2a3a"/><circle cx="510" cy="151" r="2" fill="#f9f9f9"/></g><g class="drone-shadow"><ellipse cx="310" cy="200" rx="26" ry="6" fill="#000" opacity=".06"/></g><g transform="translate(310, 90)"><g class="drone-anim"><line x1="-26" y1="-10" x2="26" y2="10" stroke="#2a2a3a" stroke-width="2"/><line x1="-26" y1="10" x2="26" y2="-10" stroke="#2a2a3a" stroke-width="2"/><circle cx="-26" cy="-10" r="7" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1.2"/><circle cx="26" cy="-10" r="7" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1.2"/><circle cx="-26" cy="10" r="7" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1.2"/><circle cx="26" cy="10" r="7" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1.2"/><rect x="-9" y="-7" width="18" height="14" rx="2" fill="#2a2a3a"/><circle cx="0" cy="0" r="2" fill="#3a3a78"/></g></g><g stroke="#2a2a3a" stroke-width="1.2" fill="none" stroke-linecap="round"><g transform="translate(180, 140)"><g class="pigeon pigeon-1"><path d="M0 0l5 2 5-2"/></g></g><g transform="translate(196, 132)"><g class="pigeon pigeon-2"><path d="M0 0l5 2 5-2"/></g></g><g transform="translate(212, 142)"><g class="pigeon pigeon-3"><path d="M0 0l5 2 5-2"/></g></g></g><g stroke="#2a2a3a" stroke-width="1" fill="none" stroke-linecap="round" opacity=".6"><g transform="translate(450, 50)"><g class="bg-bird-1"><path d="M0 0l8 4 8-4"/></g></g><g transform="translate(510, 80)"><g class="bg-bird-2"><path d="M0 0l6 3 6-3"/></g></g><g transform="translate(70, 60)"><g class="bg-bird-3"><path d="M0 0l6 3 6-3"/></g></g></g></svg></div><div class=hero-leaderboard><div class=lb-head><span class=lb-title>leaderboard · week 12</span>
<span class=lb-live>live</span></div><table class=lb><thead><tr><th>#</th><th>operator</th><th>fruit saved</th><th>cost</th></tr></thead><tbody><tr><td>1</td><td>Owl-3B</td><td class=metric>94%</td><td class=cost>$0.18/h</td></tr><tr><td>2</td><td>Hummingbird v2</td><td class=metric>89%</td><td class=cost>$0.21/h</td></tr><tr><td>3</td><td>FlockSentinel</td><td class=metric>84%</td><td class=cost>$0.31/h</td></tr><tr><td>4</td><td>human baseline</td><td class=metric>71%</td><td class=cost>—</td></tr><tr><td>5</td><td>RoboScarecrow</td><td class=metric>63%</td><td class=cost>$0.09/h</td></tr></tbody></table></div></div><figcaption>A physical eval, sketched: an orchard with perimeter cameras and a deterrent drone, and a live leaderboard of operators competing to keep the birds off the fruit. Operators here are illustrative.</figcaption></figure>
<p>This piece develops a few threads:</p>
<ul>
<li><strong><a href="#what-physical-evals-are">What physical evals are.</a>
</strong> A definition, what they’re <em>not</em>
(simulators, sim-to-real benchmarks, curated demos), and the contrast between virtual environments as virtual gyms and physical evals as a <em>final exam</em>.</li>
<li><strong><a href="#anatomy-of-a-physical-eval">An anatomy.</a>
</strong> An initial draft of the components a physical eval needs, plus good practices.</li>
<li><strong><a href="#safety-for-physical-evals">Safety for physical evals.</a>
</strong> Letting anyone on the internet drive
real hardware is its own adversarial-security challenge.</li>
<li><strong><a href="#a-cambrian-explosion-of-physical-evals">An open movement for physical evals.</a>
</strong> Simple protocols, strong safety standards, and economical setups could lead to a Cambrian explosion of physical evals, where anyone can bring their physical challenge online.</li>
<li><strong><a href="#physical-evals-as-a-market">Physical evals as a market.</a>
</strong> One can set up an eval to delegate the selection of the right AI model/algorithm to competing participants.</li>
</ul>
<h2 id="what-physical-evals-are">What physical evals are</h2>
<p>A physical eval is an evaluation of an AI system carried out in the
actual physical world. The system being measured operates a real
environment (fruit trees, a wet bench, a warehouse cell, a field
plot) through sensors and actuators connected to the internet, so
any agent can take a slot, attempt the task, and submit a score.</p>
<p>In principle, most problems in the physical world could be turned into a challenge that surfaces the state of the art of AI on that problem. In this way, physical evals can act as a forcing function to saturate evaluations in the real world. ⊕</p>
<input type=checkbox id=mn-1781015997702168089 class=margin-toggle>
<p>Saturate as in: take
the measurable outcome to its ceiling. The eval defines the ceiling;
the participants find out how close they can get.</p>
<h3 id="what-theyre-not">What they’re not</h3>
<ul>
<li><strong>Not simulators.</strong> A simulator models reality. A physical eval
<em>is</em> reality. Testing systems in the real world avoids running into simulation edge cases.</li>
<li><strong>Not sim-to-real benchmarks.</strong> Sim-to-real measures how well a
policy trained in a simulator transfers to a single in-house robot
in a lab. A physical eval scores any participant’s system on the real task, however it was trained.</li>
<li><strong>Not curated demos.</strong> In a physical eval, the environment operator and the participants are distinct parties, and the participants compete with each other. As a result, physical evals should be better than demos at showcasing the best technology for a specific task.</li>
</ul>
<h3 id="the-final-exam">The final exam</h3>
<p>A physical eval isn’t where models get trained; it’s where they get tested.</p>
<p>A virtual simulation is like a <em>virtual gym</em>. Due to the high cost of interacting with the real world, it is likely that all the learning — model fitting, policy iteration, RL
rollouts, fine-tuning, ablations, sweeps — will happen somewhere
cheaper: a simulator, a virtual environment, a closed in-house
testbed. ⊕</p>
<input type=checkbox id=mn-1781015997702342860 class=margin-toggle>
<p>Some of the gyms people are using today:
<a href="https://gymnasium.farama.org/" target="_blank" rel="noopener noreferrer">OpenAI Gym / Gymnasium</a>
,
<a href="https://mujoco.org/" target="_blank" rel="noopener noreferrer">MuJoCo</a>
, <a href="https://pybullet.org/" target="_blank" rel="noopener noreferrer">PyBullet</a>
,
<a href="https://developer.nvidia.com/isaac-sim" target="_blank" rel="noopener noreferrer">Isaac Gym / Isaac Sim</a>
,
<a href="https://github.com/google-deepmind/lab" target="_blank" rel="noopener noreferrer">DeepMind Lab</a>
,
<a href="https://aihabitat.org/" target="_blank" rel="noopener noreferrer">Habitat</a>
,
<a href="https://ai2thor.allenai.org/" target="_blank" rel="noopener noreferrer">AI2-THOR</a>
,
<a href="https://carla.org/" target="_blank" rel="noopener noreferrer">CARLA</a>
, <a href="https://microsoft.github.io/AirSim/" target="_blank" rel="noopener noreferrer">AirSim</a>
,
<a href="https://genesis-embodied-ai.github.io/" target="_blank" rel="noopener noreferrer">Genesis</a>
.
Participants are free to use whichever virtual world or <em>gym</em> they
like — there is a whole landscape of simulators specifically built for
this.</p>
<p>By contrast, a physical eval is like a <em>final exam</em>. Build and train wherever,
gather data however, iterate as much as needed — and then submit to the
physical eval to see how the work holds up against the real world.</p>
<p>The gap between a virtual world and the
physical one, the
<a href="https://en.wikipedia.org/wiki/Sim-to-real_transfer" target="_blank" rel="noopener noreferrer"><em>sim-to-real</em> gap</a>
, contains everything the simulator didn’t model. Wind that doesn’t
blow the way it does in the sim. Lighting the renderer didn’t predict.
Mechanical wear, sensor noise, calibration drift, the way birds
<em>actually</em> respond to a drone rather than the way an idealised model
of a bird does. The physical eval catches all of this because it <em>is</em> run in the physical world.</p>
<h3 id="examples">Examples</h3>
<p>The cards below are sketches of what a
small handful of physical evals could look like across very different
domains.</p>
<link rel=stylesheet href=cards.css>
<div class=eval-gallery><input type=radio id=et-1 name=eg checked>
<input type=radio id=et-2 name=eg>
<input type=radio id=et-3 name=eg>
<input type=radio id=et-4 name=eg>
<input type=radio id=et-5 name=eg>
<input type=radio id=et-6 name=eg><div class=tab-labels><label for=et-1>Orchard</label>
<label for=et-2>Wet lab</label>
<label for=et-3>Vertical farm</label>
<label for=et-4>Pick-and-pack</label>
<label for=et-5>Sprayer drone</label>
<label for=et-6>Gel electrophoresis</label></div><div class=tab-panes><div class="pane pane-1"><div class=pane-header><div class=pane-icon><svg viewBox="0 0 32 32" fill="none" stroke="currentcolor" stroke-width="1.5" stroke-linecap="round"><ellipse cx="16" cy="12" rx="9" ry="8"/><line x1="16" y1="20" x2="16" y2="29"/></svg></div><h4 class=pane-title>Orchard pest defence</h4></div><p class=pane-blurb>Cameras and a drone over a few rows of trees. Keep the wildlife out without poisoning the orchard or annoying the neighbours.</p><div class=pane-viz><svg viewBox="0 0 280 160" aria-label="Isometric orchard with camera pole, drone, and bird silhouettes"><polygon points="20,140 260,140 235,60 45,60" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g stroke="#d2cab2" stroke-width=".4" fill="none" opacity=".7"><line x1="83" y1="68" x2="53" y2="132"/><line x1="118" y1="68" x2="98" y2="132"/><line x1="153" y1="68" x2="143" y2="132"/><line x1="188" y1="68" x2="188" y2="132"/><line x1="223" y1="68" x2="233" y2="132"/><line x1="55" y1="80" x2="225" y2="80"/><line x1="44" y1="100" x2="236" y2="100"/><line x1="32" y1="120" x2="248" y2="120"/></g><polygon points="28,105 110,97 110,113" fill="#3a3a78" opacity=".09"/><g><rect x="125" y="87" width="4" height="8" fill="#6b4a2a"/><ellipse cx="127" cy="84" rx="14" ry="9" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/></g><g><rect x="175" y="87" width="4" height="8" fill="#6b4a2a"/><ellipse cx="177" cy="84" rx="14" ry="9" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/></g><g><rect x="94" y="117" width="5" height="10" fill="#6b4a2a"/><ellipse cx="97" cy="113" rx="18" ry="11" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/></g><g><rect x="200" y="117" width="5" height="10" fill="#6b4a2a"/><ellipse cx="203" cy="113" rx="18" ry="11" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/></g><g><line x1="28" y1="138" x2="28" y2="97" stroke="#2a2a3a" stroke-width="1.4"/><rect x="21" y="90" width="16" height="9" rx="1.5" 
fill="#2a2a3a"/><circle cx="28" cy="94" r="1.8" fill="#f9f9f9"/></g><g transform="translate(152,58)"><line x1="-18" y1="-7" x2="18" y2="7" stroke="#2a2a3a" stroke-width="1.5"/><line x1="-18" y1="7" x2="18" y2="-7" stroke="#2a2a3a" stroke-width="1.5"/><circle cx="-18" cy="-7" r="5" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="18" cy="-7" r="5" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="-18" cy="7" r="5" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="18" cy="7" r="5" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><rect x="-7" y="-5" width="14" height="10" rx="1.5" fill="#2a2a3a"/><circle cx="0" cy="0" r="1.8" fill="#3a3a78"/></g><g stroke="#3a3a78" stroke-width=".8" fill="none" opacity=".5"><path d="M148 48q4-5 8 0"/><path d="M144 44q8-7 16 0"/></g><g stroke="#2a2a3a" stroke-width="1.1" fill="none" stroke-linecap="round"><path d="M120 78l4 2 4-2"/><path d="M131 75l3 1.5 3-1.5"/></g></svg></div><dl class=pane-spec><dt>environment</dt><dd>~1 acre of fruit trees, outdoor, weather-exposed.</dd><dt>action space</dt><dd>Fly the drone, emit deterrent sound, trigger light pulse, dispense small bait.</dd><dt>sensors</dt><dd>Fixed perimeter cameras, drone camera, microphone, weather station.</dd><dt>primary metric</dt><dd>Fruit lost to wildlife per week.</dd><dt>secondaries</dt><dd>Drone flight-time, energy, chemical use, neighbour-complaint count.</dd><dt>guardrails</dt><dd>Geofenced drone, quiet-hour windows, no-spray buffer near road, fail-safe tether.</dd></dl></div><div class="pane pane-2"><div class=pane-header><div class=pane-icon><svg viewBox="0 0 32 32" fill="none" stroke="currentcolor" stroke-width="1.5" stroke-linejoin="round" stroke-linecap="round"><path d="M12 3v8L5 26c-.5 1.5.5 3 2 3h18c1.5.0 2.5-1.5 2-3L20 11V3"/><line x1="10" y1="3" x2="22" y2="3"/><line x1="9" y1="20" x2="23" y2="20" stroke-dasharray="2,2"/></svg></div><h4 class=pane-title>pH adjustment bench</h4></div><p class=pane-blurb>A beaker on a 
magnetic stirrer, a pH probe, and two motorised dispensers. Hit a target pH using as little reagent as possible.</p><div class=pane-viz><svg viewBox="0 0 280 160" aria-label="Bench with beaker on stirrer, pH probe, and two dispensers"><polygon points="30,140 250,140 270,100 50,100" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g stroke="#d2cab2" stroke-width=".35" fill="none" opacity=".5"><line x1="50" y1="100" x2="30" y2="140"/><line x1="110" y1="100" x2="90" y2="140"/><line x1="170" y1="100" x2="150" y2="140"/><line x1="230" y1="100" x2="210" y2="140"/><line x1="50" y1="100" x2="270" y2="100"/><line x1="40" y1="120" x2="260" y2="120"/></g><polygon points="95,138 140,138 148,118 103,118" fill="#2a2a3a" stroke="#1a1a28" stroke-width=".6"/><polygon points="95,138 100,128 103,118 98,128" fill="#1a1a28" stroke="#111" stroke-width=".4"/><ellipse cx="121" cy="118" rx="18" ry="4" fill="#3a3a4a" stroke="#2a2a3a" stroke-width=".5"/><ellipse cx="121" cy="117" rx="6" ry="2" fill="#5a5a6a" opacity=".7"/><polygon points="105,118 137,118 133,72 109,72" fill="#dde8f4" stroke="#6a90a8" stroke-width=".8" opacity=".88"/><polygon points="105,118 109,118 109,72 105,90" fill="#c8daea" stroke="#6a90a8" stroke-width=".5" opacity=".6"/><ellipse cx="121" cy="72" rx="14" ry="4" fill="#dde8f4" stroke="#6a90a8" stroke-width=".7" opacity=".9"/><polygon points="107,118 135,118 132,88 110,88" fill="#b8d8f0" opacity=".4"/><ellipse cx="121" cy="88" rx="12" ry="3" fill="#a8cce8" opacity=".35"/><line x1="128" y1="55" x2="124" y2="105" stroke="#2a2a3a" stroke-width="1.4"/><rect x="122" y="50" width="10" height="14" rx="2" fill="#2a2a3a" stroke="#6a90a8" stroke-width=".5"/><rect x="123" y="52" width="8" height="8" rx="1" fill="#0d1a2a"/><circle cx="127" cy="56" r="1.5" fill="#4aaa70" opacity=".9"/><rect x="50" y="85" width="10" height="35" rx="2" fill="#3a3a78" stroke="#2a2a3a" stroke-width=".5"/><rect x="48" y="80" width="14" height="8" rx="2" fill="#2a2a3a" stroke="#6a90a8" 
stroke-width=".4"/><line x1="60" y1="100" x2="100" y2="112" stroke="#4a6ea5" stroke-width="1" stroke-dasharray="3,2" opacity=".7"/><text x="52" y="78" font-size="6" fill="#6a90a8" font-family="monospace">H⁺</text><rect x="205" y="92" width="10" height="28" rx="2" fill="#3a7858" stroke="#2a2a3a" stroke-width=".5"/><rect x="203" y="87" width="14" height="8" rx="2" fill="#2a2a3a" stroke="#6a90a8" stroke-width=".4"/><line x1="205" y1="102" x2="148" y2="112" stroke="#4aaa70" stroke-width="1" stroke-dasharray="3,2" opacity=".7"/><text x="205" y="85" font-size="6" fill="#4aaa70" font-family="monospace">OH⁻</text><rect x="168" y="100" width="28" height="16" rx="2" fill="#0d1a2a" stroke="#6a90a8" stroke-width=".5"/><text x="172" y="111" font-size="7" fill="#4aff88" font-family="monospace">pH 6.82</text></svg></div><dl class=pane-spec><dt>environment</dt><dd>Benchtop — one beaker on a stirrer, two motorised dispensers (acid, base), pH probe.</dd><dt>action space</dt><dd>Dispense a chosen volume from either reservoir; read current pH.</dd><dt>sensors</dt><dd>pH electrode, balance, camera.</dd><dt>primary metric</dt><dd>Absolute pH error from target at submission.</dd><dt>secondaries</dt><dd>Total volume dispensed, time to endpoint.</dd><dt>guardrails</dt><dd>Per-session dispense quota, pH range limits (3–11), auto-stop on quota exhaustion.</dd></dl></div><div class="pane pane-3"><div class=pane-header><div class=pane-icon><svg viewBox="0 0 32 32" fill="none" stroke="currentcolor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"><path d="M16 28V16"/><path d="M16 18C8 18 5 11 5 6c6 0 11 4 11 10"/><path d="M16 18c8 0 11-7 11-12-6 0-11 4-11 10"/></svg></div><h4 class=pane-title>Indoor vertical farm</h4></div><p class=pane-blurb>A closed grow rack — lights, pumps, nutrient dosing, cameras. 
Pull more food out of every kilowatt.</p><div class=pane-viz><svg viewBox="0 0 280 160" aria-label="Isometric indoor vertical farm rack with LED bars and plant rosettes"><defs><filter id="led-glow" x="-20%" y="-60%" width="140%" height="280%"><feGaussianBlur stdDeviation="2.5" result="b"/><feMerge><feMergeNode in="b"/><feMergeNode in="SourceGraphic"/></feMerge></filter></defs><ellipse cx="126" cy="148" rx="60" ry="8" fill="#2a2a3a" opacity=".1"/><polygon points="88,106 95,110 95,22 88,18" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".5"/><polygon points="86,104 88,106 88,18 86,16" fill="#3d3d50" stroke="#2a2a3a" stroke-width=".5"/><polygon points="86,16 88,18 95,22 93,20" fill="#f9f9f9" opacity=".7"/><polygon points="164,144 171,140 171,52 164,56" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".5"/><polygon points="162,142 164,144 164,56 162,54" fill="#3d3d50" stroke="#2a2a3a" stroke-width=".5"/><polygon points="162,54 164,56 171,52 169,50" fill="#f9f9f9" opacity=".7"/><polygon points="66,117 72,113 72,25 66,29" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".5"/><polygon points="142,155 149,151 149,63 142,67" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".5"/><polygon points="88,106 164,144 164,141 88,103" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".3"/><polygon points="66,117 142,155 142,152 66,114" fill="#1e1e2a" stroke="#2a2a3a" stroke-width=".3"/><polygon points="88,62 164,100 164,97 88,59" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".3"/><polygon points="66,73 142,111 142,108 66,70" fill="#1e1e2a" stroke="#2a2a3a" stroke-width=".3"/><polygon points="88,18 164,56 164,53 88,15" fill="#2a2a3a" stroke="#2a2a3a" stroke-width=".3"/><polygon points="66,29 142,67 142,64 66,26" fill="#1e1e2a" stroke="#2a2a3a" stroke-width=".3"/><polygon points="88,62 164,100 142,111 66,73" fill="#ebe4d0" stroke="#bbb5a0" stroke-width="1"/><polygon points="88,62 164,100 164,103 88,65" fill="#d4cdb8" stroke="#bbb5a0" stroke-width=".5"/><polygon points="66,73 88,62 88,65 66,76" 
fill="#cdc6b2" stroke="#bbb5a0" stroke-width=".5"/><polygon points="88,18 164,56 142,67 66,29" fill="#ebe4d0" stroke="#bbb5a0" stroke-width="1"/><polygon points="88,18 164,56 164,59 88,21" fill="#d4cdb8" stroke="#bbb5a0" stroke-width=".5"/><polygon points="66,29 88,18 88,21 66,32" fill="#cdc6b2" stroke="#bbb5a0" stroke-width=".5"/><polygon points="82,67 158,105 157,113 81,75" fill="#c4c" opacity=".07"/><polygon points="79,62 155,100 155,103 79,65" fill="#c855d0" stroke="#8a25a0" stroke-width=".8" filter="url(#led-glow)" opacity=".96"/><circle cx="96" cy="68" r="1.5" fill="#f088ff" opacity=".9"/><circle cx="112" cy="76" r="1.5" fill="#f088ff" opacity=".9"/><circle cx="128" cy="84" r="1.5" fill="#f088ff" opacity=".9"/><circle cx="144" cy="92" r="1.5" fill="#f088ff" opacity=".9"/><polygon points="82,22 158,60 157,68 81,30" fill="#c4c" opacity=".07"/><polygon points="79,18 155,56 155,59 79,21" fill="#c855d0" stroke="#8a25a0" stroke-width=".8" filter="url(#led-glow)" opacity=".96"/><circle cx="96" cy="24" r="1.5" fill="#f088ff" opacity=".9"/><circle cx="112" cy="32" r="1.5" fill="#f088ff" opacity=".9"/><circle cx="128" cy="40" r="1.5" fill="#f088ff" opacity=".9"/><circle cx="144" cy="48" r="1.5" fill="#f088ff" opacity=".9"/><g transform="translate(92,75)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(107,83)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(123,90)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" 
stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(138,98)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(92,31)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(107,39)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(123,46)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><g transform="translate(138,54)"><ellipse cx="0" cy="0" rx="9" ry="4" fill="#2a2a3a" opacity=".1"/><ellipse cx="0" cy="-3" rx="8" ry="4" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".8"/><ellipse cx="0" cy="-4" rx="5" ry="2.5" fill="#8aaa6a" stroke="#5a7a44" stroke-width=".5"/></g><line x1="85" y1="18" x2="85" y2="106" stroke="#3a3a78" stroke-width="2.5" stroke-linecap="round"/><line x1="85.6" y1="18" x2="85.6" y2="106" stroke="#6a6ab8" stroke-width=".9" stroke-linecap="round" opacity=".55"/><circle cx="85" cy="84" r="2.5" fill="#3a3a78" stroke="#22224a" stroke-width=".5"/><line x1="85" y1="84" x2="92" y2="87" stroke="#3a3a78" stroke-width="1" stroke-linecap="round"/><circle cx="85" cy="40" r="2.5" fill="#3a3a78" stroke="#22224a" 
stroke-width=".5"/><line x1="85" y1="40" x2="92" y2="43" stroke="#3a3a78" stroke-width="1" stroke-linecap="round"/></svg></div><dl class=pane-spec><dt>environment</dt><dd>One 2-tier rack, ~3 m², climate-isolated.</dd><dt>action space</dt><dd>Light schedule + intensity, nutrient mix, irrigation timing, harvest decision.</dd><dt>sensors</dt><dd>Cameras (overhead + side), EC / pH probes, water-flow meters, kWh meter, scale at harvest.</dd><dt>primary metric</dt><dd>Grams of edible biomass per kWh per cycle.</dd><dt>secondaries</dt><dd>Cycle time, water used, nutrient cost, reject rate.</dd><dt>guardrails</dt><dd>Nutrient-concentration ceiling, water-overflow drain, light-burn cutoff, max-cycle length.</dd></dl></div><div class="pane pane-4"><div class=pane-header><div class=pane-icon><svg viewBox="0 0 32 32" fill="none" stroke="currentcolor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"><rect x="3" y="25" width="26" height="4" rx="1"/><circle cx="8" cy="25" r="2" fill="currentcolor"/><line x1="8" y1="25" x2="8" y2="14"/><circle cx="8" cy="14" r="2" fill="currentcolor"/><line x1="8" y1="14" x2="20" y2="9"/><circle cx="20" cy="9" r="2" fill="currentcolor"/><path d="M20 9l5 2-1 4"/></svg></div><h4 class=pane-title>Warehouse pick-and-pack cell</h4></div><p class=pane-blurb>An off-the-shelf robot arm in front of mixed shelves and a conveyor. 
The boring industrial baseline — still worth opening up.</p><div class=pane-viz><svg viewBox="0 0 280 160" aria-label="Isometric warehouse pick-and-pack robot cell"><polygon points="20,148 260,148 235,68 45,68" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g stroke="#d2cab2" stroke-width=".4" fill="none" opacity=".7"><line x1="95" y1="75" x2="65" y2="140"/><line x1="135" y1="75" x2="115" y2="140"/><line x1="175" y1="75" x2="165" y2="140"/><line x1="215" y1="75" x2="215" y2="140"/><line x1="52" y1="90" x2="228" y2="90"/><line x1="38" y1="110" x2="242" y2="110"/><line x1="26" y1="130" x2="255" y2="130"/></g><rect x="48" y="72" width="3" height="46" fill="#2a2a3a"/><rect x="56" y="80" width="3" height="48" fill="#2a2a3a"/><rect x="96" y="80" width="3" height="48" fill="#2a2a3a"/><polygon points="48,78 99,78 99,82 48,82" fill="#d8d0c4" stroke="#bbb5a0" stroke-width=".5"/><polygon points="48,78 56,84 56,88 48,82" fill="#c0b8aa" stroke="#bbb5a0" stroke-width=".5"/><polygon points="56,84 99,84 99,88 56,88" fill="#d8d0c4" stroke="#bbb5a0" stroke-width=".5"/><polygon points="48,98 99,98 99,101 48,101" fill="#d8d0c4" stroke="#bbb5a0" stroke-width=".5"/><polygon points="48,98 56,104 56,107 48,101" fill="#c0b8aa" stroke="#bbb5a0" stroke-width=".5"/><polygon points="56,104 99,104 99,107 56,107" fill="#d8d0c4" stroke="#bbb5a0" stroke-width=".5"/><g><polygon points="60,72 73,72 73,78 60,78" fill="#c4a882" stroke="#a08860" stroke-width=".5"/><polygon points="60,72 64,70 77,70 73,72" fill="#d4b892" stroke="#a08860" stroke-width=".5"/><polygon points="73,72 77,70 77,78 73,78" fill="#b09870" stroke="#a08860" stroke-width=".5"/></g><g><polygon points="78,72 88,72 88,78 78,78" fill="#d8d0c4" stroke="#a08860" stroke-width=".5"/><polygon points="78,72 82,70 92,70 88,72" fill="#e8e0d4" stroke="#a08860" stroke-width=".5"/><polygon points="88,72 92,70 92,78 88,78" fill="#c8c0b4" stroke="#a08860" stroke-width=".5"/></g><g><polygon points="62,92 74,92 74,98 62,98" fill="#d8d0c4" 
stroke="#a08860" stroke-width=".5"/><polygon points="62,92 66,90 78,90 74,92" fill="#e8e0d4" stroke="#a08860" stroke-width=".5"/><polygon points="74,92 78,90 78,98 74,98" fill="#c8c0b4" stroke="#a08860" stroke-width=".5"/></g><g><polygon points="79,92 90,92 90,98 79,98" fill="#c4a882" stroke="#a08860" stroke-width=".5"/><polygon points="79,92 83,90 94,90 90,92" fill="#d4b892" stroke="#a08860" stroke-width=".5"/><polygon points="90,92 94,90 94,98 90,98" fill="#b09870" stroke="#a08860" stroke-width=".5"/></g><polygon points="130,120 152,120 152,130 130,130" fill="#2a2a3a" stroke="#1a1a2a" stroke-width=".6"/><polygon points="130,120 136,116 158,116 152,120" fill="#3a3a4a" stroke="#1a1a2a" stroke-width=".6"/><polygon points="152,120 158,116 158,130 152,130" fill="#222232" stroke="#1a1a2a" stroke-width=".6"/><ellipse cx="141" cy="116" rx="9" ry="4" fill="#3a3a78" stroke="#2a2a58" stroke-width=".8"/><rect x="137" y="93" width="8" height="24" rx="2" fill="#2a2a3a" stroke="#1a1a2a" stroke-width=".6" transform="rotate(-10,141,116)"/><circle cx="136" cy="91" r="5" fill="#3a3a78" stroke="#2a2a58" stroke-width=".8"/><rect x="127" y="76" width="7" height="18" rx="2" fill="#2a2a3a" stroke="#1a1a2a" stroke-width=".6" transform="rotate(15,136,91)"/><circle cx="124" cy="78" r="3.5" fill="#3a3a78" stroke="#2a2a58" stroke-width=".7"/><rect x="119" y="73" width="3" height="8" rx="1" fill="#2a2a3a" stroke="#1a1a2a" stroke-width=".5" transform="rotate(-15,121,77)"/><rect x="124" y="73" width="3" height="8" rx="1" fill="#2a2a3a" stroke="#1a1a2a" stroke-width=".5" transform="rotate(15,126,77)"/><polygon points="168,130 258,130 258,148 168,148" fill="#4a4a5a" stroke="#3a3a4a" stroke-width=".6"/><polygon points="155,122 168,130 258,130 245,122" fill="#5a5a6a" stroke="#3a3a4a" stroke-width=".6"/><g stroke="#6a6a7a" stroke-width=".5" opacity=".8"><line x1="180" y1="130" x2="180" y2="148"/><line x1="193" y1="130" x2="193" y2="148"/><line x1="206" y1="130" x2="206" y2="148"/><line x1="219" 
y1="130" x2="219" y2="148"/><line x1="232" y1="130" x2="232" y2="148"/><line x1="245" y1="130" x2="245" y2="148"/></g><g><polygon points="195,118 210,118 210,127 195,127" fill="#c4a882" stroke="#a08860" stroke-width=".5"/><polygon points="195,118 199,115 214,115 210,118" fill="#d4b892" stroke="#a08860" stroke-width=".5"/><polygon points="210,118 214,115 214,127 210,127" fill="#b09870" stroke="#a08860" stroke-width=".5"/></g><rect x="253" y="80" width="5" height="55" rx="1" fill="#2a2a3a" stroke="#1a1a2a" stroke-width=".5"/><rect x="254" y="85" width="3" height="40" rx=".5" fill="#3a3a78" opacity=".7"/><circle cx="255.5" cy="90" r="1.2" fill="#f9f9f9" opacity=".8"/><circle cx="255.5" cy="100" r="1.2" fill="#f9f9f9" opacity=".8"/><circle cx="255.5" cy="110" r="1.2" fill="#f9f9f9" opacity=".8"/><circle cx="255.5" cy="120" r="1.2" fill="#f9f9f9" opacity=".8"/><g stroke="#3a3a78" stroke-width=".4" opacity=".3" stroke-dasharray="2,3"><line x1="254" y1="90" x2="168" y2="110"/><line x1="254" y1="100" x2="168" y2="120"/><line x1="254" y1="110" x2="168" y2="130"/></g></svg></div><dl class=pane-spec><dt>environment</dt><dd>Fenced robot cell, ~9 m², fixed lighting.</dd><dt>action space</dt><dd>Arm motion, grip force, scan, label, place on conveyor.</dd><dt>sensors</dt><dd>Wrist camera, overhead camera, barcode scanner, weight pad, joint torques.</dd><dt>primary metric</dt><dd>Correctly packed orders per hour.</dd><dt>secondaries</dt><dd>Mis-pick rate, damage rate, energy per pick.</dd><dt>guardrails</dt><dd>Safety fence + light curtain, e-stop, force-limited arm, max-velocity cap.</dd></dl></div><div class="pane pane-5"><div class=pane-header><div class=pane-icon><svg viewBox="0 0 32 32" fill="none" stroke="currentcolor" stroke-width="1.5"><line x1="6" y1="6" x2="26" y2="26"/><line x1="6" y1="26" x2="26" y2="6"/><circle cx="6" cy="6" r="3"/><circle cx="26" cy="6" r="3"/><circle cx="6" cy="26" r="3"/><circle cx="26" cy="26" r="3"/><rect x="12" y="12" width="8" height="8" rx="1" 
fill="currentcolor"/></svg></div><h4 class=pane-title>Outdoor sprayer drone</h4></div><p class=pane-blurb>A tank-equipped drone with a multispectral camera, working a real field. The hardest adversarial-robustness story of the bunch.</p><div class=pane-viz><svg viewBox="0 0 280 160" aria-label="Isometric field with sprayer drone and geofence boundary"><polygon points="12,148 268,148 243,50 37,50" fill="none" stroke="#8b1a1a" stroke-width="1" stroke-dasharray="4,3"/><polygon points="22,142 258,142 235,62 45,62" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g fill="#7a9a5c" stroke="none" opacity=".85"><polygon points="45,62 91,62 85,82 39,82"/></g><g fill="#5a7a44" stroke="none" opacity=".85"><polygon points="91,62 137,62 131,82 85,82"/></g><g fill="#7a9a5c" stroke="none" opacity=".85"><polygon points="137,62 183,62 177,82 131,82"/></g><g fill="#5a7a44" stroke="none" opacity=".85"><polygon points="183,62 229,62 223,82 177,82"/></g><g fill="#5a7a44" stroke="none" opacity=".75"><polygon points="39,82 85,82 79,102 33,102"/></g><g fill="#7a9a5c" stroke="none" opacity=".75"><polygon points="85,82 131,82 125,102 79,102"/></g><g fill="#5a7a44" stroke="none" opacity=".75"><polygon points="131,82 177,82 171,102 125,102"/></g><g fill="#7a9a5c" stroke="none" opacity=".75"><polygon points="177,82 223,82 217,102 171,102"/></g><g fill="#7a9a5c" stroke="none" opacity=".65"><polygon points="33,102 79,102 73,122 27,122"/></g><g fill="#5a7a44" stroke="none" opacity=".65"><polygon points="79,102 125,102 119,122 73,122"/></g><g fill="#7a9a5c" stroke="none" opacity=".65"><polygon points="125,102 171,102 165,122 119,122"/></g><g fill="#5a7a44" stroke="none" opacity=".65"><polygon points="171,102 217,102 211,122 165,122"/></g><g fill="#5a7a44" stroke="none" opacity=".55"><polygon points="27,122 73,122 67,142 21,142"/></g><g fill="#7a9a5c" stroke="none" opacity=".55"><polygon points="73,122 119,122 113,142 67,142"/></g><g fill="#5a7a44" stroke="none" opacity=".55"><polygon 
points="119,122 165,122 159,142 113,142"/></g><g fill="#7a9a5c" stroke="none" opacity=".55"><polygon points="165,122 211,122 205,142 159,142"/></g><polygon points="140,75 148,142 132,142" fill="#a8c4d8" opacity=".3"/><polygon points="140,75 162,140 118,140" fill="#a8c4d8" opacity=".12"/><g transform="translate(140,52)"><line x1="-22" y1="-9" x2="22" y2="9" stroke="#2a2a3a" stroke-width="1.6"/><line x1="-22" y1="9" x2="22" y2="-9" stroke="#2a2a3a" stroke-width="1.6"/><circle cx="-22" cy="-9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="22" cy="-9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="-22" cy="9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="22" cy="9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><rect x="-8" y="-6" width="16" height="12" rx="2" fill="#2a2a3a"/><circle cx="0" cy="0" r="2" fill="#f9f9f9" opacity=".5"/><rect x="-2" y="6" width="4" height="5" rx="1" fill="#3a3a78" opacity=".8"/></g></svg></div><dl class=pane-spec><dt>environment</dt><dd>A bounded field plot, outdoor, with weather and bystanders.</dd><dt>action space</dt><dd>Flight path, spray nozzle on/off, dosage rate.</dd><dt>sensors</dt><dd>RGB + multispectral camera, GPS, IMU, tank-level sensor, wind sensor.</dd><dt>primary metric</dt><dd>Pest pressure reduction, normalised by chemical applied.</dd><dt>secondaries</dt><dd>Chemical drift, energy, flight time, area covered.</dd><dt>guardrails</dt><dd>Geofence + tether, no-fly buffer around bystanders, chemical-flow ceiling, weather lockout.</dd></dl></div><div class="pane pane-6"><div class=pane-header><div class=pane-icon><svg viewBox="0 0 32 32" fill="none" stroke="currentcolor" stroke-width="1.5" stroke-linecap="round"><rect x="3" y="7" width="26" height="18" rx="3"/><circle cx="12" cy="16" r="3"/><circle cx="12" cy="16" r="1" fill="currentcolor"/><line x1="18" y1="13" x2="25" y2="13"/><line x1="18" y1="16" x2="25" y2="16"/><line x1="18" y1="19" x2="25" 
y2="19"/></svg></div><h4 class=pane-title>Gel electrophoresis station</h4></div><p class=pane-blurb>An agarose gel tray, a power supply, and a UV camera. Set voltage and run time, then image the separated bands.</p><div class=pane-viz><svg viewBox="0 0 280 160" aria-label="Isometric gel electrophoresis station with gel tray and power supply"><polygon points="15,148 258,148 272,108 30,108" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g stroke="#d2cab2" stroke-width=".35" fill="none" opacity=".5"><line x1="30" y1="108" x2="15" y2="148"/><line x1="90" y1="108" x2="75" y2="148"/><line x1="150" y1="108" x2="140" y2="148"/><line x1="210" y1="108" x2="205" y2="148"/><line x1="30" y1="108" x2="272" y2="108"/><line x1="22" y1="128" x2="265" y2="128"/></g><polygon points="38,52 188,52 206,82 56,82" fill="#3a4a5a" stroke="#2a3a4a" stroke-width=".7"/><polygon points="56,82 206,82 206,102 56,102" fill="#2a3a4a" stroke="#1a2a3a" stroke-width=".5"/><polygon points="38,52 56,82 56,102 38,72" fill="#243444" stroke="#1a2a3a" stroke-width=".5"/><polygon points="46,57 185,57 200,79 61,79" fill="#dde8c0" stroke="#b8c890" stroke-width=".5" opacity=".96"/><polygon points="39,53 46,57 185,57 200,79 206,79 206,82 56,82 38,52" fill="#7ab4d8" opacity=".16"/><g stroke="#aeba98" stroke-width=".45" opacity=".65"><line x1="70" y1="57" x2="85" y2="79"/><line x1="97" y1="57" x2="112" y2="79"/><line x1="124" y1="57" x2="139" y2="79"/><line x1="151" y1="57" x2="166" y2="79"/><line x1="178" y1="57" x2="193" y2="79"/></g><g fill="#8a9278" opacity=".75"><rect x="48" y="56" width="4" height="2.5" rx=".5"/><rect x="74" y="56" width="4" height="2.5" rx=".5"/><rect x="100" y="56" width="4" height="2.5" rx=".5"/><rect x="127" y="56" width="4" height="2.5" rx=".5"/><rect x="154" y="56" width="4" height="2.5" rx=".5"/><rect x="180" y="56" width="4" height="2.5" rx=".5"/></g><polygon points="49,61 68,61 69,63 50,63" fill="#1a2a1a" opacity=".72"/><polygon points="50,66 69,66 70,68 51,68" fill="#1a2a1a" 
opacity=".65"/><polygon points="51,71 70,71 71,73 52,73" fill="#1a2a1a" opacity=".58"/><polygon points="52,75 71,75 72,77 53,77" fill="#1a2a1a" opacity=".5"/><polygon points="101,64 123,64 124,66 102,66" fill="#1a2a1a" opacity=".88"/><polygon points="103,74 125,74 126,76 104,76" fill="#1a2a1a" opacity=".8"/><polygon points="128,67 150,67 151,69 129,69" fill="#1a2a1a" opacity=".92"/><polygon points="155,65 175,65 176,67 156,67" fill="#1a2a1a" opacity=".45"/><polygon points="157,73 177,73 178,75 158,75" fill="#1a2a1a" opacity=".4"/><polygon points="218,73 250,73 256,88 224,88" fill="#2a2a3a" stroke="#1a1a28" stroke-width=".6"/><polygon points="224,88 256,88 256,108 224,108" fill="#1a1a28" stroke="#111" stroke-width=".5"/><polygon points="218,73 224,88 224,108 218,93" fill="#222232" stroke="#111" stroke-width=".5"/><rect x="226" y="76" width="24" height="9" rx="1" fill="#0d1a2a" stroke="#4a6aaa" stroke-width=".4"/><text x="229" y="83" font-size="6.5" fill="#4da" font-family="monospace">80V 22m</text><path d="M218 78q-11-2-12 5" stroke="#c33" stroke-width="1.2" fill="none" stroke-linecap="round"/><path d="M218 85q-10 1-12 4" stroke="#1a1a1a" stroke-width="1.2" fill="none" stroke-linecap="round"/><circle cx="206" cy="83" r="2.2" fill="#c33" stroke="#911" stroke-width=".4"/><circle cx="56" cy="84" r="2.2" fill="#1a1a1a" stroke="#333" stroke-width=".4"/></svg></div><dl class=pane-spec><dt>environment</dt><dd>Benchtop — gel box with buffer, power supply, UV transilluminator.</dd><dt>action space</dt><dd>Set voltage (10–150 V), run time, and sample loading volumes per lane.</dd><dt>sensors</dt><dd>UV camera, voltmeter, timer, buffer-level sensor.</dd><dt>primary metric</dt><dd>Target-band separation score at imaging time.</dd><dt>secondaries</dt><dd>Run time, buffer consumption, gel waste.</dd><dt>guardrails</dt><dd>Voltage ceiling (150 V), run-time cap, UV shield interlock, buffer-low cutoff.</dd></dl></div></div></div>
<h2 id="anatomy-of-a-physical-eval">Anatomy of a physical eval</h2>
<p>The diagram below sketches one possible anatomy for a physical eval.
It may not be complete, but it is a useful starting point.</p>
<figure class=fullwidth><div class=parts-fig><svg viewBox="0 0 780 420" aria-label="Exploded diagram of the components of a physical eval"><polygon points="190,300 530,300 480,210 240,210" fill="#ebe4d0" stroke="#bbb5a0" stroke-width=".6"/><g stroke="#d2cab2" stroke-width=".4" fill="none" opacity=".7"><line x1="275" y1="220" x2="225" y2="290"/><line x1="320" y1="220" x2="285" y2="290"/><line x1="365" y1="220" x2="345" y2="290"/><line x1="410" y1="220" x2="405" y2="290"/><line x1="255" y1="225" x2="465" y2="225"/><line x1="245" y1="250" x2="475" y2="250"/><line x1="225" y1="280" x2="490" y2="280"/></g><polygon class="parts-geofence" points="178,308 542,308 488,200 232,200" fill="none" stroke="#8b1a1a" stroke-width="1" stroke-dasharray="4,3"/><g><ellipse cx="290" cy="232" rx="18" ry="11" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".7"/><rect x="287" y="240" width="6" height="8" fill="#6b4a2a"/></g><g><ellipse cx="380" cy="232" rx="18" ry="11" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".7"/><rect x="377" y="240" width="6" height="8" fill="#6b4a2a"/></g><g><ellipse cx="275" cy="278" rx="24" ry="14" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".7"/><rect x="272" y="287" width="6" height="10" fill="#6b4a2a"/></g><g><ellipse cx="395" cy="278" rx="24" ry="14" fill="#7a9a5c" stroke="#5a7a44" stroke-width=".7"/><rect x="392" y="287" width="6" height="10" fill="#6b4a2a"/></g><g><line x1="210" y1="295" x2="210" y2="215" stroke="#2a2a3a" stroke-width="1.4"/><rect x="202" y="208" width="16" height="10" rx="2" fill="#2a2a3a"/><circle cx="210" cy="213" r="1.5" fill="#f9f9f9"/></g><g><line x1="510" y1="295" x2="510" y2="215" stroke="#2a2a3a" stroke-width="1.4"/><rect x="502" y="208" width="16" height="10" rx="2" fill="#2a2a3a"/><circle cx="510" cy="213" r="1.5" fill="#f9f9f9"/></g><ellipse cx="360" cy="265" rx="22" ry="5" fill="#000" opacity=".06"/><g transform="translate(360, 165)"><line x1="-22" y1="-9" x2="22" y2="9" stroke="#2a2a3a" stroke-width="1.8"/><line x1="-22" 
y1="9" x2="22" y2="-9" stroke="#2a2a3a" stroke-width="1.8"/><circle cx="-22" cy="-9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="22" cy="-9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="-22" cy="9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><circle cx="22" cy="9" r="6" fill="#f9f9f9" stroke="#3a3a78" stroke-width="1"/><rect x="-8" y="-6" width="16" height="12" rx="2" fill="#2a2a3a"/></g><g transform="translate(470, 130)"><rect x="0" y="0" width="92" height="44" rx="3" fill="#f9f9f9" stroke="#3a3a78" stroke-width=".7"/><text x="8" y="16" font-family="Inter, sans-serif" font-size="8" letter-spacing=".08em" fill="#888">FRUIT SAVED</text><text x="8" y="36" font-family="Inter, sans-serif" font-size="18" font-weight="600" fill="#3a3a78">94%</text><text x="78" y="36" font-family="Inter, sans-serif" font-size="9" fill="#5a9a5a" text-anchor="end">↑ 3</text></g><g><path d="M650 320Q540 280 380 175" fill="none" stroke="#3a3a78" stroke-width="1" stroke-dasharray="3,3" marker-end="url(#arrowhead)"/><circle cx="650" cy="320" r="3" fill="#3a3a78"/></g><defs><marker id="arrowhead" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto"><path d="M0 0 10 5 0 10z" fill="#3a3a78"/></marker></defs><g transform="translate(445, 213)"><rect x="0" y="4" width="20" height="26" rx="1" fill="#f9f9f9" stroke="#3a3a78" stroke-width=".8"/><rect x="6" y="0" width="8" height="8" rx="1" fill="#3a3a78"/><line x1="3" y1="14" x2="17" y2="14" stroke="#3a3a78" stroke-width=".6" opacity=".5"/><line x1="3" y1="18" x2="17" y2="18" stroke="#3a3a78" stroke-width=".6" opacity=".5"/><line x1="3" y1="22" x2="13" y2="22" stroke="#3a3a78" stroke-width=".6" opacity=".5"/></g><line x1="110" y1="90" x2="240" y2="210" stroke="#888" stroke-width=".6" stroke-dasharray="2,3"/><g class="cb"><circle cx="110" cy="90" r="11" fill="#2a2a3a"/><text x="110" y="94" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" 
font-weight="600" fill="#f9f9f9">1</text></g><text x="128" y="86" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">ENVIRONMENT</text><text x="128" y="100" font-family="Inter, sans-serif" font-size="9.5" fill="#888">the orchard</text><line x1="110" y1="220" x2="210" y2="216" stroke="#888" stroke-width=".6" stroke-dasharray="2,3"/><g class="cb"><circle cx="110" cy="220" r="11" fill="#2a2a3a"/><text x="110" y="224" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" font-weight="600" fill="#f9f9f9">2</text></g><text x="50" y="246" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">SENSORS</text><text x="50" y="260" font-family="Inter, sans-serif" font-size="9.5" fill="#888">perimeter cameras</text><line x1="360" y1="100" x2="360" y2="153" stroke="#888" stroke-width=".6" stroke-dasharray="2,3"/><g class="cb"><circle cx="360" cy="90" r="11" fill="#2a2a3a"/><text x="360" y="94" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" font-weight="600" fill="#f9f9f9">3</text></g><text x="378" y="86" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">ACTION SPACE</text><text x="378" y="100" font-family="Inter, sans-serif" font-size="9.5" fill="#888">drone, deterrents</text><line x1="600" y1="152" x2="565" y2="152" stroke="#888" stroke-width=".6" stroke-dasharray="2,3"/><g class="cb"><circle cx="615" cy="152" r="11" fill="#2a2a3a"/><text x="615" y="156" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" font-weight="600" fill="#f9f9f9">4</text></g><text x="633" y="148" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">METRIC</text><text x="633" y="162" font-family="Inter, sans-serif" font-size="9.5" fill="#888">+ secondaries</text><line x1="280" y1="370" x2="285" y2="312" stroke="#888" stroke-width=".6" 
stroke-dasharray="2,3"/><g class="cb"><circle cx="280" cy="380" r="11" fill="#2a2a3a"/><text x="280" y="384" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" font-weight="600" fill="#f9f9f9">5</text></g><text x="298" y="376" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">GUARDRAILS</text><text x="298" y="390" font-family="Inter, sans-serif" font-size="9.5" fill="#888">geofence, no-fly buffers</text><line x1="615" y1="370" x2="650" y2="330" stroke="#888" stroke-width=".6" stroke-dasharray="2,3"/><g class="cb"><circle cx="615" cy="380" r="11" fill="#2a2a3a"/><text x="615" y="384" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" font-weight="600" fill="#f9f9f9">6</text></g><text x="633" y="376" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">OPEN ACCESS</text><text x="633" y="390" font-family="Inter, sans-serif" font-size="9.5" fill="#888">remote operator input</text><line x1="615" y1="265" x2="466" y2="228" stroke="#888" stroke-width=".6" stroke-dasharray="2,3"/><g class="cb"><circle cx="615" cy="265" r="11" fill="#2a2a3a"/><text x="615" y="269" text-anchor="middle" font-family="Inter, sans-serif" font-size="11" font-weight="600" fill="#f9f9f9">7</text></g><text x="633" y="261" font-family="Inter, sans-serif" font-size="10.5" font-weight="600" letter-spacing=".08em" fill="#3a3a78">GOVERNANCE</text><text x="633" y="275" font-family="Inter, sans-serif" font-size="9.5" fill="#888">who sets the rules</text></svg></div><figcaption>Components of a physical eval.</figcaption></figure>
<ul>
<li><strong>An environment.</strong> The orchard, the bench, the cell line, the floor.</li>
<li><strong>An action space the eval can verify.</strong> What a participant is
allowed to do — fly the drone, dispense the reagent, move the parts —
needs to be observable enough that the system can confirm what
happened.</li>
<li><strong>Sensors.</strong> What the eval uses to know the state of the world.
Cameras, scales, thermocouples, microbiology assays, a human spot-check.</li>
<li><strong>A primary metric of utility,</strong> plus secondary metrics (cost, time,
resource use, energy).</li>
<li><strong>Safeties and guardrails.</strong> A net to catch the drone, a kill switch,
a fenced area, an interlock. Whatever ensures that a participant
failing — or <em>trying</em> to break things — doesn’t damage the orchard
or hurt the birds. A physical eval is, by construction, a
public-facing physical system that gives partial control of real
hardware to whoever holds the current slot. Keeping it open without
becoming dangerous — and without sacrificing utility — is the
hardest layer of the stack (see
<em>Safety for physical evals</em>).</li>
<li><strong>Governance.</strong> The rules of the eval and who controls them. Who
decides the primary metric and when it can change? Who can introduce
external hardware or a remote-control override? Is slot time fixed
or auctioned? Who arbitrates disputes, and by what process? Good
governance is what distinguishes an eval that stays honest over years
from one that quietly drifts to serve whoever is running it at the
time.</li>
</ul>
<p>This list is not final. Different domains will surface
components not named here — calibration drift, biological containment,
human-in-the-loop sign-off, regulatory constraints — and the right
abstraction is going to settle as people actually build the things.</p>
<h3 id="challenges-for-physical-evals">Challenges for physical evals</h3>
<p>The following are some of the harder design problems that don’t have
clean answers yet — and that any serious physical eval effort will have
to confront.</p>
<p><em>Goodharting.</em> Any eval with a numeric target invites unintended ways to hit it — with the wrinkle that in a physical eval the unintended ways can cause real-world harm. An agent optimising fruit saved might drive the deterrent so aggressively that birds <em>and</em> orchard workers avoid the area: the metric goes up, the orchard becomes unusable.</p>
<p><em>Non-stationarity.</em> The physical world changes whether or not anyone is running a slot. An orchard in week one is not the same orchard in week twelve:
season, weather, and pest population all shift. A wet-lab bench drifts
as reagent batches age. Field plots evolve. Comparing scores across time is therefore hard, sometimes impossible.</p>
<p><em>Sequential contamination.</em> Each participant leaves a trace for the
next. In a wet lab this is easy to see (reagents consumed, cultures
disturbed, hardware worn), but the problem is general: stock depleted
in a warehouse cell, soil compacted on a field plot, bird behaviour
shifted by a heavy deterrence week. Sequential slots work for
environments with a natural or cheap reset; they don’t work for
environments where state accumulates.</p>
<p><em>Latency as a confound.</em> A participant operating remotely over the
internet sees the environment through a sensor stream and acts through a
command channel, both of which have variable latency. Two agents with
identical policies but different network conditions will produce
different results. This is especially visible in fast-moving
environments — a drone avoiding a collision, a robot arm catching a
falling object.</p>
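<p>A toy simulation makes the confound concrete. The sketch below is an illustration under stated assumptions, not a model of any real eval: a hypothetical one-dimensional target drifts at constant speed, and the same proportional policy chases it while seeing observations that are <code>delay</code> steps stale. The function name, the gain, and the step count are all invented for the example.</p>

```python
from collections import deque


def mean_tracking_error(delay: int, steps: int = 200, gain: float = 0.5) -> float:
    """Run one fixed policy against a drifting target with stale observations.

    All names and constants here are hypothetical, chosen for illustration.
    """
    target, agent = 0.0, 0.0
    # The agent only ever sees the target position from `delay` steps ago.
    stale = deque([target] * (delay + 1), maxlen=delay + 1)
    total_error = 0.0
    for _ in range(steps):
        target += 1.0                       # the world keeps moving
        stale.append(target)                # newest reading enters the pipe
        observed = stale[0]                 # oldest reading is what arrives
        agent += gain * (observed - agent)  # identical policy in every run
        total_error += abs(target - agent)
    return total_error / steps


if __name__ == "__main__":
    for delay in (0, 2, 10):
        error = mean_tracking_error(delay)
        print(f"delay={delay:2d} steps  mean error={error:.2f}")
```

<p>The policy never changes between runs; only the staleness of its input does, and the mean error grows with it. A leaderboard that ignores this would partly be ranking network connections.</p>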
<p><em>Observer effect.</em> The sensors required to score an eval change what is
being measured. A camera rig that watches a field plot for pest activity
may deter the pests on its own. A flow sensor on a reagent line changes
the thermal environment of the bench. In some domains the effect is
negligible; in others it will corrupt the primary metric.</p>
<h2 id="safety-for-physical-evals">Safety for physical evals</h2>
<p>A physical eval that anyone on the internet can operate is, by
construction, a public attack surface on a real-world physical system.
The participant at any given slot might be a well-behaved research
team, an AI agent following a poorly-aligned policy, or a person who
wants to break things on purpose. The eval has to keep working —
usefully, openly, safely — across all three.</p>
<p><em>Somebody not fully trusted is about to make
the drone, the sprayer, the autoclave do something for the next twenty
minutes — what’s the worst that can happen, and how is it bounded?</em></p>
<h3 id="the-attack-surface">The attack surface</h3>
<p>A useful first pass is to categorise harms by who pays the cost:</p>
<ul>
<li><strong>Harm to the eval itself.</strong> The drone crashes, the cell line dies,
the robot arm jams. Cheap if the guardrails work — the operator
resets, the leaderboard absorbs the failure.</li>
<li><strong>Harm to the surrounding environment.</strong> Chemicals spill, the
orchard catches fire, a neighbouring field gets sprayed.</li>
<li><strong>Harm to humans.</strong> A bystander gets hit by the drone, an operator
gets burned, a patient sample gets switched. The category that matters most, and the hardest to bound.
(The lines between these categories blur in practice: chemical drift is
“environment” until a bystander walks through it.)</li>
<li><strong>Information harms.</strong> Footage of bystanders or proprietary processes
leaves the eval site; the eval is used as a covert surveillance
platform; sensor streams are exfiltrated.</li>
<li><strong>Generation of dangerous artifacts.</strong> The wet-lab cell is steered
toward synthesising something harmful; the sprayer drone is
weaponised; the autoclave is used to destroy evidence.</li>
</ul>
<p>Categories 1–3 are about <em>what can happen during a slot.</em> Categories
4–5 are about <em>what can leave the eval afterwards.</em> They want different
defences, and a serious eval needs both.</p>
<h3 id="defences-worth-building-ai-gen">Defences worth building</h3>
<p>None of the following is a finished answer. They are the moves worth
trying, evaluating, and writing up:</p>
<ul>
<li><strong>Time-slotting with audit.</strong> Single operator at a time, every
action logged, the whole slot replayable. The slowest defence and
the foundation everything else builds on.</li>
<li><strong>Action-space sandboxing.</strong> The eval enforces hard limits <em>inside</em>
its abstraction: max chemical per slot, max motion envelope, max
temperature ramp. The action space exposed to the operator is
strictly smaller than the action space the hardware can physically
produce.</li>
<li><strong>Dry-run validation.</strong> A submitted policy runs through a cheap
simulation pass first — not as the eval itself, but as a gate.
Refuses to execute on the physical system if the simulated run trips
any guardrail.</li>
<li><strong>Supervised / shadow modes.</strong> New operators run in
shadow mode (actions computed but not executed) for some number of
slots before they’re trusted with real actuation, like a learner’s
permit. Progressive trust
as the leaderboard accumulates evidence.</li>
<li><strong>Anomaly cut-outs.</strong> A separate monitor watches for off-distribution
sensor readings, sudden command spikes, too-clever-by-half action
sequences — and pulls the kill-switch before the eval owner has to.</li>
<li><strong>Open red-teaming.</strong> Each eval publishes its threat model and
invites external researchers to attack it. The right way to find the
holes is to invite people to look.</li>
<li><strong>Skin in the game.</strong> Operators bond a small amount per slot,
refundable on clean completion, forfeited if an audit finds
violation. Aligns incentives without requiring trust upfront.</li>
</ul>
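<p>To make two of these defences concrete, here is a minimal sketch of action-space sandboxing combined with a dry-run gate. Everything in it is an assumption made for illustration: the <code>Limits</code> and <code>Command</code> types, the limit values, and the use of simple static checks as a stand-in for the cheap simulation pass described above. A real eval would derive its limits from its own spec and hardware.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical per-slot limits; real values would come from the eval spec."""
    max_spray_ml: float = 50.0   # total chemical allowed in one slot
    max_radius_m: float = 30.0   # motion envelope around the home point


@dataclass(frozen=True)
class Command:
    x_m: float
    y_m: float
    spray_ml: float = 0.0


def violations(plan: list[Command], limits: Limits) -> list[str]:
    """Dry-run a plan against the limits without touching any hardware."""
    problems = []
    sprayed = 0.0
    for i, cmd in enumerate(plan):
        if cmd.spray_ml < 0:
            problems.append(f"step {i}: negative spray volume")
        if (cmd.x_m ** 2 + cmd.y_m ** 2) ** 0.5 > limits.max_radius_m:
            problems.append(f"step {i}: outside motion envelope")
        sprayed += max(cmd.spray_ml, 0.0)
        if sprayed > limits.max_spray_ml:
            problems.append(f"step {i}: slot spray budget exceeded")
    return problems


def gate(plan: list[Command], limits: Limits) -> bool:
    """Return True only if the whole plan passes; otherwise refuse to execute."""
    return not violations(plan, limits)


if __name__ == "__main__":
    plan = [Command(5, 5, 20), Command(40, 0, 10)]
    print(gate(plan, Limits()), violations(plan, Limits()))
```

<p>The point of the design is that the check runs on the submitted plan, before actuation, and that the limits live in the eval rather than in the operator’s code.</p>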
<p>Most of these are borrowed from adjacent fields — public cloud
security, scientific-facility time-sharing (telescope nights, beamline
schedules), bug-bounty programs, robotics-safety standards. <em>None</em> of
them have been worked out in detail for a public, openly-instrumented
physical system that AI agents are also supposed to operate. That’s a
research agenda in itself.</p>
<h2 id="an-open-movement-for-physical-evals">An open movement for physical evals</h2>
<p>Physical evals could become an open-source ecosystem: environments cheap
to set up, easy to fork, open to anyone with a problem worth
measuring.</p>
<p>The rest of this section traces the arc: where things have been (<em>Prior art</em>),
what it would have to cost (<em>What’s the Raspberry Pi of a physical
eval?</em>), and what an open ecosystem looks like in practice
(<em>Open at every layer</em>).</p>
<h3 id="prior-art">Prior art</h3>
<p>Physical-world AI competitions aren’t new. The <a href="https://en.wikipedia.org/wiki/DARPA_Grand_Challenge" target="_blank" rel="noopener noreferrer">DARPA Grand Challenge</a>
 put
autonomous vehicles in the Mojave; the
<a href="https://en.wikipedia.org/wiki/DARPA_Robotics_Challenge" target="_blank" rel="noopener noreferrer">DARPA Robotics Challenge</a>

put humanoids through disaster-response courses; the
<a href="https://robohub.org/amazon-picking-challenge/" target="_blank" rel="noopener noreferrer">Amazon Picking Challenge</a>

ran in warehouse mock-ups for several years;
<a href="https://www.robocup.org/" target="_blank" rel="noopener noreferrer">RoboCup</a>
 has been running its soccer
leagues since 1997, arguably the longest-lived physical eval in
continuous operation and the one with the most literature on what
makes it work and what it ends up measuring; the <a href="https://www.indyautonomouschallenge.com/" target="_blank" rel="noopener noreferrer">Indy Autonomous Challenge</a>
 and
<a href="https://en.wikipedia.org/wiki/Roborace" target="_blank" rel="noopener noreferrer">Roborace</a>
 have put driverless
cars on real circuits.</p>
<p>What these have in common: each was (or is) a sponsor-led, time-limited
event with closed protocols and bespoke infrastructure. They produced
brilliant moments and a small library of papers; they were expensive to
build and harder to reproduce.</p>
<h3 id="whats-the-raspberry-pi-of-a-physical-eval">What’s the Raspberry Pi of a physical eval?</h3>
<p>The hard constraint on all of this is cost. A DARPA-class eval needs
millions of dollars and a multi-year program; even a modest
research-grade one runs into expensive sensors,
networking, fail-safe hardware, and the human labour to keep it
operating. That ceiling is what makes physical evals rare today — and
rare evals can’t be the basis of an ecosystem.</p>
<p>So one of the most important questions this community can keep returning
to is the one in the heading, a stand-in for “the
cheapest plausible build”. The Raspberry Pi did this for hobbyist
computing; what’s the equivalent for physical evals?
What’s the bill of materials that brings a credible, instrumented,
openable physical eval down to the cost of a serious hobby project? Probably some mix of commodity sensors, a
single-board computer for the control loop, an open scheduling service
for time-share, off-the-shelf safety hardware, and a reference
orchestration stack that everyone forks. If the answer ends up being
<em>“a few hundred dollars and a weekend,”</em> the ecosystem can actually
form. If it stays at <em>“a few hundred thousand and a six-month build,”</em>
it stays a fantasy.</p>
<h3 id="open-at-every-layer">Open at every layer</h3>
<p>To further lower the cost of safely setting up a physical eval, we need off-the-shelf hardware and an open-source stack.</p>
<ul>
<li><strong>Open protocols.</strong> The spec of an eval (environment, action space,
sensors, metric, secondaries, guardrails) is published as a forkable
document, the same way a research benchmark is published.</li>
<li><strong>Open hardware.</strong> Sensor rigs, mechanical setups, fail-safe systems
default to off-the-shelf components, with reproducible bills of
materials and CAD files.</li>
<li><strong>Open software.</strong> Time-share scheduling, telemetry capture, scoring,
auditing — shared infrastructure, not a one-off codebase per eval.</li>
<li><strong>A community around it.</strong> People running, replicating, and forking
each other’s evals; people contributing sensor stacks and guardrail
designs; people maintaining the scoring code together. No single lab
can stand up enough physical evals to cover the interesting surface
of physical problems — a community can.</li>
</ul>
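<p>As a rough idea of what a forkable spec could look like, the sketch below encodes one as a small data structure that serialises to JSON. The schema is hypothetical: the field names simply mirror the components named in this piece, the orchard values are taken from the examples used here, and the vineyard fork is invented for illustration.</p>

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalSpec:
    """Hypothetical schema; field names mirror the components of a physical eval."""
    name: str
    environment: str
    action_space: list[str]
    sensors: list[str]
    primary_metric: str
    secondary_metrics: list[str] = field(default_factory=list)
    guardrails: list[str] = field(default_factory=list)


def fork(spec: EvalSpec, **changes) -> EvalSpec:
    """Copy a published spec and override selected fields."""
    # asdict() deep-copies the lists, so the fork shares no state with the original.
    return EvalSpec(**{**asdict(spec), **changes})


orchard = EvalSpec(
    name="orchard-pest-defence",
    environment="the orchard",
    action_space=["drone", "deterrents"],
    sensors=["perimeter cameras"],
    primary_metric="fruit saved / week",
    secondary_metrics=["cost", "time", "resource use", "energy"],
    guardrails=["geofence", "no-fly buffers"],
)

if __name__ == "__main__":
    # A made-up fork: same rig and rules, different site.
    vineyard = fork(orchard, name="vineyard-pest-defence", environment="a vineyard")
    print(json.dumps(asdict(vineyard), indent=2))
```

<p>Publishing the spec as plain data is what makes it diffable: a fork that quietly changes the metric or drops a guardrail shows up as a one-line change.</p>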
<h3 id="public-verifiability">Public verifiability</h3>
<p>The safety section above focuses on protecting the physical environment
from adversarial <em>participants</em>. There is a symmetric problem that gets
less attention: protecting participants — and the public — from
adversarial <em>eval runners</em>.</p>
<p>In an open world where anyone can wire a field, a lab bench, or a
warehouse cell to the internet and declare it a physical eval, the
operator controls the sensors, the scoring pipeline, and, in effect, the ground
truth. A dishonest operator can inflate results for a preferred team,
suppress evidence of harm, or fabricate the physical record entirely.
If physical evals are going to carry weight — as procurement signals,
safety certifications, or policy inputs — the data they produce has to
be trustworthy independent of whether the runner is trustworthy.</p>
<p>This is a valuable research
direction in its own right. Some threads worth pulling:</p>
<ul>
<li><strong>Tamper-evident sensors.</strong> Hardware-attested video streams that can
be verified as unedited after the fact — the physical analogue of a
signed log.</li>
<li><strong>Trusted execution environments.</strong> Running the scoring pipeline
inside a TEE means the operator cannot modify results without
breaking the attestation, even if they control the host machine.</li>
<li><strong>Cross-checking sensor redundancy.</strong> Multiple independent sensor
modalities covering the same physical event make coordinated
fabrication harder: a weight sensor, a camera, and an RFID log all
have to agree.</li>
<li><strong>Third-party witnesses.</strong> Spot audits by an independent party —
human or automated — who can access raw sensor streams without going
through the operator’s pipeline.</li>
</ul>
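<p>The “signed log” idea can be sketched in software as a hash chain, where each sensor record commits to the one before it. This is a simplified stand-in, not hardware attestation: the record layout and function names are assumptions, and a chain on its own only detects edits if its latest hash is signed by a key the operator does not hold, or published somewhere the operator cannot rewrite.</p>

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, reading: dict) -> str:
    """Hash a reading together with the hash of the entry before it."""
    payload = json.dumps(reading, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: list[dict], reading: dict) -> None:
    """Append a reading, binding it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"reading": reading, "hash": _digest(prev_hash, reading)})


def first_bad_entry(log: list[dict]) -> Optional[int]:
    """Return the index of the first entry whose hash does not match, if any."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        if entry["hash"] != _digest(prev_hash, entry["reading"]):
            return i
        prev_hash = entry["hash"]
    return None


if __name__ == "__main__":
    log: list[dict] = []
    for t, grams in enumerate([410, 395, 402]):   # made-up scale readings
        append(log, {"t": t, "weight_g": grams})
    log[1]["reading"]["weight_g"] = 999           # an after-the-fact edit
    print(first_bad_entry(log))
```

<p>Editing any past reading breaks the chain at that entry, so a verifier holding only the final hash can tell that the record was altered.</p>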
<p>Combinations are likely to be
necessary, and the right combination will vary by domain. What a
wet-lab needs to prove that a synthesis actually ran differs from what
an orchard needs to prove that a drone actually flew a slot. Building
this layer (call it <em>physical eval verification</em>) is what makes results
publicly verifiable, and it is at least as important as building the evals themselves.</p>
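<p>The redundancy thread, for instance, reduces to a simple agreement check once each modality has been turned into an estimate of the same quantity. The sketch below assumes that conversion has already happened; the sensor names, the counts, and the median-based rule are illustrative choices, and the rule only works while a majority of the modalities are honest.</p>

```python
import statistics


def disagreeing_sensors(estimates: dict[str, float], tolerance: float) -> list[str]:
    """Name the sensors whose estimate is further than `tolerance` from the median."""
    median = statistics.median(estimates.values())
    return sorted(
        name for name, value in estimates.items()
        if abs(value - median) > tolerance
    )


if __name__ == "__main__":
    # Hypothetical item counts for one order, one estimate per modality.
    slot = {"weight_sensor": 12.0, "camera": 12.0, "rfid_log": 19.0}
    print(disagreeing_sensors(slot, tolerance=1.0))
```

<p>A non-empty result does not say which sensor is lying, only that the physical record is inconsistent and the slot deserves an audit.</p>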
<h3 id="a-darwinian-ecosystem">A Darwinian ecosystem</h3>
<p>Not every physical eval will be a good one. Some will be hard, some trivial.
Some will be well-structured; some will be a mess. Some will scale to many
participants; some will only ever host one team at a time. That’s fine — even
desirable.
The shape of “what makes a good physical eval” is going to emerge from people
building them, breaking them, and learning what each one actually measured.</p>
<p>Sketched, a public registry for such an ecosystem might look like this:</p>
<div class=physevals-board><div class=bbar><div class=dots><span></span><span></span><span></span></div><div class=url>physevals.io · open registry · 126 evals</div></div><div class=page><div class=page-head><div class=page-title>Physical eval registry</div><div class=page-sub>43 accepting slots</div></div><div class=eval-rows><div class=er-row><div class=er-icon style=background:#5a7a44>O</div><div class=er-name><div class=er-title>Orchard pest defence</div><div class=er-sub>agriculture · outdoor · Greenfields UK</div></div><div class="er-status open">● open</div><div class=er-metric>fruit saved / week</div><div class=er-teams>18 teams</div></div><div class=er-row><div class=er-icon style=background:#4a6ea5>W</div><div class=er-name><div class=er-title>pH adjustment bench</div><div class=er-sub>chemistry · indoor · benchtop</div></div><div class="er-status open">● open</div><div class=er-metric>ΔpH from target</div><div class=er-teams>11 teams</div></div><div class=er-row><div class=er-icon style=background:#0e7c6e>V</div><div class=er-name><div class=er-title>Indoor vertical farm</div><div class=er-sub>agriculture · controlled environment</div></div><div class="er-status limited">◑ 2 slots left</div><div class=er-metric>g / kWh / cycle</div><div class=er-teams>6 teams</div></div><div class=er-row><div class=er-icon style=background:#6a5a95>P</div><div class=er-name><div class=er-title>Pick-and-pack cell</div><div class=er-sub>logistics · warehouse robotics</div></div><div class="er-status open">● open</div><div class=er-metric>correct orders / hour</div><div class=er-teams>23 teams</div></div><div class=er-row><div class=er-icon style=background:#8a5a5a>S</div><div class=er-name><div class=er-title>Outdoor sprayer drone</div><div class=er-sub>agri-robotics · field · safety-vetted access</div></div><div class="er-status soon">○ coming soon</div><div class=er-metric>pest Δ / ml sprayed</div><div class=er-teams>—</div></div><div class=er-row><div class=er-icon 
style=background:#7a7a8a>+</div><div class=er-name><div class=er-title>Submit an eval</div><div class=er-sub>open spec · CC-BY · any domain</div></div><div class="er-status soon"></div><div class=er-metric></div><div class=er-teams>→</div></div></div><div class=page-foot><span>updated 26 May · specs CC-BY</span>
<span>physevals.io is imagined</span></div></div></div>
<h3 id="more-than-evals">More than evals</h3>
<p>Every execution of a physical eval produces something beyond a score: a
timestamped record of sensor readings, actions taken, and outcomes
observed, all under conditions that were defined in advance and held
constant across participants. That record has value on its own.</p>
<p>The most immediate use is data collection. A team that runs an agent on
the orchard eval for a week doesn’t just get a leaderboard position —
they accumulate labelled trajectories in a real environment that would
be expensive to stage deliberately. Even failed attempts are informative:
a drone that misses a bird on Tuesday has documentation of exactly what
the environment looked like and what the agent did.</p>
<p>For some categories of problem, a further step is worth considering:
repurposing the eval environment as a training environment. The orchard
is already instrumented. The slot system already handles scheduling. If
the cost of running episodes is low enough — bird-deterrence is
essentially free to attempt, wet-lab synthesis is not — the same
infrastructure can run RL rollouts between evaluation windows. The
environment that scores a model on Monday can help train the next
version by Friday.</p>
<p>This doesn’t collapse the distinction between training and testing.
Eval integrity still requires held-out conditions, independent scoring,
and participants who didn’t design the environment. But the <em>hardware</em>
doesn’t have to be idle between eval slots, and the <em>data</em> generated
during evaluation doesn’t have to be discarded. For operators willing
to share trajectories under open licences, a physical eval site becomes
something closer to a living dataset — one that grows richer every time
a new agent takes a slot.</p>
<h2 id="physical-evals-as-a-market">Physical evals as a market</h2>
<p>Why set up a physical eval? It can be a way to crowdsource intelligence for an unsolved problem.</p>
<p>The eval ecosystem can act as a <em>market</em> for intelligence: any participant — human, agent, team, company, hobbyist —
can take a slot, attempt to saturate the metric, and submit. The
leaderboard answers the AI-selection question by revealing whose approach
actually delivers in the physical world.</p>
<p>Physical evals can be a way for problem-owners to
<em>delegate AI knowledge</em> to a market. This is the same
shift that happened with bug bounties. A company didn’t have to
<em>predict</em> who the best vulnerability researchers were; they had to
publish the surface and the rules, and the market sorted itself
out.
The grower doesn’t pick a model.
The hospital doesn’t pick a model. The factory doesn’t pick a model. They
pick a problem worth instrumenting and let the world’s AI builders compete
to be the answer. As more domains follow suit, an aggregate picture
emerges of where AI is actually good.</p>
<h2 id="get-in-touch">Get in touch</h2>
<p>If any of this resonates, please write. Three good reasons:</p>
<ul>
<li><strong>Already working in this space.</strong> Compare notes — what’s been learned
about sensing, guardrails, or keeping a system honestly open will
save the next person a lot of time.</li>
<li><strong>Have a physical problem worth instrumenting as an eval.</strong>
Worth thinking through together — what to measure, how to keep
it safe to open up, how to make it interesting enough that people show
up to compete.</li>
<li><strong>Have an eval to propose.</strong> Even if hosting isn’t feasible right now,
good proposals are valuable — they’re what an ecosystem of physical evals
is made of.</li>
</ul>
<p>DM <a href="https://twitter.com/iamnotnicola" target="_blank" rel="noopener noreferrer">@iamnotnicola</a>
 on X.</p>
<p>Let’s turn more of the physical world into something AI can be measured
against — and use that to point AI at problems that actually matter.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>This was written by Nicola Greco with the support of AI. It was brainstormed as part of ARIA’s
<a href="https://aria.org.uk/opportunity-spaces/trust-everything-everywhere/scaling-trust/" target="_blank" rel="noopener noreferrer">Scaling Trust</a>

programme, in collaboration with Alex Obadia. Thanks to Ross Taylor (General Reasoning), whose thinking influenced some of our ideas on physical evals.</p>
]]></content:encoded></item></channel></rss>