May 18, 2026 · 4 min read · by Tomasz Chmielarz
We exist because most AI agency work is theatre
Demos that wow leadership and break in production are the default — and that's the gap we built ncodelab to fill.

When someone tells me they "just deployed an AI agent," I now ask one question: what's the eval score? Nine times out of ten there's a long pause.
That pause is the entire gap.
The current playbook looks like this. A consulting firm wins a retainer to build an agent. Three weeks in, there's a demo that lights up Slack. Procurement signs a six-figure contract. Twelve weeks later the agent ships into a production tool that nobody on the operator's side actually uses, because — surprise — it gets the answer wrong on the cases that matter, and the team has quietly routed around it.
Nobody is intentionally building this outcome. It's the path of least resistance.
The demo is the work, because the demo is what gets the contract signed. Everything that comes after the demo — the evaluation harness, the integration, the cost-per-call dashboards, the rollback plan, the iteration cycle — is unglamorous. It doesn't fit in a slide. It's hard to bill at agency rates. So it gets compressed, then skipped, then forgotten.
Meanwhile the operator who actually has to live with the agent never had a meaningful say in what "works" means. They get handed a deployment and asked to validate it.
Three things, none of them clever.
The eval is the contract. Before we write a single prompt, we sit with the operator and build a test set of 30–100 real cases. They score them. We agree on what acceptable looks like as a number. That number is the deliverable. If the system doesn't hit the number, we don't claim success.
We integrate before we generalise. A narrow agent wired into one workflow with proper observability is more valuable than a flexible agent that no one trusts. We pick one job, ship it end-to-end, and only then ask what's next.
We stay attached until the metric moves. Most agencies leave when the project plan ends. We leave when the numbers stabilise. The difference is what makes our work compound instead of decay.
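To make the first of these concrete, here is a minimal sketch of what "the eval is the contract" can look like in code. It is an illustration under stated assumptions, not a description of our actual tooling: the names `EvalCase`, `pass_rate`, `meets_contract` and `toy_agent` are hypothetical, the four cases are invented stand-ins for the 30–100 real ones an operator would supply, and scoring is a plain case-insensitive match where a real engagement would use whatever rubric the operator agrees to.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One case supplied and scored by the operator (hypothetical shape)."""
    prompt: str
    expected: str


def pass_rate(cases: list[EvalCase], agent: Callable[[str], str]) -> float:
    """Fraction of cases where the agent's answer matches the expected one.

    Assumption: a case-insensitive exact match stands in for the operator's rubric.
    """
    if not cases:
        raise ValueError("the test set must contain at least one case")
    passed = sum(
        1
        for case in cases
        if agent(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


def meets_contract(
    cases: list[EvalCase], agent: Callable[[str], str], threshold: float
) -> bool:
    """True only if the agreed number is hit; otherwise no claim of success."""
    return pass_rate(cases, agent) >= threshold


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent: a crude keyword rule."""
    return "refund" if "refund" in prompt.lower() else "escalate"


# Illustrative cases only; a real test set comes from the operator.
cases = [
    EvalCase("Customer asks for a refund on a late order", "refund"),
    EvalCase("Customer reports a billing error", "escalate"),
    EvalCase("Customer wants a refund and threatens legal action", "escalate"),
    EvalCase("Customer asks about delivery time", "answer"),
]

score = pass_rate(cases, toy_agent)
print(f"pass rate: {score:.2f}")
print("contract met:", meets_contract(cases, toy_agent, threshold=0.75))
```

The point of the sketch is the ordering: the cases and the threshold exist before the agent does, and success is a comparison against that number, not a judgment call made after the demo.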
We work best with teams who:
We're not the right fit if you need a 50-slide AI strategy or a chatbot to put on the website. There are plenty of agencies for that.
Tell us what's broken. Two sentences are enough. We reply within two business days with a yes, a sharp redirect, or a question worth your time.
We'd rather earn the first case study than fabricate one.