The eval is the product.
If you can't measure 'better', you can't get there. Everything we build starts with an eval set, ends with an eval score, and is improved against the eval over time.
About
ncodelab is a one-person AI automation studio (for now), based in Warsaw, working with teams across Europe and the US.
Most AI agency work in 2026 is theatre: demos that wow leadership and break in production, RAG pipelines that hallucinate citations, agents that score great on Twitter and fail on Tuesday. We started ncodelab to do the unglamorous half of the work — evaluation, observability, integration, iteration — because that's the half that determines whether the system actually ships.
We use models, prompts, and tools you've heard of. Novelty is a liability when something needs to run unattended at 3 AM.
One workflow, one agent, one metric. Then the next. Big-bang projects are how AI work goes to die.
Right now: Tomasz Chmielarz (engineer, ten years in product, former tech lead). Collaborators are brought in per project rather than kept on a generic agency bench.