We gave an agent five days to build a marketplace. It worked, and it hurt

We spent several days testing whether an agent can receive a spec once and then drive development of DoveryAI Market on its own. The short answer: yes, this is possible. The more honest answer: it requires orchestration, recovery, strict acceptance, and constant tuning. Quality is still uneven and often weak, but as a concept for a long-running autonomous loop, the experiment proved itself.

Authors: Khasan Mukhabbatov and Mikhail Golov

We had a technical spec for DoveryAI Market: a B2B partner adds an item, an operator checks and publishes it, a buyer sees only a properly released offer, and the whole chain keeps statuses, item trust, and action history. We used that task as a founder-agent test: the agent receives a product goal and then plans and drives development in large steps.
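
To make the chain concrete, here is a minimal sketch of how such an item lifecycle could be modeled. It is not the code the agent produced: the status names, the `Item` class, the `ALLOWED` transition table, and the `storefront` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ItemStatus(Enum):
    # Hypothetical status names; the real spec may use different ones.
    SUBMITTED = "submitted"    # partner added the item
    IN_REVIEW = "in_review"    # operator is checking it
    PUBLISHED = "published"    # released to the buyer storefront
    REJECTED = "rejected"      # operator declined it


# Assumed transition table: which statuses may follow the current one.
ALLOWED = {
    ItemStatus.SUBMITTED: {ItemStatus.IN_REVIEW},
    ItemStatus.IN_REVIEW: {ItemStatus.PUBLISHED, ItemStatus.REJECTED},
    ItemStatus.PUBLISHED: set(),
    ItemStatus.REJECTED: set(),
}


@dataclass
class Item:
    title: str
    status: ItemStatus = ItemStatus.SUBMITTED
    history: list = field(default_factory=list)  # (actor, old, new) tuples

    def move(self, actor: str, new_status: ItemStatus) -> None:
        """Change status if the transition is allowed and record who did it."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"{self.status.value} -> {new_status.value} is not allowed"
            )
        self.history.append((actor, self.status.value, new_status.value))
        self.status = new_status


def storefront(items):
    """Buyers see only items that an operator has published."""
    return [item for item in items if item.status is ItemStatus.PUBLISHED]
```

Keeping the allowed transitions in one table means the rule "a buyer sees only a properly released offer" is enforced in a single place, and every status change leaves an entry in the action history.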

This was not a normal coding assistant mode where a human writes the next instruction every ten minutes. We gave the spec once. After that, we did not write product code by hand. Most of the manual work went into Hermes, the system that orchestrates agent development: it runs Codex/Claude, watches tasks, recovers stuck sessions, and helps continue with the next step.

What worked

Within several days, a working alpha slice of a CRM+marketplace appeared. The scale was roughly five days of experimentation, about 15 hours of active work per day. The important result was not the number of lines, but the fact that a connected product scenario appeared:

a partner can pass an item into the system;
an operator sees what needs to happen with that item;
after operator control, the item can enter the buyer storefront;
a buyer can express interest;
that interest returns to the operator and partner as a working signal.

This is not yet a production marketplace. Payments, OAuth, legal details, operations, and security cannot be treated as complete. But as an alpha scenario for checking a business hypothesis, it works: you can open the product, walk the main path, and discuss it concretely.

What mattered most

We were not testing whether a model can write one file. We were testing whether an agent can work for days toward one product goal. The result was useful: the agent really can move a product forward, return to bugs, fix scenarios, add tests, update documentation, and continue the next step.

But it quickly became clear that a model plus repository access is not enough. You need a system around the agent. Hermes had to be improved during the experiment: better planning, recovery for stuck sessions, result checks, business-value framing instead of file lists, and separate frontend/UI control.
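
Recovery for stuck sessions can be pictured as a small supervisor loop. The sketch below is not how Hermes is implemented; it only assumes that one agent step can be launched as a subprocess, and the `run_step` function, its parameters, and the status labels are hypothetical.

```python
import subprocess
import sys


def run_step(command, timeout_s, max_attempts=3):
    """Run one agent step; restart it if it hangs or exits with an error.

    Returns (succeeded, attempts), where attempts is a list of
    (attempt_number, outcome) tuples that can be kept for later audit.
    """
    attempts = []
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the step does not finish within timeout_s seconds.
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            attempts.append((attempt, "stuck"))
            continue
        if result.returncode == 0:
            attempts.append((attempt, "ok"))
            return True, attempts
        attempts.append((attempt, "failed"))
    return False, attempts


if __name__ == "__main__":
    # A stand-in "agent step" that finishes immediately.
    print(run_step([sys.executable, "-c", "print('done')"], timeout_s=30))
```

A real orchestrator would also need to restore context before retrying; this sketch covers only the detect-and-restart part.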

Possible

An agent can receive a large product goal and move it forward for several days without a human writing the code manually.

Painful

Without orchestration, recovery, acceptance criteria, and checks, the process quickly becomes fragile.

Quality is weak for now

A functionally working screen appears much faster than a screen you would confidently show as a product.

Where quality broke down

The first problem was that Hermes was not ready out of the box for this autonomous development mode. Founder-mode needs a long horizon, recovery after failures, active task control, planning quality, and a clear definition of useful product output.

The second problem was that the agent often tried to pick tasks that were too small. A local improvement might look technically fine but barely move the product. In startup mode, that is not enough: the task needs to create a visible business or product shift.

The third problem was UX/UI. An agent can create a functionally working screen that looks like a technical demo panel instead of a marketplace or CRM. After seeing that, we tightened first-viewport requirements, removed technical language from user screens, oriented the UI toward familiar marketplace patterns, and required browser proof.
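
One way to picture this tighter acceptance is a gate that refuses a step unless it comes with more than passing tests. This is a sketch under assumptions, not the actual Hermes check: the `StepReport` fields and the `acceptance_problems` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StepReport:
    # All fields are assumed for illustration.
    tests_passed: bool
    screenshot_path: str   # browser proof; empty string if none was captured
    product_change: str    # one sentence on what changed for a user


def acceptance_problems(report):
    """Return the reasons a step cannot be accepted; an empty list means accept."""
    problems = []
    if not report.tests_passed:
        problems.append("tests failed")
    if not report.screenshot_path:
        problems.append("no browser proof")
    if not report.product_change.strip():
        problems.append("no visible product change described")
    return problems
```

The point of the sketch is that green tests alone return a non-empty problem list: a step also has to show the screen and state what changed for the user.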

The fourth problem was external integrations. OAuth, payments, legal and operational scenarios cannot be proven without a real integration environment, keys, sandbox conditions, and responsible decisions. The agent can prepare boundaries, statuses, and diagnostics, but production-ready integrations require a real loop.
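
A boundary with diagnostics might look like the sketch below: the code reports whether an integration is configured at all, without claiming that it works. The `IntegrationStatus` class, the `check_integration` function, and the environment variable names in the usage note are assumptions, not names from the project.

```python
import os
from dataclasses import dataclass


@dataclass
class IntegrationStatus:
    name: str
    configured: bool   # keys are present; says nothing about real-world behavior
    detail: str


def check_integration(name, required_env, env=None):
    """Report which required settings are missing for one integration."""
    env = os.environ if env is None else env
    missing = [key for key in required_env if not env.get(key)]
    if missing:
        return IntegrationStatus(name, False, "missing: " + ", ".join(missing))
    return IntegrationStatus(
        name, True, "configured, not yet verified against a sandbox"
    )
```

For example, `check_integration("payments", ["PAYMENTS_API_KEY"])` with a hypothetical `PAYMENTS_API_KEY` variable unset yields a status whose detail names the missing key. Even the configured case deliberately stops short of "ready", since proving that still needs a real sandbox run.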

What the experiment proved and what it did not
Area	Status	Comment
Long-running agent work	Proven as a concept	The agent can hold one product goal for several days and continue working.
Alpha marketplace	Works	The partner -> operator -> storefront -> buyer interest path appeared.
UI quality	Uneven	Visual acceptance is required or the result becomes a technical demo panel.
Integrations	Not proven for production	Without a real sandbox and keys, you mostly get boundaries and diagnostics.
Autonomy without a system	Not reliable	You need tools, recovery, scope, checks, audit, and critical-step control.

What this says about trusting agents

The main conclusion matches our broader position: trust in an agent is built not by believing in the model, but by the architecture around it. If one agent gets too wide a scope, access to everything, and the expectation that it will be product, UX, backend, QA, and DevOps at once, there will be many mistakes. The same would happen to a human in a badly organized role.

Narrow role: the agent should understand its area of responsibility.
Limited access: zero-trust is better than hoping the model will be careful.
Observability: actions should be logged and verifiable.
Acceptance: compilation and tests are not the same as product readiness.
Recovery: long-running work must continue after stuck sessions and failed steps.
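
The first three principles can be combined in a few lines. The sketch below assumes a hypothetical `ScopedAgent` wrapper and a plain dictionary of tools; it illustrates a tool allowlist with an audit trail and is not a description of Hermes internals.

```python
from dataclasses import dataclass, field


@dataclass
class ScopedAgent:
    role: str                  # narrow role, e.g. "frontend"
    allowed_tools: frozenset   # limited access: everything else is denied
    audit_log: list = field(default_factory=list)

    def call(self, tool, registry, *args):
        """Run a tool only if this role may use it; log every attempt."""
        allowed = tool in self.allowed_tools
        # Observability: denied attempts are recorded as well.
        self.audit_log.append(
            {"role": self.role, "tool": tool, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{self.role} may not use {tool}")
        return registry[tool](*args)
```

Denying by default and logging the denial is the design choice here: the system does not rely on the model being careful, and a reviewer can later see what the agent tried to do.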

Why this still matters

Honestly, the quality is still bad compared with what we want from a mature product. The agent often produces something working but rough. Sometimes it drifts into small tasks. Sometimes it leaves weak interfaces. Sometimes the surrounding system needs a lot of tuning. So it would be wrong to sell this as fully autonomous production development.

But as a concept and as a toy that can work for a long time, the experiment proved itself. This is no longer a one-off prompt or a nice screenshot. It is a mode where an agent can move one business hypothesis forward for several days, assemble a connected product scenario, and leave behind something concrete to discuss.

What this means for business

The founder-agent approach already looks applicable to fast alpha and MVP tasks, especially when the product can be expressed through roles, statuses, and scenarios. It is useful when you need a working object for discussion rather than a presentation: what path the partner takes, where an operator is needed, what the buyer sees, which statuses make sense, and where integrations are prepared but not yet ready.

If the agent system has a narrow scope, tools, the right aggressiveness mode, and proper acceptance, it can handle part of the work not much worse than a person in a similar role. But this does not yet apply to security, payments, legally sensitive actions, and other areas where the cost of error is high.

Bottom line

We gave the agent a spec once and did not manage it like a normal developer after that. In several days, it assembled a working alpha slice of a CRM+marketplace and showed that founder-agent mode is realistic.

But the main lesson is not that "AI writes the product by itself." The lesson is different: autonomous development starts working when the model has the right management system around it. Without that system, the agent quickly drifts into small tasks, weak UI, unclear decisions, and recovery problems. With it, the agent can already move a product hypothesis forward for several days.

Related context: trust in agents, tools vs skills, and agent aggressiveness.
