
After dozens of tests we crossed everyone else out: only Anthropic and OpenAI remained

As of 09.05.2026, our working choice comes down to OpenAI and Anthropic. This conclusion is based on practical use of the models in product development, programming, architecture, and agent execution.

Framing the question

When choosing an LLM provider, teams often look at token price, public benchmarks, or how loudly a brand is discussed. For real product and engineering work, that is not enough. In a working environment, other things matter more: result quality, stability on long tasks, behavior in agent scenarios, handling of structured outputs, and the amount of manual rework needed afterward.

As of 09.05.2026, our practical choice remains between OpenAI and Anthropic. We use both providers, because the best working trade-off currently lies between them. Other providers sometimes offer a much lower token price, but their overall output quality still lags enough that the total cost of delivering the product ends up higher.

Scope of the study

The conclusions in this piece are based on practical use of models across several classes of tasks: product development, day-to-day programming, bug fixing, architecture work, handling formal response schemas, and agent scenarios in which the model has to do more than answer and must carry a work cycle to completion.

We deliberately do not treat this as an academic benchmark. This is an applied assessment of providers by usefulness in real work: how much quality output you get per unit of time, how much manual finishing is needed after the model's answer, and how predictably the system behaves over long cycles.

Key conclusion

If you want the short version, it is this: OpenAI and Anthropic remain the two main working options today. Each has its own limitations, but these two still provide the strongest combination of quality, stability, and practical usefulness.

How we look at provider choice as of 09.05.2026
| Provider | Plus | Minus | Our conclusion |
| --- | --- | --- | --- |
| OpenAI | A strong general-purpose working stack for most tasks. | Does not support oneOf, which makes JSON Schema work poorly in more complex scenarios. | One of the two main working options. |
| Anthropic | Very strong in reasoning, architecture, and difficult code investigations. | If Claude Code is used as an AI agent, hitting the subscription limit can leave the task unfinished when the user is vibe coding. | The second main working option. |
| Other providers | Sometimes noticeably cheaper per token. | Quality is weaker on average, which increases rework and the final cost of the result. | We do not put them into the main stack yet. |

Observation 1. OpenAI is stronger as a general-purpose stack, but has a limitation in structured output

In our practice, OpenAI remains one of the two main working options for everyday product and engineering tasks. However, it has one specific limitation that materially affects integration quality in systems-oriented scenarios: structured output does not support the JSON Schema keyword oneOf.

In practice this hurts wherever the response schema is genuinely complex. The problem becomes visible in systems where the model output has to validate against a strict machine schema with branching formats, typed workflows, and downstream automated processing.

In other words, the limitation shows up most clearly in infrastructure. The more orchestration, validation, and downstream processing a system has, the more this limitation influences architectural choices.
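To make the branching-schema case concrete, here is a minimal sketch of a response schema that uses oneOf, plus a small helper that reports where the keyword appears so it can be caught before the schema reaches a provider. The schema, the action names, and the `find_keyword` helper are hypothetical illustrations and not part of any provider's API.

```python
from typing import Any, Iterator


def find_keyword(schema: Any, keyword: str, path: str = "$") -> Iterator[str]:
    """Yield the path of every dict key equal to `keyword` inside a schema.

    Naive on purpose: a property that happens to be *named* like the keyword
    would also be reported.
    """
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key == keyword:
                yield child
            yield from find_keyword(value, keyword, child)
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from find_keyword(item, keyword, f"{path}[{index}]")


# Hypothetical response schema: "action" must match exactly one of two shapes.
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {
            "oneOf": [
                {
                    "type": "object",
                    "properties": {
                        "kind": {"const": "create_ticket"},
                        "title": {"type": "string"},
                    },
                    "required": ["kind", "title"],
                },
                {
                    "type": "object",
                    "properties": {
                        "kind": {"const": "send_email"},
                        "to": {"type": "string"},
                    },
                    "required": ["kind", "to"],
                },
            ]
        }
    },
    "required": ["action"],
}

ONE_OF_PATHS = list(find_keyword(ACTION_SCHEMA, "oneOf"))
print(ONE_OF_PATHS)
```

A check like this could run in CI or at startup, so that a schema relying on oneOf is flagged early and the team can decide whether to restructure it or route that call elsewhere.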

Observation 2. Anthropic is strong in reasoning and architecture, but subscription limits matter in agent mode

In our practice, Anthropic remains the second main working option, especially for reasoning tasks, architectural breakdowns, and difficult code investigations. However, when Claude Code is used as an AI agent, another kind of limitation appears: if the subscription limit is hit, the task may remain unfinished.

This is especially critical when the user works in vibe coding mode and expects a whole action cycle to complete. In that context, reasoning strength matters, and so does the predictability of getting the task to the finish.

So in this case Anthropic's limitation is tied primarily to the stability of the agent scenario under real usage conditions.
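One generic way to soften an interrupted agent run is to checkpoint progress after each step so the cycle can resume later instead of starting over. The sketch below is an assumption-laden illustration: `LimitReached`, `run_with_checkpoint`, and the step names are hypothetical and do not describe how Claude Code or any provider reports or handles limits.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


class LimitReached(Exception):
    """Hypothetical signal that the provider will not do more work for now."""


def run_with_checkpoint(steps: List[Callable[[], str]], checkpoint: Path) -> bool:
    """Run steps in order, saving progress after each completed step.

    Returns True if every step finished, False if a limit interrupted the run.
    """
    done = json.loads(checkpoint.read_text())["done"] if checkpoint.exists() else 0
    for index in range(done, len(steps)):
        try:
            steps[index]()
        except LimitReached:
            return False  # everything before `index` is already recorded
        checkpoint.write_text(json.dumps({"done": index + 1}))
    return True


# Demo with a fake usage budget standing in for a subscription limit.
calls: List[str] = []
budget = {"left": 2}


def make_step(name: str) -> Callable[[], str]:
    def step() -> str:
        if budget["left"] == 0:
            raise LimitReached(name)
        budget["left"] -= 1
        calls.append(name)
        return name
    return step


steps = [make_step(n) for n in ("plan", "edit", "test")]
checkpoint_path = Path(tempfile.mkdtemp()) / "agent_checkpoint.json"

first_run = run_with_checkpoint(steps, checkpoint_path)   # budget runs out at "test"
budget["left"] = 5                                        # the limit window resets
second_run = run_with_checkpoint(steps, checkpoint_path)  # resumes at "test"
print(first_run, second_run, calls)
```

The point of the design is that the second run does not repeat "plan" and "edit": the unfinished cycle costs one delayed step, not a full restart.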

How we usually split OpenAI and Anthropic by task
| Scenario | What is critical | Whom we take more often |
| --- | --- | --- |
| Routine product and engineering work | A stable general-purpose result | OpenAI or Anthropic depending on the task format |
| Strictly structured outputs and complex schemas | Quality of JSON Schema handling | Be more careful with OpenAI if oneOf is critical |
| Long architectural breakdowns and difficult bugs | Depth of reasoning and analysis quality | More often Anthropic or senior OpenAI models |
| A long agent cycle with no right to stop halfway | Predictability of task completion | We look at usage-mode limitations in advance |

Observation 3. Cheaper providers often lose on total economics

Alternative providers may look attractive on token price. Sometimes the cost difference is indeed large. But in real development, token price is not the main measure of efficiency.

If the model holds context worse, writes weaker code, makes more architectural mistakes, and requires more manual finishing, then the same product takes more time to build. The final cost then grows through human effort: more iterations, more fixes, and more team time spent stabilizing the result.

That is why, in our assessment, cheap providers often lose on total economics. On paper the token is cheaper, but the cost of delivering a comparable result turns out higher.
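The argument above can be written as simple arithmetic: total delivery cost is token spend plus the human time spent on rework. The sketch below uses made-up numbers purely for illustration; they are not measurements from our work, and `DeliveryCost` and `break_even_hours` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DeliveryCost:
    token_cost: float    # spend on model tokens for the task
    rework_hours: float  # human time spent fixing the model's output
    hourly_rate: float   # cost of one engineer hour

    def total(self) -> float:
        """Token spend plus the cost of manual rework."""
        return self.token_cost + self.rework_hours * self.hourly_rate


def break_even_hours(token_saving: float, hourly_rate: float) -> float:
    """Extra rework hours at which a token saving is fully cancelled out."""
    return token_saving / hourly_rate


# Illustrative numbers only.
stronger = DeliveryCost(token_cost=40.0, rework_hours=2.0, hourly_rate=60.0)
cheaper = DeliveryCost(token_cost=4.0, rework_hours=5.0, hourly_rate=60.0)

print(stronger.total())  # 40 + 2 * 60
print(cheaper.total())   # 4 + 5 * 60
print(break_even_hours(stronger.token_cost - cheaper.token_cost, 60.0))
```

With these assumed figures, a tenfold saving on tokens is cancelled by well under one extra hour of rework, which is why token price alone is a weak basis for choosing a provider.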

Model recommendations

gpt-5.4

The base working model for most tasks. We recommend not going below medium.

gpt-5.4-high

A good choice when you need a stronger engineering and product result in day-to-day work.

sonnet-4.6 + thinking high-extra high

A strong working mode for a large class of engineering tasks where a stable result is needed without moving into the most expensive models.

gpt-5.5-extra high and opus-4.7

A targeted boosted mode for architecture and for bugs that other models have repeatedly failed to fix.

In applied work, our base recommendation looks like this: use gpt-5.4 at medium or above. gpt-5.4-high and sonnet-4.6 + thinking high-extra high handle day-to-day working tasks well. gpt-5.5-extra high and opus-4.7 make sense as a boosted mode for architecture and for bugs that other models have repeatedly failed to fix.
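These recommendations can be captured as a small routing table in code. This is only a sketch: the strings are the labels used in this article, not verified API model identifiers, and `TaskClass`, `ROUTES`, and `candidates` are hypothetical names.

```python
from enum import Enum
from typing import Dict, List


class TaskClass(Enum):
    ROUTINE = "routine"
    ENGINEERING = "engineering"
    ARCHITECTURE = "architecture"
    STUBBORN_BUG = "stubborn bug"


# Labels follow the article's wording; map them to real model IDs in your own config.
ROUTES: Dict[TaskClass, List[str]] = {
    TaskClass.ROUTINE: ["gpt-5.4"],
    TaskClass.ENGINEERING: ["gpt-5.4-high", "sonnet-4.6 + thinking high-extra high"],
    TaskClass.ARCHITECTURE: ["gpt-5.5-extra high", "opus-4.7"],
    TaskClass.STUBBORN_BUG: ["gpt-5.5-extra high", "opus-4.7"],
}


def candidates(task: TaskClass) -> List[str]:
    """Return the models we would try for a task class, in order of preference."""
    return list(ROUTES[task])


print(candidates(TaskClass.ENGINEERING))
```

Keeping the mapping in one place makes it easy to revise as models and limits change, without touching the code that calls them.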

Bottom line

If you choose a provider by actual work quality, then as of 09.05.2026 our choice remains between OpenAI and Anthropic. For teams that need a working result in product, code, and agent scenarios, the practical choice now lies between these two providers.

That is why we use both. We are not looking for one absolute winner. We choose the stack by task class, product constraints, and result requirements. In the current state of the market, this strategy gives the most stable practical outcome.

Article author: Khasan Mukhabbatov