As of 09.05.2026, our working choice remains between OpenAI and Anthropic. This conclusion is based on practical use of models in product development, programming, architecture, and agent execution.
When choosing an LLM provider, teams often look at token price, public benchmarks, or how loud the brand is. For real product and engineering work that is not enough. In a working environment, other things matter more: result quality, stability on long tasks, behavior in agent scenarios, handling of structured outputs, and the amount of manual rework needed afterward.
As of 09.05.2026, our practical choice remains between OpenAI and Anthropic. We use both providers because this is where the main working trade-off lies today. Other providers sometimes offer a much lower token price, but their overall output quality still lags enough that the total cost of delivering the product ends up higher.
The conclusions in this piece are based on practical use of models across several classes of tasks: product development, day-to-day programming, bug fixing, architecture work, handling formal response schemas, and agent scenarios in which the model has to do more than answer: it has to carry a work cycle to completion.
We deliberately do not treat this as an academic benchmark. This is an applied assessment of providers by usefulness in real work: how much quality output you get per unit of time, how much manual finishing is needed after the model's answer, and how predictably the system behaves over long cycles.
If you want the short version, it is this: OpenAI and Anthropic remain the two main working options today. Each has its own limitations, but these two still provide the strongest combination of quality, stability, and practical usefulness.
| Provider | Strength | Weakness | Our conclusion |
|---|---|---|---|
| OpenAI | A strong general-purpose working stack for most tasks. | Does not support oneOf, so JSON Schema works poorly in more complex scenarios. | One of the two main working options. |
| Anthropic | Very strong in reasoning, architecture, and difficult code investigations. | When Claude Code is used as an AI agent, hitting the subscription limit can leave a task unfinished in the middle of a vibe coding session. | The second main working option. |
| Other providers | Sometimes noticeably cheaper per token. | Quality is weaker on average, which increases rework and the final cost of the result. | We do not put them into the main stack yet. |
In our practice, OpenAI remains one of the two main working options for everyday product and engineering tasks. However, it has one specific limitation that materially affects integration quality in systems-oriented scenarios: the lack of oneOf support.
In practice this makes JSON Schema harder to use where the response schema is genuinely complex. In JSON Schema, oneOf means "exactly one of these alternatives", which is the natural way to describe branching formats. The problem becomes visible in systems where the model output has to validate against a strict machine schema with branching formats, typed workflows, and downstream automated processing.
In other words, the limitation shows up most clearly in infrastructure. The more orchestration, validation, and downstream processing there is in the system, the more this limitation influences architectural choices.
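One common way to live with this limitation can be sketched as follows. This is a minimal illustration, not a description of our production setup: the schema, the `kind` discriminator field, and both helper functions are hypothetical, and it assumes the target endpoint accepts `anyOf` and `const`, which should be checked against the provider's current documentation. The idea is that when every branch carries a distinct, required tag, the branches are mutually exclusive, so `anyOf` accepts the same documents that `oneOf` would.

```python
# Hypothetical response schema: an action is either a message or a tool call.
SCHEMA = {
    "type": "object",
    "properties": {
        "action": {
            "oneOf": [
                {
                    "type": "object",
                    "properties": {
                        "kind": {"const": "message"},
                        "text": {"type": "string"},
                    },
                    "required": ["kind", "text"],
                    "additionalProperties": False,
                },
                {
                    "type": "object",
                    "properties": {
                        "kind": {"const": "tool_call"},
                        "tool": {"type": "string"},
                    },
                    "required": ["kind", "tool"],
                    "additionalProperties": False,
                },
            ]
        }
    },
    "required": ["action"],
    "additionalProperties": False,
}


def branches_are_discriminated(branches, field="kind"):
    """True if every branch requires `field` and pins it to a distinct const."""
    tags = []
    for branch in branches:
        const = branch.get("properties", {}).get(field, {}).get("const")
        if const is None or field not in branch.get("required", []):
            return False
        tags.append(const)
    return len(set(tags)) == len(tags)


def one_of_to_any_of(node):
    """Return a copy of the schema with every `oneOf` key renamed to `anyOf`.

    Simplification: assumes no property is literally named "oneOf" and no
    node carries both keywords at once.
    """
    if isinstance(node, dict):
        return {
            ("anyOf" if key == "oneOf" else key): one_of_to_any_of(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [one_of_to_any_of(item) for item in node]
    return node


if __name__ == "__main__":
    branches = SCHEMA["properties"]["action"]["oneOf"]
    if branches_are_discriminated(branches):
        converted = one_of_to_any_of(SCHEMA)
        print(sorted(converted["properties"]["action"]))
```

The rewrite is only safe when the discriminator check passes. Without distinct tags, `anyOf` would also accept outputs that match several branches at once, so the full `oneOf` schema should still be validated on the receiving side.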
In our practice, Anthropic remains the second main working option, especially for reasoning tasks, architectural breakdowns, and difficult code investigations. However, when Claude Code is used as an AI agent, another kind of limitation appears: if the subscription limit is hit, the task may remain unfinished.
This is especially critical when the user works in vibe coding mode and expects a whole action cycle to be completed. In that context, reasoning strength is not the only thing that matters; so does the predictability of actually getting the task finished.
So in this case Anthropic's limitation is tied primarily to the stability of the agent scenario under real usage conditions.
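One way to make an interrupted agent cycle less painful is to record finished steps so that the next session resumes instead of starting over. The sketch below is purely illustrative: the numeric `budget`, the per-step `cost`, and the `run_with_checkpoints` helper are assumptions made for the example, not something Claude Code or any specific API exposes in this form.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, budget, checkpoint):
    """Run (name, cost, action) steps in order, persisting completed names.

    Stops before a step that does not fit into the remaining budget.
    Returns True only if every step has been completed.
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for name, cost, action in steps:
        if name in done:
            continue  # already finished in an earlier session
        if cost > budget:
            return False  # stop cleanly instead of failing halfway through
        action()
        budget -= cost
        done.append(name)
        checkpoint.write_text(json.dumps(done))
    return True


if __name__ == "__main__":
    log = []
    steps = [
        ("plan", 2, lambda: log.append("plan")),
        ("edit", 2, lambda: log.append("edit")),
        ("test", 2, lambda: log.append("test")),
    ]
    path = Path("agent_checkpoint.json")
    finished = run_with_checkpoints(steps, budget=3, checkpoint=path)
    print(finished, log)
```

A second call with a fresh budget and the same checkpoint file skips the steps that were already recorded and continues with the remaining ones.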
| Scenario | What is critical | What we usually choose |
|---|---|---|
| Routine product and engineering work | A stable general-purpose result | OpenAI or Anthropic, depending on the task format |
| Strictly structured outputs and complex schemas | Quality of JSON Schema handling | Be more careful with OpenAI if oneOf is critical |
| Long architectural breakdowns and difficult bugs | Depth of reasoning and analysis quality | More often Anthropic or the top-tier OpenAI models |
| A long agent cycle that must not stop halfway | Predictability of task completion | We check the usage limits of the chosen mode in advance |
Alternative providers may look attractive on token price. Sometimes the cost difference is indeed large. But in real development, token price is not the main measure of efficiency.
If the model holds context worse, writes weaker code, makes more architectural mistakes, and requires more manual finishing, then the same product takes more time. As a result, the final cost grows at the level of human effort: more iterations, more fixes, more team time spent stabilizing the result.
That is why in our assessment cheap providers often lose on total economics. Formally the token is cheaper, but the cost of delivering a comparable result turns out higher.
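The arithmetic behind this argument can be shown with a toy calculation. All numbers below are made up for illustration and are not measurements from our projects; `Option` and `delivery_cost` are hypothetical names. The model is deliberately crude: total cost is token spend plus the human hours spent on rework.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    price_per_mtok: float  # USD per million tokens (illustrative)
    mtok_used: float       # millions of tokens spent on the feature
    rework_hours: float    # human hours spent fixing the output


def delivery_cost(option: Option, hourly_rate: float) -> float:
    """Token spend plus the cost of manual rework."""
    return option.price_per_mtok * option.mtok_used + option.rework_hours * hourly_rate


if __name__ == "__main__":
    cheap = Option("cheap provider", price_per_mtok=1.0, mtok_used=20, rework_hours=10)
    strong = Option("strong provider", price_per_mtok=10.0, mtok_used=20, rework_hours=2)
    for option in (cheap, strong):
        print(option.name, delivery_cost(option, hourly_rate=60.0))
    # cheap provider 620.0
    # strong provider 320.0
```

With these assumed inputs the token bill of the cheaper option is ten times lower, yet the delivered feature costs almost twice as much, because rework hours dominate the total.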
gpt-5.4
The base working model for most tasks. We recommend not going below medium.
A good choice when you need a stronger engineering and product result in day-to-day work.
A strong working mode for a large class of engineering tasks where a stable result is needed without moving into the most expensive models.
A targeted boosted mode for architecture and those bugs that other models fail to fix for a long time.
In applied work, our base recommendation looks like this: use gpt-5.4 at medium or higher. gpt-5.4-high and sonnet-4.6 + thinking (high to extra high) handle their working tasks well. gpt-5.5-extra high and opus-4.7 make sense as a boosted mode for architecture and for bugs that other models have repeatedly failed to fix.
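This recommendation can be written down as a small routing table. The sketch is hypothetical: the tier names, the task classes, and the `candidates` helper are invented for the example, and the model strings are the labels used in this article, not verified API model identifiers.

```python
# Tiers as described in the recommendation above; labels are the article's own.
TIERS = {
    "base": ["gpt-5.4 (medium or higher)"],
    "working": ["gpt-5.4-high", "sonnet-4.6 + thinking (high to extra high)"],
    "boosted": ["gpt-5.5-extra high", "opus-4.7"],
}

# Hypothetical task classes mapped to tiers.
TASK_TO_TIER = {
    "routine": "base",
    "engineering": "working",
    "architecture": "boosted",
    "stubborn_bug": "boosted",
}


def candidates(task_class: str) -> list[str]:
    """Return candidate models for a task class, defaulting to the base tier."""
    return TIERS[TASK_TO_TIER.get(task_class, "base")]


if __name__ == "__main__":
    print(candidates("architecture"))
    print(candidates("something else"))
```

Keeping the mapping as data rather than scattered conditionals makes it easy to revise when the recommendation changes.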
If you choose a provider by actual work quality, then as of 09.05.2026 our choice remains between OpenAI and Anthropic. It is between these two providers that the practical choice now lies for teams that need a working result in product, code, and agent scenarios.
That is why we use both providers. We are not looking for one absolute winner. We choose the stack by task class, product constraints, and result requirements. In the current state of the market, this strategy gives the most stable practical outcome.