OpenAI released GPT 5.4 on Monday alongside a redesigned ChatGPT that the company is calling the ChatGPT super app. The model and the app are being treated as two halves of the same launch, which is new for OpenAI. In previous generations, the model landed first and the product caught up later. With 5.4 the product is the point of the release as much as the benchmark numbers, and the benchmark numbers are significant in their own right.
The headline capability is the context window. GPT 5.4 ships with a context window of one million tokens, a tenfold jump from the 100,000-token window the previous generation shipped with at general availability. In practical terms that is enough context to hold an entire midsize codebase, a long legal brief with full exhibits, or the complete run of a real user conversation across months of interaction. The jump is not quite unprecedented: Google's Gemini has been shipping long-context models for over a year, and a handful of specialized systems have reached similar numbers. But it is the first time OpenAI has matched that territory with a general-purpose flagship, and it closes a gap that has been visible in enterprise pipelines for most of 2025.
The Benchmarks That Matter
The benchmark OpenAI chose to lead with is OSWorld V, a 2026 successor to the OSWorld computer-use benchmark that measures a model's ability to execute long-running tasks inside a simulated desktop environment. GPT 5.4 scores 75 percent on the public split of OSWorld V, a jump of roughly fifteen points over the previous state of the art and a signal that autonomous multi-step workflows are moving from research demos into shipping products. The model also shows gains on the usual suspects, MMLU, GPQA, HumanEval, and the MATH benchmark, though the deltas there are smaller and less newsworthy than they used to be as the field approaches saturation on those tests.
The more interesting numbers are the ones OpenAI disclosed about agentic workloads specifically. GPT 5.4 can plan and execute sequences of tool calls with a success rate that, according to OpenAI's published evaluations, now exceeds the completion rate of a trained human operator working the same task alone. That claim needs independent verification, and it will get it over the next few weeks as the research community runs its own evaluations. If it holds, it is the clearest sign yet that frontier models have crossed from assistive to genuinely autonomous on a class of workflows that matter to enterprise buyers.
The Super App Is the Strategy
The ChatGPT super app is the product framing OpenAI has chosen for the release, and it is worth taking seriously as a strategic move rather than a marketing flourish. Previously, ChatGPT had separate surfaces for chat, code, image generation, voice, and the various agents OpenAI had rolled out in fits and starts. Users had to know which product to open for which task. In the super app, all of those surfaces collapse into a single interface. The model decides which capability to activate based on the prompt, and the user does not have to care. For 900 million weekly active users, most of whom are not building agents or writing code, the collapse is a real usability improvement.
The strategic reading is that OpenAI is consolidating its consumer surface before the rest of the field can build comparable products around the same capability. Anthropic has Claude, and Claude is excellent, but Anthropic has never tried to be a consumer super app. Google has Gemini inside every Google product, but the integration has been more workmanlike than ambitious. xAI has Grok, which is more of a chat product than anything else. In that field, a single unified ChatGPT surface with all of OpenAI's capabilities behind one door is a reasonable defensive play, and it is the move that best leverages the distribution advantage ChatGPT already has.
What Comes Next
The next chapter of the story is what GPT 5.4 enables in the enterprise channel. OpenAI's Azure and direct API revenue has been the fastest growing part of the business, and the kinds of workloads that justify a million-token context window are overwhelmingly enterprise workloads: long-document analysis, codebase-wide refactoring, multi-session customer support, and the categories of agent work that take a model hours to complete. GPT 5.4 is a product visibly designed to win those workloads, and the super app is only the tip of a much larger iceberg. OpenAI's $122 billion funding round, closed three days after the GPT 5.4 launch, is another visible piece of the same strategy. The two announcements sit together for a reason.