The gap between the leading large language models has narrowed to a degree that would have seemed unlikely eighteen months ago. As of late June 2026, the flagship models from Anthropic, OpenAI and Google are separated by single percentage points on the benchmarks most commonly used to compare them, and by roughly 2,500 basis points on the LMArena human-preference leaderboard, which ranks models using more than 6.8 million blind pairwise votes across more than 360 models. Claude Opus 4.8 holds the top accessible position at approximately 1,510 Elo, with GPT-5.5 Pro and Gemini 3.1 Pro Preview close behind. A year earlier, the spread between the first and fifth-ranked models on the same leaderboard was several times wider, according to the same tracking.

Release timeline

The current generation of flagship models arrived in quick succession during spring 2026. OpenAI released GPT-5.5 on April 23, 2026, with API availability the following day, and promoted it to the default ChatGPT model on May 5. The architecture is natively omnimodal, meaning text, image, audio and video are processed through a single model rather than a stack of separate systems bolted together. At launch, independent tracker llm-stats recorded GPT-5.5 scoring 88.6 percent on SWE-bench Verified, 69.2 percent on SWE-bench Pro and approximately 88.7 percent on MMLU, alongside a 60 percent reduction in hallucination rate relative to GPT-5.4.

Anthropic followed with Claude Opus 4.8, shipped May 28, 2026 as a direct upgrade to Opus 4.7 at unchanged pricing. The model is built around hybrid extended thinking, computer-use capability and integration with Claude Code, and its principal strength is coding: it leads the field on the harder SWE-bench Pro benchmark and holds first position on the LMArena coding leaderboard. Notably, Anthropic held the context window at a standard 200,000 tokens, prioritising reasoning quality and tool reliability over raw context length.

Google's contribution to the current cycle is Gemini 3.1 Pro, which launched February 19, 2026 and, according to independent evaluators, topped thirteen of sixteen major benchmarks at the time, including 80.6 percent on SWE-bench, 94.3 percent on GPQA Diamond, the highest recorded score for any model on that test, and 77.1 percent on ARC-AGI-2. Google has since layered faster variants on top of the Pro model; Gemini 3.5 Flash launched at Google I/O 2026 and recorded 84.2 percent on MMMU-Pro, the highest multimodal score recorded by evaluator Artificial Analysis at that point.

DeepSeek's most recent release sits apart from the other three in both architecture and business model. DeepSeek V4 was released April 24, 2026 as two preview models, V4-Pro and V4-Flash, both released under the MIT license with weights available for download from Hugging Face and ModelScope. V4-Pro carries 1.6 trillion total parameters with 49 billion activated per token, while V4-Flash carries 284 billion total parameters with 13 billion activated; both support a one-million-token context window by default. The architecture shift responsible for that context expansion is a move away from DeepSeek's earlier attention mechanism; a hybrid attention design combining compressed sparse attention and heavily compressed attention allows V4-Pro to require only 27 percent of the single-token inference compute and 10 percent of the key-value cache of its predecessor, V3.2, at the one-million-token setting.

Independent verification versus lab-reported figures

Self-reported benchmarks and independently verified ones do not always agree, and the divergence is largest for DeepSeek. The U.S. Center for AI Standards and Innovation, part of NIST, published a formal evaluation of DeepSeek V4 Pro in May 2026. CAISI's evaluation found that DeepSeek V4 scores better on the company's own self-reported evaluations than on CAISI's independent testing; according to DeepSeek's data the model is roughly as capable as Opus 4.6 and GPT-5.4, both released around two months earlier, but CAISI's evaluation, which incorporated non-public benchmarks, placed V4's performance closer to GPT-5, a model released roughly eight months prior. CAISI's broader finding was that the capability gap between the most advanced Chinese and U.S. models runs at approximately eight months, based on sixteen benchmarks across thirty-five models. On cost, the same evaluation found DeepSeek's efficiency advantage real but narrower than marketing claims suggest: compared with the most cost-competitive U.S. reference model, GPT-5.4 mini, DeepSeek V4 was more cost-efficient on five of seven tested benchmarks, ranging from 53 percent cheaper to 41 percent more expensive depending on the task.

On the specific benchmark most commonly cited for coding capability, the gap to the closed frontier is measurable. DeepSeek V4-Pro-Max scores 80.6 percent on SWE-bench Verified according to the llm-stats tracker, the highest result among open-weight models and tied with Gemini 3.1 Pro, but 8 points behind Claude Opus 4.8's 88.6 percent.

Pricing

The pricing spread across the four labs is the widest differentiator once benchmark scores converge. DeepSeek's official API lists V4-Flash at $0.14 per million input tokens and $0.28 per million output tokens, and V4-Pro at $0.435 per million input tokens and $0.87 per million output tokens. By comparison, Anthropic's Claude Opus 4.7 (the direct predecessor to 4.8) held Opus 4.6 pricing at $5 per million input tokens and $25 per million output tokens, while Google listed Gemini 3.1 Pro Preview at $2 input and $12 output per million tokens for prompts up to 200,000 tokens, rising to $4 and $18 above that threshold. Modelled against a realistic monthly workload of 100 million input tokens and 20 million output tokens, one industry analysis calculated a cost of approximately $1,100 using GPT-5.5 pricing versus $243.60 using DeepSeek V4, a reduction of roughly 78 percent for comparable output on that class of task, though the analysis noted this does not account for the benchmark gap on harder reasoning and agentic tasks.

Where each model currently leads

No single model wins across every category as of this writing. Independent testing summarised by Build Fast with AI found that GPT-5.5 leads on tasks requiring deep reasoning, including ARC-AGI-2 and long-context retrieval; Gemini's Flash tier leads on multimodal benchmarks and orchestrated multi-tool workflows; and Claude Opus 4.7 was, at the time of that comparison, the only model leading the hardest software engineering benchmark, SWE-Bench Pro, at 64.3 percent. Anthropic's Opus 4.8, released after that comparison, has since extended the coding lead further, per the LMArena tracking cited above.

Context window remains a structural point of difference rather than a simple performance metric. Google's Gemini line and DeepSeek's V4 both default to very large context windows (Gemini past one million tokens in its higher tiers, DeepSeek V4 fixed at one million as standard), while Anthropic has kept Claude's context window comparatively conservative on the reasoning that tool reliability and instruction-following degrade at scale in ways that raw window size does not fix.

A note on availability

Not every model in this category has been continuously available throughout the period covered here. Anthropic's Claude Fable 5 and Claude Mythos 5, released after Opus 4.8, were briefly taken offline; the U.S. Department of Commerce imposed export controls that suspended both models on June 12, 2026, before lifting those controls on June 30 and prompting Anthropic to restore Fable 5 across its consumer and developer products on July 1. Mythos 5 remains restricted to approved partners. The episode is a reminder that regulatory exposure, not just benchmark performance, increasingly shapes which models are actually reachable at a given moment.

Outlook

DeepSeek has signalled further iteration is coming. The company has indicated a formal, non-preview version of V4 is expected in the third quarter of 2026, and reports at launch indicated DeepSeek was raising external investment from Tencent and Alibaba totalling approximately $1.8 billion, primarily to establish a valuation reference for employee equity rather than to meet an immediate capital need. Given the pace of releases already observed in 2026, from three major flagship launches within a single April fortnight to Google's subsequent Flash-tier follow-up, the current standings are unlikely to hold for long. The clearer trend, independent of which lab is nominally ahead in a given month, is that benchmark performance across the top tier has compressed to the point where cost, context window and task-specific reliability now do more to determine model choice than raw leaderboard position.

Sources

  • LM Council, AI Model Benchmarks, July 2026 (lmcouncil.ai/benchmarks)

  • Tech Insider, "Claude vs ChatGPT vs Gemini 2026: 88% SWE-Bench, $2 API" (tech-insider.org)

  • Tech Insider, "ChatGPT vs Claude vs Gemini vs DeepSeek [2026]" (tech-insider.org)

  • Spectrum AI Lab, "Claude Opus 4.7 vs GPT-5.2 vs Gemini 3.1 Pro vs DeepSeek V4 [2026]" (spectrumailab.com)

  • Build Fast with AI, "Gemini 3.5 Flash vs GPT-5.5 vs Claude vs DeepSeek (2026)" (buildfastwithai.com)

  • DeepSeek API Docs, "DeepSeek V4 Preview Release" (api-docs.deepseek.com)

  • DeepSeek-AI, DeepSeek-V4-Pro model card, Hugging Face (huggingface.co)

  • Morph, "DeepSeek V4: 1.6T MoE, 1M Context, $0.87/M Output" (morphllm.com)

  • MindStudio, "DeepSeek V4 Launch: 4 Specs" and "DeepSeek V4: The Open-Source Model That Rivals Closed Frontier Models" (mindstudio.ai)

  • DataCamp, "DeepSeek V4: Features, Benchmarks, and Comparisons" (datacamp.com)

  • AI Stack Choice, "DeepSeek V4 Review 2026" (aistackchoice.com)

  • U.S. Center for AI Standards and Innovation (CAISI) / NIST, "CAISI Evaluation of DeepSeek V4 Pro," May 2026 (nist.gov)

  • BenchLM.ai, DeepSeek V4 Pro benchmark profile (benchlm.ai)

  • Anthropic, statement on Claude Fable 5 and Mythos 5 export control restoration (anthropic.com/news)

Keep Reading