Gemini's Multimodal Update vs GPT-5: Gaps Explained

Google Gemini’s multimodal update finally challenges GPT-5 on visual understanding. In June 2026, Google rolled out a major upgrade targeting Gemini’s longtime weak spot.

Until now, OpenAI held a clear edge in interpreting images, video and mixed media. However, that edge is starting to narrow.

What’s Actually New in Gemini’s Multimodal Update

First, Google pushed Gemini-3.1-flash-image and Gemini-3-pro-image to general availability in June. The update adds video-to-image generation, a fresh capability inside Gemini-3.1-flash-image.

Developers can now upload a video file or paste a YouTube link. Then, Gemini generates thumbnails, movie posters, or summary infographics automatically.

In addition, Google launched gemini-embedding-2, its first multimodal embedding model. The model maps text, images, video, audio, and PDFs into one shared space.
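To make the idea of one shared space concrete, here is a minimal sketch of cross-modal lookup by cosine similarity. It does not call any Google API: the three-dimensional vectors, the file names, and the `nearest` helper are all made up for illustration, and real embeddings would come from the model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three files of different media types.
# In a shared space, all of them are plain vectors of the same length.
index = {
    "quarterly_report.pdf": [0.9, 0.1, 0.0],
    "product_demo.mp4": [0.1, 0.9, 0.2],
    "support_call.wav": [0.0, 0.2, 0.9],
}

def nearest(query_vector, index):
    """Return the name of the indexed item most similar to the query."""
    return max(index, key=lambda name: cosine_similarity(query_vector, index[name]))

# A text query embedded into the same space (again, a made-up vector).
query = [0.2, 0.8, 0.1]
best_match = nearest(query, index)
print(best_match)
```

Because every media type lands in the same vector space, a single similarity function can rank a PDF, a video, and an audio file against one text query.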

This upgrade powers File Search, which returns visual citations alongside text. Meanwhile, Google retired Imagen completely, shifting every workflow toward Gemini models.

How Gemini’s Multimodal Update Closes the Benchmark Gap

Following the rollout, independent benchmark testing confirms real progress. Gemini 3.1 Pro scores 82.8 on multimodal and grounded tasks. GPT-5.5 trails significantly, posting only 70.4 in the same category.

Additionally, MMMU-Pro produces the widest gap between the two models. Gemini also edges ahead on the overall leaderboard, scoring 89 to 88.

However, the one-point overall margin is too thin to call a defeat for OpenAI. Gemini wins decisively in vision tasks, yet barely wins overall.
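The size of each lead is easy to check from the figures quoted above. The short sketch below only restates those reported scores. The dictionary layout and the `gap` helper are illustrative, not part of any benchmark tooling.

```python
# Scores as reported in this article: (Gemini 3.1 Pro, GPT-5.5).
scores = {
    "multimodal and grounded": (82.8, 70.4),
    "overall leaderboard": (89, 88),
}

def gap(gemini, gpt):
    """Gemini's lead in points, rounded to one decimal to hide float noise."""
    return round(gemini - gpt, 1)

gaps = {name: gap(*pair) for name, pair in scores.items()}
for name, points in gaps.items():
    print(f"{name}: Gemini leads by {points} points")
```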

GPT-5’s June Countermove

OpenAI answered swiftly with its own preview release, unveiling GPT-5.6 Sol, Terra, and Luna on June 26.

OpenAI chose depth over breadth for the release. Sol pushes coding performance further, setting a new benchmark record on Terminal-Bench 2.1. The model also strengthens biology research and shows measurable cybersecurity gains via ExploitBench.

Furthermore, Terra matches GPT-5.5’s performance while costing half as much. In short, OpenAI built GPT-5.6 for reasoning and agentic work, not visual tasks. The choice reveals where each company sees its strongest ground.

Where the Gap Still Holds

Despite Gemini’s gains, GPT-5.5 keeps a clear lead in pure reasoning tasks. It averages 85 points against Gemini’s 77.1.

CritPt shows the sharpest divide between the two models. Pricing shows an equally sharp divide, but in Google’s favor. GPT-5.5 costs $5 per million input tokens and $30 per million output tokens. Gemini 3.1 Pro charges only $2 per million input tokens and $12 per million output tokens.

Therefore, cost favors Google, while reasoning power still favors OpenAI. Buyers simply cannot pick one model and check every box.
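To see what those list prices mean for a real bill, here is a rough cost sketch. The per-million-token prices are the ones quoted in this article. The monthly token volumes and the `workload_cost` helper are hypothetical, and the calculation ignores any discounts, caching, or tiered pricing either vendor may offer.

```python
# Prices in USD per million tokens, as quoted in this article.
PRICES = {
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00},
}

def workload_cost(model, input_tokens, output_tokens):
    """USD cost of a workload at the quoted per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

# Hypothetical monthly workload: 200M input tokens, 40M output tokens.
gpt_cost = workload_cost("GPT-5.5", 200_000_000, 40_000_000)
gemini_cost = workload_cost("Gemini 3.1 Pro", 200_000_000, 40_000_000)

print(f"GPT-5.5: ${gpt_cost:,.2f}")
print(f"Gemini 3.1 Pro: ${gemini_cost:,.2f}")
print(f"Gemini share of GPT-5.5 cost: {gemini_cost / gpt_cost:.0%}")
```

Both the input and the output price differ by the same factor of 2.5, so at these list prices the Gemini bill comes to 40% of the GPT-5.5 bill whatever the mix of input and output tokens.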

Which Workflow Should Switch First

Teams running document, image, and video pipelines gain the most right now, because File Search across mixed media now works inside one unified API. Coding teams and security researchers, however, should wait before switching models.

GPT-5.6 remains available only through a limited partner preview, and OpenAI plans a broader rollout within weeks. Until then, GPT-5.5 keeps its edge in reasoning-heavy production work.

Most teams will likely run both models, routing tasks by strength. Ultimately, Gemini closes a real gap in vision, while OpenAI still holds its ground in reasoning.
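Routing by strength can be as simple as a lookup table. The sketch below is one possible reading of the strengths described in this article, not a recommended production setup. The task categories, the `pick_model` helper, and the choice of default model are all assumptions.

```python
# Hypothetical routing table based on the strengths described in this article.
ROUTES = {
    "image": "Gemini 3.1 Pro",
    "video": "Gemini 3.1 Pro",
    "document": "Gemini 3.1 Pro",
    "reasoning": "GPT-5.5",
    "coding": "GPT-5.5",
    "security": "GPT-5.5",
}

# Assumption: unknown task types fall back to the cheaper model.
DEFAULT_MODEL = "Gemini 3.1 Pro"

def pick_model(task_type):
    """Return the model name to use for a given task category."""
    return ROUTES.get(task_type.lower(), DEFAULT_MODEL)

for task in ["video", "Reasoning", "translation"]:
    print(f"{task} -> {pick_model(task)}")
```

A team would swap the GPT-5.5 entries for GPT-5.6 once the broader rollout arrives, without touching the calling code.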
