
Google Gemini’s multimodal update finally challenges GPT-5 on visual understanding. In June 2026, Google rolled out a major upgrade, targeting its longtime weak spot.
Until now, OpenAI held a clear edge in interpreting images, video and mixed media. However, that edge is starting to narrow.
What’s Actually New in Gemini’s Multimodal Update
First, Google pushed Gemini-3.1-flash-image and Gemini-3-pro-image to general availability in June. The update adds video-to-image generation, a fresh capability inside Gemini-3.1-flash-image.
Developers can now upload a video file or paste a YouTube link. Then, Gemini generates thumbnails, movie posters, or summary infographics automatically.
In addition, Google launched gemini-embedding-2, its first multimodal embedding model. The model maps text, images, video, audio, and PDFs into one shared space.
This upgrade powers File Search, which returns visual citations alongside text. Meanwhile, Google retired Imagen completely, shifting every workflow toward Gemini models.
How Gemini’s Multimodal Update Closes the Benchmark Gap
Following the rollout, independent benchmark testing confirms real progress. Gemini 3.1 Pro scores 82.8 on multimodal and grounded tasks. GPT-5.5 trails significantly, posting only 70.4 in the same category.
Additionally, MMMU-Pro produces the widest gap between the two models. Gemini also edges ahead on the overall leaderboard, scoring 89 to 88.
However, the margin stays thin enough to avoid calling it a defeat. Gemini wins decisively in vision tasks, yet barely wins overall.
GPT-5’s June Countermove
In response, OpenAI answered swiftly with its own preview release. The company unveiled GPT-5.6 Sol, Terra, and Luna on June 26.
However, OpenAI chose depth over breadth for the release. Sol pushes coding performance further, setting a new benchmark record on Terminal-Bench 2.1. The model also strengthens biology research and shows measurable cybersecurity gains via ExploitBench.
Furthermore, Terra matches GPT-5.5’s performance while costing half as much. Consequently, OpenAI built GPT-5.6 for reasoning and agentic work, not visual tasks. The choice reveals where each company sees its strongest ground.
Where the Gap Still Holds
Despite Gemini’s gains, GPT-5.5 keeps a clear lead in pure reasoning tasks. It averages 85 points against Gemini’s 77.1.
However, CritPt shows the sharpest divide between the two models and pricing tells a similar story. GPT-5.5 costs $5 per million input tokens and $30 per million output tokens. Gemini 3.1 Pro charges only $2 and $12 for the same workload.
Therefore, cost favors Google, while reasoning power still favors OpenAI. Buyers simply cannot pick one model and check every box.
Which Workflow Should Switch First
Overall, teams running document, image, and video pipelines gain the most right now. File search across mixed media now works inside one unified API. However, coding teams and security researchers should wait before switching models.
GPT-5.6 remains available only through a limited partner preview. OpenAI plans a broader rollout within weeks, not immediately. Until then, GPT-5.5 keeps its edge in reasoning-heavy production work.
Most teams will likely run both models, routing tasks by strength. Ultimately, Gemini closes a real gap in vision, while OpenAI still holds reasoning ground tightly.