Google Gemini 3: A Product Manager's Review of Google's Latest AI Model

Ankur Shrivastava

As an AI Product Manager, I've been evaluating Google's Gemini 3 since its launch on November 18, 2025. After testing it extensively against GPT-5.1 and Claude's latest models, I've put together an honest assessment of where Gemini 3 fits in your AI product stack, what it does exceptionally well, and where it falls short.


What is Gemini 3?

Gemini 3 is Google's latest flagship AI model, available in two variants:

Gemini 3 Pro - The everyday model optimized for general-purpose tasks, available immediately through Google AI Studio, Vertex AI, and the Gemini app.

Gemini 3 Deep Think - A specialized variant for complex reasoning tasks that require deeper contemplation, particularly strong in scientific and mathematical reasoning.

The model's core value proposition centers on three capabilities: best-in-class multimodal understanding (especially video), a unique "generative interfaces" feature that creates interactive visual outputs, and native integration with Google's ecosystem.


Performance: The Numbers That Matter

As a PM, I care about benchmarks only insofar as they predict real-world performance. Here's what stands out:

Mathematical and Scientific Reasoning

AIME 2025 (competition-level math): 95% without tools, 100% with code execution. This puts it roughly on par with GPT-5.1 (94%) and ahead of most competitors. For products requiring mathematical reasoning (financial modeling, scientific computing), this is production-ready.

GPQA Diamond (graduate-level science): 91.9% for Gemini 3 Pro, 93.8% for Deep Think. This is excellent—better than Claude Sonnet 4.5 (83.4%) and competitive with top models.

ARC-AGI-2 (abstract reasoning): Deep Think scores 45.1%, which is genuinely impressive for this notoriously difficult benchmark. This suggests strong generalization capabilities.

Multimodal Understanding

This is where Gemini 3 genuinely shines:

Video-MMMU: 87.6%. Industry-leading video understanding.

ScreenSpot-Pro (UI understanding): 72.7%, up from 11.4% in the previous version. This is a massive leap and makes it viable for UI automation use cases.

MMMU-Pro (multimodal understanding): 81%.

If your product involves video analysis, image understanding, or document processing, Gemini 3's multimodal capabilities are currently the best available.

Coding Capabilities

WebDev Arena: 1487 Elo rating (tops the leaderboard).

Terminal-Bench 2.0: 54.2%.

The coding performance is strong, though Claude Sonnet 4.5 (77.2% on SWE-bench Verified) still leads for autonomous software engineering tasks.


Standout Feature: Generative Interfaces

The most interesting product innovation in Gemini 3 is generative interfaces—the model can choose to respond not with text, but by creating interactive visual layouts. (MIT Technology Review)

What this means in practice:

  • Ask about loan options, and it might generate an interactive calculator
  • Request a comparison, and you get a structured table with filtering
  • Query physics concepts, and it creates interactive simulations

Product implications: This changes the UX paradigm from "chatbot" to "dynamic interface generator." For consumer products, this significantly improves information accessibility. For enterprise products, it reduces the need to build custom visualizations.

The catch: Interface quality is inconsistent—sometimes brilliant, sometimes basic. You can't fully control what format you'll get, which creates UX challenges. Users also need to discover this capability; it's not obvious from a chat interface.


Gemini Agent: The Agentic Vision

Google introduced Gemini Agent as an experimental feature that connects to Google apps (Gmail, Calendar, Drive) to handle multi-step workflows autonomously.

Current capabilities:

  • Email triage and draft responses
  • Calendar management
  • Multi-app research tasks
  • Travel planning

Product reality check:

  • Status: Experimental, US-only, requires Google AI Ultra subscription
  • Integration scope: Limited to Google apps
  • Maturity: Not production-ready for enterprise deployment

Compare this to Claude Sonnet 4.5's 30-hour autonomous runtime and 61.4% OSWorld score (generic computer use), and you'll see Gemini Agent is still early-stage. For product planning, treat this as a roadmap item rather than a current feature you can rely on.


Competitive Comparison: Where Gemini 3 Fits

Gemini 3 vs. GPT-5.1

GPT-5.1 (released November 13, 2025) positions itself around intelligent efficiency with adaptive reasoning that automatically allocates compute based on task complexity. (OpenAI)

Where Gemini 3 wins:

  • Multimodal capabilities: Gemini's video and image understanding significantly outperforms GPT-5.1
  • Context window: 1M tokens vs. GPT-5.1's undisclosed limit (likely smaller)
  • Generative interfaces: Unique to Gemini
  • Pricing: $2 input / $12 output per million tokens, plus a free tier in AI Studio

Where GPT-5.1 wins:

  • Speed: 2-3× faster response times
  • Token efficiency: Automatic optimization reduces token usage by up to 88% on simple tasks, making it significantly cheaper in practice
  • Ecosystem: More third-party integrations and established developer tools

PM decision framework:

  • Choose Gemini 3 if your product is multimodal-heavy (video platforms, document analysis, visual content) or you're deeply integrated with Google Workspace
  • Choose GPT-5.1 if you need fast response times, cost efficiency across varied workloads, or extensive third-party integrations

Gemini 3 vs. Claude Sonnet 4.5

Claude Sonnet 4.5 (released September 29, 2025) is Anthropic's autonomous software engineering specialist with production-grade coding capabilities. (Anthropic)

Where Gemini 3 wins:

  • Multimodal breadth: Better video understanding and broader modality support
  • Cost: 25% cheaper ($2/$12 vs. $3/$15 per million tokens)
  • Context window: 1M tokens vs. Claude's 200K
  • Generative interfaces: Unique feature

Where Claude Sonnet 4.5 wins:

  • Autonomous coding: 82% on SWE-bench Verified with parallel compute—this is production-ready for real codebases
  • Computer use: 61.4% OSWorld score enables desktop automation
  • Autonomous runtime: 30 hours for long-running tasks
  • Code quality: 0% error rate on code editing benchmarks
  • Safety: 98.7% safety score with transparent metrics

PM decision framework:

  • Choose Gemini 3 for consumer-facing multimodal apps, cost-sensitive deployments, or Google ecosystem integration
  • Choose Claude Sonnet 4.5 for autonomous coding agents, desktop automation, or when enterprise safety compliance is critical

Gemini 3 Deep Think vs. Claude Opus 4.1

Claude Opus 4.1 (released August 5, 2025) is Anthropic's flagship reasoning model for complex agentic tasks. (Anthropic)

Where Gemini 3 Deep Think wins:

  • Abstract reasoning: 45.1% ARC-AGI-2 (frontier capability)
  • Scientific reasoning: 93.8% GPQA Diamond
  • Context capacity: 5× larger (1M vs. 200K tokens)
  • Pricing: Same as Pro ($2/$12) vs. Opus's premium pricing

Where Claude Opus 4.1 wins:

  • Production agentic systems: More mature tool use
  • Code refactoring: Superior multi-file capabilities
  • Enterprise adoption: Available on AWS Bedrock and GCP Vertex AI with established enterprise credibility

PM decision framework:

  • Choose Gemini 3 Deep Think for scientific/research applications, large document analysis, or when budget is constrained
  • Choose Claude Opus 4.1 for production agentic systems or when enterprise reliability is non-negotiable

Developer Experience

API Control Parameters

Gemini 3 offers two interesting control parameters:

thinking_level (low/high): Controls how much internal reasoning the model performs

  • Low: Faster, cheaper responses
  • High: Deeper reasoning for complex tasks

media_resolution (low/medium/high): Controls visual processing fidelity

  • Low: Fast processing, lower token usage
  • High: Maximum visual detail

Product perspective: This gives you explicit cost/quality trade-off controls, which is valuable for production deployments. GPT-5.1, by contrast, allocates compute automatically, so there is nothing to tune. The trade-off is control vs. convenience.
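
To make that control concrete, here is a minimal sketch of how these two knobs might be set through the Gen AI SDK for Python (google-genai). The model ID and the exact config field names are my assumptions based on the parameter descriptions above, so verify them against the current API reference before building on this:

```python
# Minimal sketch of the cost/quality knobs described above, via the
# google-genai Python SDK. The model ID and config field names are
# assumptions -- check the current API docs before relying on them.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Cheap, fast path: shallow reasoning, low-fidelity media processing.
quick = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model ID
    contents="Summarize this support ticket in two sentences: ...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low"),  # assumed field
        media_resolution="MEDIA_RESOLUTION_LOW",                     # assumed enum value
    ),
)

# Careful path: deeper reasoning and maximum visual detail for hard tasks.
careful = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Walk through the failure modes in this architecture diagram: ...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high"),
        media_resolution="MEDIA_RESOLUTION_HIGH",
    ),
)

print(quick.text)
print(careful.text)
```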

Context Window

The 1M token context window is genuinely useful for:

  • Analyzing entire codebases
  • Processing lengthy legal documents
  • Maintaining long conversation histories
  • Multi-document synthesis

This is a real differentiator—most competitors offer 128K-200K tokens.
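
As a rough illustration of what that looks like in practice, the sketch below uploads a long document once and asks a question over the whole thing, with no chunking or retrieval layer. It assumes the same google-genai SDK, and the file path and model ID are placeholders:

```python
# Sketch: leaning on the 1M-token window instead of building a chunking
# pipeline. File path and model ID are placeholders.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Upload the source document once; the returned handle can be passed
# alongside the prompt in the contents list.
contract = client.files.upload(file="master_services_agreement.pdf")

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model ID
    contents=[
        contract,
        "List every clause that assigns liability to the vendor, citing section numbers.",
    ],
)
print(response.text)
```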

Available Integrations

Direct access:

  • Google AI Studio (free tier available)
  • Vertex AI (enterprise)
  • Gemini API
  • Gemini CLI

Ecosystem integration:

  • Native Google Workspace integration
  • Google Search integration
  • Limited third-party integrations compared to OpenAI

Pricing

Gemini 3 Pro:

  • Input: $2 per million tokens
  • Output: $12 per million tokens
  • Free tier available in AI Studio

Competitive context:

  • 25% cheaper than Claude Sonnet 4.5 ($3/$15)
  • Comparable sticker price to GPT-5.1, but GPT's efficiency features may make it cheaper in practice
  • Same pricing for both Pro and Deep Think variants

Cost considerations: For multimodal-heavy workloads, Gemini 3's optimization makes it significantly cheaper than competitors. For text-heavy workloads, GPT-5.1's efficiency might offer better unit economics despite similar pricing.
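
To put numbers on that, here is a back-of-the-envelope calculation at the published $2/$12 rates. The request volume and token profile are made up for illustration, not benchmarks:

```python
# Back-of-the-envelope unit economics at Gemini 3 Pro's published rates:
# $2 per million input tokens, $12 per million output tokens.
INPUT_RATE = 2.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 12.00 / 1_000_000  # dollars per output token

def monthly_cost(requests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend for a given volume and per-request token profile."""
    return requests * (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE)

# Illustrative profile: 500K requests/month, 3K input tokens per request
# (prompt plus retrieved context), 500 output tokens per request.
print(f"${monthly_cost(500_000, 3_000, 500):,.0f} per month")
# -> $3,000 input + $3,000 output = $6,000/month
```

Swapping in your own traffic assumptions makes the Gemini-vs-GPT-5.1 comparison concrete: if GPT-5.1's token optimization really does trim output on simple queries, sticker-price parity can flip in its favor.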


Real-World Use Cases and Product Fit

Where Gemini 3 Excels

1. Content Moderation Platforms

The video understanding capabilities make it ideal for automated content moderation. ScreenSpot-Pro's 72.7% score suggests it can handle UI-based moderation workflows.

2. Document-Heavy Enterprise Applications

Legal tech, compliance, research tools—anywhere you need to analyze large volumes of multimodal documents. The 1M context window is a game-changer here.

3. Consumer Education Apps

Generative interfaces make complex concepts more accessible through interactive visualizations. This is genuinely differentiated for education products.

4. Google Workspace Extensions

If you're building on top of Google Workspace, native integration and Gemini Agent (once it matures) will provide seamless workflows your users already understand.

5. Video Analytics Platforms

Media monitoring, surveillance, sports analytics—any product requiring video understanding at scale.

Where Competitors Are Better Choices

1. Autonomous Coding Agents

Choose Claude Sonnet 4.5. Its 82% SWE-bench Verified score (with parallel compute) and 0% code editing error rate are production-ready. Gemini hasn't disclosed its SWE-bench scores, which is telling.

2. Desktop Automation

Choose Claude Sonnet 4.5 with its 61.4% OSWorld score. It can actually control computer interfaces reliably.

3. Cost-Optimized Chatbots

Choose GPT-5.1. Its automatic efficiency optimization will save you money on simple queries, which constitute the majority of chatbot interactions.

4. Safety-Critical Enterprise Applications

Choose the Claude family. Anthropic's transparent safety metrics (98.7% safety score) and compliance certifications provide necessary enterprise trust.

5. Rapid Prototyping

Choose GPT-5.1. The 2-3× speed advantage accelerates development iteration cycles.


Product Limitations and Considerations

1. Ecosystem Lock-In Concerns

Gemini 3's tight Google integration is both a strength and a risk. Organizations wary of vendor lock-in may hesitate to adopt, even with superior multimodal capabilities. Cross-platform deployment is limited compared to Claude (available on AWS Bedrock, GCP Vertex AI) or GPT (Azure, various clouds).

2. Agent Maturity Gap

Gemini Agent's "experimental" status means you can't build production features on it yet. No SLAs, limited geographic availability, and uncertain roadmap. By contrast, Claude's agentic features are production-ready.

3. Safety Transparency

Google hasn't published safety benchmarks comparable to Anthropic's transparent metrics. For regulated industries (healthcare, finance), this lack of transparency is a blocker.

4. Code Automation Uncertainty

No disclosed SWE-bench scores suggest Gemini 3 trails Claude in autonomous coding. If your product roadmap includes AI coding features, plan accordingly.

5. Generative Interface Unpredictability

You can't control when the model generates an interface vs. text. This creates UX consistency challenges in production applications. You'll need fallback designs for both output types.


Who Should Use Gemini 3?

Strong Fit

Multimodal-first products: If video/image understanding is core to your value proposition, Gemini 3 is currently the best choice.

Google ecosystem users: Organizations already on Google Workspace get near-zero integration friction and will benefit most from Gemini Agent as it matures.

Cost-sensitive scale-ups: The 25% price advantage vs. Claude and multimodal optimization creates favorable unit economics for high-volume deployments.

Research and analysis tools: The 1M context window and strong scientific reasoning make it ideal for research-intensive applications.

Consumer-facing AI features: Generative interfaces provide differentiated UX for consumer products where exploration and discovery are valued.

Weak Fit

Autonomous coding products: Claude Sonnet 4.5's superior performance makes it the better choice.

Cross-platform requirements: Limited deployment options create vendor dependency.

Regulated industries: Lack of transparent safety metrics is a barrier.

Latency-critical applications: GPT-5.1's speed advantage matters for real-time use cases.

Desktop automation: Claude's computer use capabilities are more mature.


Technical Requirements

SDK Requirements:

  • Vertex AI: Gen AI SDK for Python version 1.51.0 or later
  • Google AI Studio: Browser-based, no SDK required
  • Standard REST API with OAuth authentication (Vertex AI) or API keys (AI Studio); client setup for both paths is sketched below
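
For reference, here is a hedged sketch of client setup for both access paths using the Gen AI SDK for Python (google-genai). The project, region, key, and model ID values are placeholders:

```python
# Sketch of the two access paths above, using the Gen AI SDK for Python
# (google-genai). Project, region, key, and model ID are placeholders.
from google import genai

# Vertex AI (enterprise): OAuth / application default credentials,
# scoped to a Cloud project and region.
vertex_client = genai.Client(
    vertexai=True,
    project="your-gcp-project",  # placeholder project
    location="us-central1",      # placeholder region
)

# Google AI Studio / Gemini API: API-key authentication.
studio_client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

for client in (vertex_client, studio_client):
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model ID
        contents="Reply with one word: ready?",
    )
    print(response.text)
```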

Current Limitations:

  • Some features region-locked (Gemini Agent US-only)
  • Free tier has rate limits
  • Deep Think not widely available yet
  • No self-hosting option (cloud-only)

Looking Forward

Near-Term Expectations (3-6 months)

Gemini Agent maturation: Expect enterprise features (audit logs, SLAs) as Google moves from experimental to production. This is critical for B2B adoption.

Cost optimization features: Likely to see prompt caching and batch processing to compete with GPT-5.1's efficiency advantages.

Safety transparency: Google will need to publish safety benchmarks to compete for enterprise deals in regulated industries.

Long-Term Product Direction

Expanded modalities: Real-time video, audio generation, 3D understanding to extend the multimodal moat.

Cross-platform agents: Reducing Google-only integration constraints to expand addressable market.

Vertical-specific models: Healthcare, finance, legal variants with compliance certifications to capture high-value enterprise segments.


The Bottom Line

Gemini 3 is the best choice for multimodal-heavy applications, particularly those involving video, images, and documents. The generative interfaces feature is genuinely innovative, though implementation challenges remain. For organizations already invested in Google's ecosystem, it's a natural fit with improving integration over time.

However, it's not a universal solution. If your product priorities are autonomous coding, desktop automation, or maximum cost efficiency for text workloads, competitors offer better options. The experimental status of Gemini Agent means agentic features won't be production-ready for B2B use cases for several months.

As a PM, here's my framework: Match model capabilities to your product's critical path. Don't choose based on benchmark bragging rights—choose based on where the model's strengths align with your highest-value features. Gemini 3's multimodal excellence and generous context window make it exceptional for specific use cases, even if it's not the best all-around model.

The AI platform market is segmenting by use case, not consolidating around a single winner. Build a multi-model strategy where you use each platform for what it does best. Gemini 3 earns its place in that stack for multimodal workloads—and that's exactly where you should deploy it.