Agentic Vision

Agentic vision is a visual reasoning pattern where a model uses intermediate actions such as zooming, pointing, and code execution to inspect an image and compute a more accurate answer.

Key points

  • Google DeepMind says Gemini Robotics-ER 1.6 uses agentic vision to achieve accurate instrument readings [src-039].
  • The model first zooms into an image to read small gauge details, then uses pointing and code execution to estimate proportions and intervals [src-039].
  • It combines those intermediate steps with world knowledge to interpret the final meaning of the instrument reading [src-039].
  • In the reported benchmark, instrument reading evaluations were run with agentic vision enabled, except for Gemini Robotics-ER 1.5, which does not support it [src-039].
  • Agentic vision extends the ReAct (Reason + Act) loop idea into perception: instead of classifying an image once, the model performs structured visual substeps before answering [src-039].
  • Back to Engineering's physical-AI cluster shows the practical data side of this pattern: robot vision and sensor systems need capture, replay, and inspection tooling before perception errors can be debugged [src-076].
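The zoom, point, and code-execution substeps above can be sketched in miniature. This is an illustrative toy, not the model's actual tooling: the function names, the image representation, and the 0-100 PSI gauge are all assumptions made for the example.

```python
# Hypothetical sketch of the agentic-vision substeps: zoom into a
# region of interest, take a pointed needle position, then run code
# to map pixel geometry onto the instrument's scale.

def zoom(image, box):
    """Crop a region of interest (x0, y0, x1, y1) from a row-major image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def gauge_reading(needle_frac, scale_min, scale_max):
    """Code-execution step: map the needle's fractional sweep position
    (0.0 = scale_min, 1.0 = scale_max) onto the instrument's scale."""
    return scale_min + needle_frac * (scale_max - scale_min)

# Toy 6x6 "image"; zooming isolates the dial before measuring.
image = [[0] * 6 for _ in range(6)]
dial = zoom(image, (1, 1, 5, 5))  # 4x4 region of interest
assert len(dial) == 4 and len(dial[0]) == 4

# Pointing (here simply given) says the needle sits 35% of the way
# along a hypothetical 0-100 PSI scale.
reading = gauge_reading(0.35, 0.0, 100.0)
print(reading)  # 35.0
```

The point of the sketch is that the final answer comes out of an explicit computation over intermediate observations rather than a single end-to-end prediction, which is what distinguishes agentic vision from one-shot image classification.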

Source references

  • [src-039] Laura Graesser and Peng Xu – "Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning" (2026-04-14)
  • [src-076] Back to Engineering (iulia) – physical AI, robotics, and data science cluster (41 videos, 2018-12-16 to 2026-05-10)