Agentic Vision

Agentic vision is a visual reasoning pattern where a model uses intermediate actions such as zooming, pointing, and code execution to inspect an image and compute a more accurate answer.

Key points

  • Google DeepMind says Gemini Robotics-ER 1.6 uses agentic vision to achieve accurate instrument readings [src-039].
  • The model first zooms into an image to read small gauge details, then uses pointing and code execution to estimate proportions and intervals [src-039].
  • It combines those intermediate steps with world knowledge to interpret the final meaning of the instrument reading [src-039].
  • In the reported benchmark, instrument reading evaluations were run with agentic vision enabled, except for Gemini Robotics-ER 1.5, which does not support it [src-039].
  • Agentic vision extends the ReAct (Reason + Act) loop idea into perception: instead of classifying an image once, the model performs structured visual substeps before answering [src-039].
  • Back to Engineering's physical-AI cluster shows the practical data side of this pattern: robot vision and sensor systems need capture, replay, and inspection tooling before perception errors can be debugged [src-076].
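The zoom, point, and code-execution substeps above can be sketched in miniature. This is an illustrative toy, not the model's actual tooling: the function names, the image representation, and the 0-100 PSI gauge are all assumptions made for the example.

```python
# Hypothetical sketch of the agentic-vision substeps: zoom into a
# region of interest, take a pointed needle position, then run code
# to map pixel geometry onto the instrument's scale.

def zoom(image, box):
    """Crop a region of interest (x0, y0, x1, y1) from a row-major image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def gauge_reading(needle_frac, scale_min, scale_max):
    """Code-execution step: map the needle's fractional sweep position
    (0.0 = scale_min, 1.0 = scale_max) onto the instrument's scale."""
    return scale_min + needle_frac * (scale_max - scale_min)

# Toy 6x6 "image"; zooming isolates the dial before measuring.
image = [[0] * 6 for _ in range(6)]
dial = zoom(image, (1, 1, 5, 5))  # 4x4 region of interest
assert len(dial) == 4 and len(dial[0]) == 4

# Pointing (here simply given) says the needle sits 35% of the way
# along a hypothetical 0-100 PSI scale.
reading = gauge_reading(0.35, 0.0, 100.0)
print(reading)  # 35.0
```

The point of the sketch is that the final answer comes out of an explicit computation over intermediate observations rather than a single end-to-end prediction, which is what distinguishes agentic vision from one-shot image classification.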

Source references

  • [src-039] Laura Graesser and Peng Xu – "Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning" (2026-04-14)
  • [src-076] Back to Engineering (iulia) – physical AI, robotics, and data science cluster (41 videos, 2018-12-16 to 2026-05-10)