Agentic Vision
Agentic vision is a visual reasoning pattern in which a model takes intermediate actions, such as zooming, pointing, and code execution, to inspect an image and compute a more accurate answer.
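The control flow behind this pattern can be sketched as a small action loop. This is a hypothetical illustration, not an actual Gemini API: the `Action` type, the `crop` helper, and the `run_agentic_vision` driver are all assumed names, and the "model" is any callable that proposes the next visual action given what it has observed so far.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str       # "zoom", "point", "execute", or "answer"
    payload: object


def crop(image, box):
    """Toy zoom: slice a 2D grid of 'pixels' to (left, top, right, bottom)."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def run_agentic_vision(model, image, question, max_steps=5):
    """Let the model take intermediate visual actions before answering.

    The model is called with the current image, the question, and the
    accumulated observations; it returns the next Action. The loop ends
    when the model emits an "answer" action or the step budget runs out.
    """
    observations = []
    for _ in range(max_steps):
        action = model(image, question, observations)
        if action.kind == "answer":
            return action.payload
        if action.kind == "zoom":
            image = crop(image, action.payload)       # inspect a sub-region
            observations.append(("zoomed", action.payload))
        elif action.kind == "point":
            observations.append(("point", action.payload))
        elif action.kind == "execute":
            observations.append(("result", action.payload()))
    return None
```

The key design point is that zooming rewrites the working image, so later actions (pointing, computation) operate on the enlarged detail rather than the original full frame.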
Key points
- Google DeepMind says Gemini Robotics-ER 1.6 uses agentic vision to achieve accurate instrument readings [src-039].
- The model first zooms into an image to read small gauge details, then uses pointing and code execution to estimate proportions and intervals [src-039].
- It combines those intermediate steps with world knowledge to interpret the final meaning of the instrument reading [src-039].
- In the reported benchmark, instrument reading evaluations were run with agentic vision enabled, except for Gemini Robotics-ER 1.5, which does not support it [src-039].
- Agentic vision extends the ReAct Loop (Reason + Act) idea into perception: instead of classifying an image in a single pass, the model performs structured visual substeps before answering [src-039].
- Back to Engineering's physical-AI cluster shows the practical data side of this pattern: robot vision and sensor systems need capture, replay, and inspection tooling before perception errors can be debugged [src-076].
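The "code execution" substep mentioned above can be made concrete with a minimal sketch. Assuming zooming and pointing have already located the needle angle and the dial's min/max scale marks (all function and parameter names here are illustrative, not from the source), the reading reduces to a linear interpolation along the dial sweep:

```python
def gauge_reading(needle_deg, min_deg, max_deg, min_value, max_value):
    """Interpolate an analog gauge value from the needle angle.

    Angles are measured in degrees along the sweep of the dial;
    values are in the instrument's units (e.g. PSI). The needle's
    fractional position between the min and max marks maps linearly
    onto the value range.
    """
    fraction = (needle_deg - min_deg) / (max_deg - min_deg)
    return min_value + fraction * (max_value - min_value)


# A dial sweeping from 225 deg (0 PSI) to -45 deg (100 PSI),
# with the needle pointing straight up at 90 deg:
print(gauge_reading(90, 225, -45, 0, 100))  # 50.0
```

World knowledge then interprets the number: whether 50 PSI is nominal or alarming depends on the instrument, which is why the model combines the computed proportion with its prior knowledge rather than stopping at the raw value.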
Related concepts
- Embodied Reasoning
- Robotic Instrument Reading
- Agentic AI
- ReAct Loop (Reason + Act)
- Context Quality Engineering
- Physical AI
- Robotics Data Loop
- Edge Robotics