# VLM Vision Detection
When traditional locators fail, AXTerminator can use AI vision models to find elements by natural language description.
## Supported Backends
| Backend | Speed | Privacy | Cost |
|---|---|---|---|
| MLX | Fast (~50ms) | Local | Free |
| Ollama | Medium (~200ms) | Local | Free |
| Anthropic | Slow (~1s) | Cloud | $$$ |
| OpenAI | Slow (~1s) | Cloud | $$$ |
| Gemini | Slow (~1s) | Cloud | $$ |
## Configuration
```python
import axterminator as ax

# Local MLX (recommended)
ax.configure_vlm(backend="mlx")

# Local Ollama
ax.configure_vlm(backend="ollama", model="llava")

# Claude Vision
ax.configure_vlm(
    backend="anthropic",
    api_key="sk-ant-...",
)

# OpenAI Vision
ax.configure_vlm(
    backend="openai",
    api_key="sk-...",
)

# Gemini Vision
ax.configure_vlm(
    backend="gemini",
    api_key="...",
)
```
## Usage
```python
# Natural language element description
button = app.find("the blue Save button in the toolbar")

# Works with complex descriptions
menu = app.find("the dropdown menu showing 'File' options")
icon = app.find("the red notification badge on the bell icon")
```
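Since VLM lookups are slower than conventional locators, a common pattern is to chain strategies and only fall back to a natural-language description when cheaper locators miss. A minimal sketch under assumptions: `find_with_fallback` and `_StubApp` are illustrative helpers, not part of AXTerminator's API, and the broad `except Exception` is a guess since the docs do not specify what `app.find` raises on a miss:

```python
def find_with_fallback(app, locators):
    """Try each locator in order; return the first match.

    `locators` can mix conventional locators with natural-language
    descriptions, so the VLM is only consulted when earlier entries fail.
    """
    last_err = None
    for locator in locators:
        try:
            return app.find(locator)
        except Exception as err:  # assumption: find() raises on a miss
            last_err = err
    raise last_err


class _StubApp:
    """Tiny stand-in for an app handle, so the sketch runs on its own."""

    def find(self, locator):
        if locator == "the blue Save button in the toolbar":
            return "AXButton('Save')"
        raise LookupError(locator)


# The first locator misses; the natural-language description resolves.
element = find_with_fallback(
    _StubApp(),
    ["#save-btn", "the blue Save button in the toolbar"],
)
```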
## How It Works
1. AXTerminator takes a screenshot of the app.
2. It sends the screenshot to the VLM along with your description.
3. The VLM returns bounding-box coordinates.
4. AXTerminator maps those coordinates back to an accessibility element.
```text
[Screenshot] + "Find the blue Save button"
        ↓
    [VLM Model]
        ↓
{x: 450, y: 120, width: 80, height: 30}
        ↓
  [Element Mapping]
        ↓
  AXButton("Save")
```
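The final mapping step above can be sketched as picking the accessibility element whose frame best overlaps the VLM's bounding box. This is an illustrative stand-in, not AXTerminator's actual mapping code; `Element` and `map_bbox_to_element` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Simplified accessibility element: role, title, and screen frame."""
    role: str
    title: str
    x: int
    y: int
    width: int
    height: int


def map_bbox_to_element(bbox, elements):
    """Return the element whose frame most overlaps the VLM bounding box.

    `bbox` is a dict with x, y, width, height, as in step 3 above.
    Returns None when no element intersects the box at all.
    """
    def overlap_area(e):
        ox = max(0, min(bbox["x"] + bbox["width"], e.x + e.width)
                 - max(bbox["x"], e.x))
        oy = max(0, min(bbox["y"] + bbox["height"], e.y + e.height)
                 - max(bbox["y"], e.y))
        return ox * oy

    best = max(elements, key=overlap_area)
    return best if overlap_area(best) > 0 else None


elements = [
    Element("AXButton", "Save", 450, 120, 80, 30),
    Element("AXButton", "Cancel", 540, 120, 80, 30),
]
hit = map_bbox_to_element(
    {"x": 450, "y": 120, "width": 80, "height": 30}, elements
)
```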
## Performance Tips
- **Use MLX locally**: ~50 ms per query versus ~1 s for cloud backends.
- **Be specific**: "the blue Save button in the toolbar" resolves more reliably than "save button".
- **Use VLM as a fallback**: try accessibility locators first and reach for vision detection only when they fail.
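To check the latency figures above on your own machine, you can wrap a lookup in a timer. A hedged sketch: `timed_find` and `_EchoApp` are illustrative helpers, and `app` is anything exposing the `find` method shown earlier:

```python
import time


def timed_find(app, description):
    """Resolve `description` via app.find and report wall-clock latency."""
    start = time.perf_counter()
    element = app.find(description)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"resolved {description!r} in {elapsed_ms:.1f} ms")
    return element, elapsed_ms


class _EchoApp:
    """Stand-in app handle so the sketch runs without a real target."""

    def find(self, description):
        return f"matched: {description}"


element, ms = timed_find(_EchoApp(), "the blue Save button")
```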