VLM Vision Detection

When traditional locators fail, AXTerminator can use a vision-language model (VLM) to find elements from a natural-language description.

Supported Backends

Backend	Speed	Privacy	Cost
MLX	Fast (~50ms)	Local	Free
Ollama	Medium (~200ms)	Local	Free
Anthropic	Slow (~1s)	Cloud	$$$
OpenAI	Slow (~1s)	Cloud	$$$
Gemini	Slow (~1s)	Cloud	$$

Configuration

import axterminator as ax

# Local MLX (recommended)
ax.configure_vlm(backend="mlx")

# Local Ollama
ax.configure_vlm(backend="ollama", model="llava")

# Claude Vision
ax.configure_vlm(
    backend="anthropic",
    api_key="sk-ant-..."
)

# OpenAI Vision
ax.configure_vlm(
    backend="openai",
    api_key="sk-..."
)

# Gemini Vision
ax.configure_vlm(
    backend="gemini",
    api_key="..."
)
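
The cloud examples above pass the API key as a string literal, which is easy to leak through version control. A minimal sketch of reading the key from the environment instead follows. The helper `vlm_kwargs_from_env` and the environment variable names are assumptions made for illustration; they are not part of AXTerminator.

```python
import os

# Hypothetical mapping from backend name to the environment variable that
# holds its API key. The variable names are assumptions; use your own.
API_KEY_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def vlm_kwargs_from_env(backend, env=None):
    """Build keyword arguments for configure_vlm without hard-coding secrets."""
    env = os.environ if env is None else env
    kwargs = {"backend": backend}
    var = API_KEY_ENV_VARS.get(backend)
    if var is not None:
        # Cloud backend: fail early with a clear message if the key is missing.
        if not env.get(var):
            raise RuntimeError(f"Set {var} before using the {backend} backend")
        kwargs["api_key"] = env[var]
    return kwargs
```

The resulting dictionary could then be passed along as `ax.configure_vlm(**vlm_kwargs_from_env("anthropic"))`. Local backends such as `mlx` need no key, so the helper returns only the backend name for them.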

Usage

# `app` is an application object obtained earlier in your script
# Natural language element description
button = app.find("the blue Save button in the toolbar")

# Works with complex descriptions
menu = app.find("the dropdown menu showing 'File' options")
icon = app.find("the red notification badge on the bell icon")

How It Works

1. AXTerminator takes a screenshot of the app.
2. It sends the screenshot to the VLM along with your description.
3. The VLM returns bounding-box coordinates.
4. AXTerminator maps those coordinates to an accessibility element.

[Screenshot] + "Find the blue Save button"
        ↓
    [VLM Model]
        ↓
{x: 450, y: 120, width: 80, height: 30}
        ↓
    [Element Mapping]
        ↓
    AXButton("Save")
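
The element-mapping step can be illustrated with a small sketch. The source does not say how AXTerminator performs this mapping, so the approach below (pick the element whose frame has the highest intersection-over-union with the VLM's box) is a swapped-in technique. The `Rect` type, the `map_box_to_element` function, and the `min_iou` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    width: float
    height: float

    def area(self) -> float:
        return max(self.width, 0.0) * max(self.height, 0.0)

    def intersection_area(self, other: "Rect") -> float:
        left = max(self.x, other.x)
        top = max(self.y, other.y)
        right = min(self.x + self.width, other.x + other.width)
        bottom = min(self.y + self.height, other.y + other.height)
        # Non-overlapping rectangles give a negative extent; clamp it to zero.
        return max(right - left, 0.0) * max(bottom - top, 0.0)


def iou(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two rectangles, in the range [0, 1]."""
    inter = a.intersection_area(b)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def map_box_to_element(box, elements, min_iou=0.3):
    """Return the label of the element frame that best matches `box`.

    `elements` is a sequence of (label, Rect) pairs standing in for
    accessibility elements. Returns None if nothing overlaps well enough.
    """
    best_label, best_score = None, 0.0
    for label, frame in elements:
        score = iou(box, frame)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_iou else None


# Stand-in accessibility frames; the VLM box is the one from the diagram above.
elements = [
    ("AXToolbar", Rect(0, 100, 1000, 60)),
    ("AXButton(Save)", Rect(448, 118, 84, 34)),
    ("AXButton(Open)", Rect(540, 118, 84, 34)),
]
match = map_box_to_element(Rect(450, 120, 80, 30), elements)
```

Using overlap ratio rather than a simple "contains the center point" test keeps a large container such as the toolbar from winning over the button it contains.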

Performance Tips

1. Use MLX locally: roughly 50 ms per lookup versus about 1 s for cloud backends.
2. Be specific: "blue Save button in toolbar" works better than "save button".
3. Use it as a fallback: list the VLM strategy after the cheaper strategies.

config = ax.HealingConfig(
    strategies=["data_testid", "title", "visual_vlm"]
)
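
The reason for putting the VLM strategy last can be shown with a short sketch of an ordered fallback chain. This is not AXTerminator's implementation; `find_with_fallback` and the three stand-in locators are hypothetical, and they record their calls only to show that the slow strategy runs when the cheaper ones miss.

```python
def find_with_fallback(query, strategies):
    """Try each (name, locator) pair in order.

    Returns (name, element) for the first locator that finds something,
    or None if every strategy misses.
    """
    for name, locate in strategies:
        element = locate(query)
        if element is not None:
            return name, element
    return None


calls = []  # Records which stand-in locators were invoked, in order.


def by_testid(query):
    calls.append("data_testid")
    return None  # Pretend the app exposes no test IDs.


def by_title(query):
    calls.append("title")
    return "AXButton(Save)" if query == "Save" else None


def by_vlm(query):
    calls.append("visual_vlm")
    return "AXButton(from VLM)"  # Stand-in for the slow visual lookup.


strategies = [
    ("data_testid", by_testid),
    ("title", by_title),
    ("visual_vlm", by_vlm),
]
```

With this ordering, a query that matches by title never reaches the visual lookup, so the per-call VLM latency is paid only when nothing cheaper works.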

Debugging

# Enable verbose VLM logging
ax.configure_vlm(backend="mlx", verbose=True)

# Manual detection
result = ax.detect_element_visual(
    app.screenshot(),
    "the search icon"
)
print(result)  # e.g. {"x": 100, "y": 50, "confidence": 0.95}
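
When inspecting manual detections, it can help to reject low-confidence results before acting on them. The sketch below assumes the result is a dictionary shaped like the printed example, optionally with the `width` and `height` keys shown in the bounding-box diagram. The `click_point` helper and the 0.8 threshold are hypothetical.

```python
def click_point(result, min_confidence=0.8):
    """Return the (x, y) point to act on, or None if the detection is too uncertain."""
    if result is None or result.get("confidence", 0.0) < min_confidence:
        return None
    # If the result carries a size, aim for the center of the box;
    # otherwise treat (x, y) as the target point itself.
    x = result["x"] + result.get("width", 0) / 2
    y = result["y"] + result.get("height", 0) / 2
    return (x, y)
```

A `None` return is a signal to refine the description or fall back to another strategy rather than clicking on a doubtful match.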