
Vision (See)

The Vision feature lets you capture anything on your screen — a screenshot, a selected region, or a window — and ask an AI about it. The response can be injected at your cursor, copied to the clipboard, or sent to Quick Chat, without leaving your workflow.

TIP
Vision pairs naturally with your voice. After capturing, you can record a spoken question (e.g. *"What does this error mean?"* or *"Summarise the data in this table"*) before the AI processes the image.

Starting a Vision Capture

Press the Vision hotkey from any application:

Ctrl + Alt + S (Default)

The screen dims and a transparent snipping overlay appears with a hint bar at the bottom.

Capture Modes

| Input | What happens |
| --- | --- |
| Click on a window | Captures that specific application window |
| Click and drag | Captures a custom rectangular region |
| Press F | Captures the full active monitor |
| Press A | Captures all monitors as one wide image |
| Press Esc | Cancels and returns to your work |

After making a selection the overlay closes and the Vision Action panel appears.

The Vision Action Panel

The panel lets you type (or record) an optional question and choose what the AI should do with the screenshot.

Optional Query

The text field accepts your question. You can type it, or click the microphone button to record a voice query (up to 30 seconds). dIKta.me transcribes your question and fills the field automatically.

If you leave the field empty, the default query is used: "Describe what you see and extract any visible text."

Local / Cloud Toggle

Switch between Local (your configured Ollama vision model, runs on-device) and Cloud (Gemini, Claude, or OpenAI) for each individual capture.

Action Buttons

| Button | What it does |
| --- | --- |
| Save | Saves the screenshot to a file and copies it to the clipboard. No AI involved. |
| Clipboard | Sends the image + query to the AI and copies the response to your clipboard. |
| Chat | Attaches the screenshot to Quick Chat so you can have a multi-turn conversation about it. |
| Note | Runs the vision pipeline and records a voice note appended to your notes file. |
| OCR | Extracts all visible text from the screenshot exactly as it appears, and copies it to your clipboard. |
| Table | Extracts tabular data as tab-separated values (TSV), ready to paste into Excel or Sheets. |
| Color | Opens the Color Picker on the captured screenshot. |
| Record | Starts a video recording of the selected region. |
NOTE
**Table** always uses the cloud provider regardless of your Local/Cloud toggle, because local models produce unreliable structured output for this task.
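Besides pasting into a spreadsheet, the TSV that the Table action places on your clipboard is easy to consume programmatically. A minimal sketch using Python's standard `csv` module (the sample data below is illustrative, not actual tool output):

```python
import csv
import io

# Illustrative TSV, shaped like what the Table action copies to the clipboard
tsv = "Name\tQty\tPrice\nWidget\t3\t4.50\nGadget\t1\t9.99"

# csv.reader with a tab delimiter turns each line into a list of cells
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
header, data = rows[0], rows[1:]
print(header)   # ['Name', 'Qty', 'Price']
print(data[0])  # ['Widget', '3', '4.50']
```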

Color Picker

Ctrl + Alt + C also opens the Color Picker directly, without going through the Vision overlay.

Once the overlay is open:

  • Move the mouse to see a live magnifier showing the exact pixel colour under the cursor, with its hex and RGB values.
  • Click to pick a colour. Picked colours accumulate in a palette strip at the bottom.
  • Backspace to undo the last pick.
  • Enter to finish and copy all picked colours to the clipboard.
  • Tab to finish and send the palette to the AI for a colour analysis.
  • Esc to cancel (if you haven't picked any colours yet) or finish with the current palette.
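The hex and RGB values shown in the magnifier are two notations for the same pixel. If you want to convert between them yourself, a small sketch (these helpers are hypothetical, not part of the app):

```python
def rgb_to_hex(r: int, g: int, b: int) -> str:
    """Format an RGB triple as the #RRGGBB hex string shown in the magnifier."""
    return f"#{r:02X}{g:02X}{b:02X}"

def hex_to_rgb(code: str) -> tuple[int, int, int]:
    """Parse a #RRGGBB string back into an (r, g, b) triple."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in range(0, 6, 2))

print(rgb_to_hex(255, 87, 51))  # #FF5733
print(hex_to_rgb("#FF5733"))    # (255, 87, 51)
```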

Video Recording (Capture Moments)

Start a recording from the Vision Action panel by clicking Record, or use your configured video hotkey.

The snipping overlay appears so you can select a region or full screen. Once confirmed, a compact floating bar appears at the top of the screen showing:

  • A blinking red dot and elapsed timer
  • A Pause / Resume button
  • A Stop button

Recording captures your screen, microphone audio, and system audio simultaneously. An optional webcam bubble (picture-in-picture, bottom-right corner) can be enabled in Settings.

The default maximum recording duration is 120 seconds.

After Recording

When you click Stop, a post-capture panel appears:

| Button | What it does |
| --- | --- |
| Save | Saves the MP4 file locally. No AI processing. |
| Describe | Uploads the clip to Gemini and returns a description of what happened. |
| Document | Asks Gemini to write step-by-step instructions for the actions shown. |
| Bug Report | Asks Gemini to produce a structured bug report based on what it sees. |
| Chat | Attaches the clip to Quick Chat for a multi-turn conversation. |
NOTE
Video AI actions require a cloud connection. The **Save** action always works offline.

Output

By default, Vision responses are injected at your cursor, exactly like Dictation. You can change the default per-action in Settings → Vision:

| Mode | Behaviour |
| --- | --- |
| Inject (default) | Response is typed into the active window at the cursor. |
| Clipboard | Response is copied to the clipboard. A toast notification confirms. |
| Toast Only | Response is shown in a Windows notification. Nothing is written or copied. |

Local Vision Models

If you have Ollama installed, you can process images entirely on your own machine. Recommended models:

| Model | Ollama command | VRAM | Best for |
| --- | --- | --- | --- |
| minicpm-v (default) | `ollama pull minicpm-v` | ~2 GB | General use, OCR, description |
| moondream | `ollama pull moondream` | ~1.2 GB | Fast descriptions on low-VRAM hardware |
| llava-phi3 | `ollama pull llava-phi3` | ~2.5 GB | Stronger reasoning |

Configure your local vision model in Settings → Vision → Local Vision Model.
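If you want to reproduce a local vision query outside the app, Ollama's HTTP API accepts a prompt plus base64-encoded images on its `/api/generate` endpoint. A sketch of the request payload (the endpoint and field names are standard Ollama; the image bytes here are a placeholder, and this is not how the app itself talks to Ollama):

```python
import base64
import json

# Placeholder screenshot bytes; normally: open("capture.png", "rb").read()
image_bytes = b"\x89PNG..."

payload = {
    "model": "minicpm-v",
    "prompt": "Describe what you see and extract any visible text.",
    # Ollama expects images as a list of base64-encoded strings
    "images": [base64.b64encode(image_bytes).decode("ascii")],
    "stream": False,
}

# POST this JSON to http://localhost:11434/api/generate
print(json.dumps(payload)[:60])
```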