Skip to content

meow:multimodal

Image, video, audio, and document analysis via Gemini API with script-driven processing and context budget management.

What This Skill Does

Claude can't natively process binary media files — images, video, audio, PDFs. meow:multimodal bridges this gap by sending files to Google Gemini's multimodal API and returning structured analysis. Python scripts handle the API calls, file uploads, and response parsing. Claude only reviews the final analysis. The skill also supports image generation (Imagen 4) and video generation (Veo 3).

Core Capabilities

  • Analysis — Screenshot description, mockup analysis, audio transcription, video scene detection, PDF text extraction, OCR
  • Generation — Image creation via Imagen 4, video creation via Veo 3.1
  • Smart model selectiongemini-2.5-flash for analysis (cost-effective), imagen-4.0-generate-001 for images, veo-3.1-generate-preview for video
  • Size handling — Files <20MB inline, >20MB via File API (up to 2GB)
  • Audio splitting guidance — For transcripts >15 min, recommends splitting to avoid truncation
  • Context budget — 3000 tokens inline max. Overflow to .claude/memory/multimodal-cache/

Requires GEMINI_API_KEY

This skill requires a Google Gemini API key. Get one at aistudio.google.com/apikey. The skill will stop immediately if the key is missing — it never attempts an API call without it.

When to Use This

Use meow:multimodal when...

  • You need to analyze a screenshot, mockup, or diagram
  • You need to transcribe a meeting recording or podcast
  • You need to extract text from a PDF or scanned document
  • You need to generate an image or video from a description

Usage

bash
# Analyze an image
/meow:multimodal screenshot.png analyze

# Transcribe audio
/meow:multimodal meeting.mp3 transcribe

# Extract text from PDF
/meow:multimodal document.pdf extract

# Generate an image
/meow:multimodal --generate "sunset over mountains"

Example Prompts

PromptModel usedOutput
analyze this screenshot + image pathgemini-2.5-flashStructured description of UI elements
transcribe this meeting + audio pathgemini-2.5-flashTimestamped transcript [HH:MM:SS]
extract text from this PDF + PDF pathgemini-2.5-flashMarkdown-formatted content
generate image of a logoimagen-4.0-generate-001Generated PNG file

Quick Workflow

File path → API Key Check (STOP if missing)
  → Detect modality (image/video/audio/PDF)
  → Check file size (inline <20MB, File API >20MB)
  → Select model → Execute Python script
  → Budget check (≤3000 tokens inline, else write to cache)
  → Return structured analysis

Skill Details

Phase: any

Released under the MIT License.