OpenAI Whisper — OpenClaw Skill
Local speech-to-text transcription with no API key required.
What This Skill Does
The OpenAI Whisper skill gives your OpenClaw agent local speech-to-text transcription powered by OpenAI's Whisper model. Your agent can transcribe audio files in any format ffmpeg can decode (MP3, M4A, WAV, OGG, and more), translate non-English audio to English, and output results as plain text, SRT subtitles, or VTT captions. Everything runs locally on your machine -- no API key or internet connection is needed after the initial model download.
Whisper supports multiple model sizes from tiny (fastest, least accurate) to large (slowest, most accurate), with turbo as the default on most installations. Models are downloaded once to ~/.cache/whisper and cached for future use. This makes it ideal for transcribing meeting recordings, voice memos, podcast episodes, or any audio content without sending data to external servers.
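As a sketch of the size/accuracy trade-off, assuming the `whisper` CLI is already installed, you pick a model per run with `--model` (the filename `memo.wav` is just a placeholder):

```shell
# Fast draft pass with the smallest model (small download, lower accuracy)
whisper memo.wav --model tiny --output_format txt

# Higher-accuracy pass with a larger model (bigger one-time download)
whisper memo.wav --model medium --output_format txt

# Downloaded weights are cached here and reused on later runs
ls ~/.cache/whisper
```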
For faster cloud-based transcription when privacy is less of a concern, see the Whisper API skill. For capturing audio from cameras, pair this with the CamSnap skill.
Example Prompts
Transcribe the meeting recording at ~/Downloads/meeting.mp3 and save it as a text file
Translate this Spanish audio file to English text
Generate SRT subtitles for the podcast episode at /tmp/episode.m4a
Transcribe this voice memo using the medium model for better accuracy
Convert the audio from my camera clip to text and summarize what was said
Transcribe all the .m4a files in ~/Voice Memos/ and put the transcripts in ~/Transcripts/
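A batch prompt like the last one maps to a simple loop. A minimal sketch, assuming the `whisper` CLI is on your PATH and the folders exist (paths are illustrative):

```shell
# Transcribe every .m4a in ~/Voice Memos/ into ~/Transcripts/ as plain text
mkdir -p ~/Transcripts
for f in ~/"Voice Memos"/*.m4a; do
  whisper "$f" --model turbo --output_format txt --output_dir ~/Transcripts
done
```

Each input file produces a matching `.txt` file in the output directory.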
Requirements
Binary dependency: whisper
- macOS: `brew install openai-whisper`
- Storage: Models download to `~/.cache/whisper` (74MB to 2.9GB depending on model size)
- No API key required: Runs entirely locally
Setup on KiwiClaw
OpenAI Whisper is pre-installed on all KiwiClaw tenant machines with the turbo model ready to use. Upload audio files or point your agent to a file path and transcription starts immediately. No API key or configuration needed. Manage files from the KiwiClaw dashboard.
Setup Self-Hosted
- Install Whisper: `brew install openai-whisper`
- Verify: `whisper --help`
- Test: `whisper /path/to/audio.mp3 --model turbo --output_format txt`
- The model downloads automatically on first run (~1.5GB for turbo)
- Use `--model tiny` for faster but less accurate transcription
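Once the test command works, the same CLI covers the translation and subtitle use cases described above. The flags shown (`--task`, `--output_format`, `--output_dir`) are listed in `whisper --help`; the filenames are placeholders:

```shell
# Translate non-English speech directly to English text
whisper interview_es.mp3 --model turbo --task translate --output_format txt

# Generate SRT subtitles for a podcast episode
whisper episode.m4a --model turbo --output_format srt --output_dir ./subs

# Or emit VTT captions instead
whisper episode.m4a --model turbo --output_format vtt --output_dir ./subs
```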
Related Skills
- OpenAI Whisper API -- cloud-based transcription for faster results
- CamSnap -- capture camera clips for audio transcription
- Bear Notes -- save transcripts to Bear for reference
- Himalaya -- email transcripts to colleagues
FAQ
What can the OpenAI Whisper skill do in OpenClaw?
The Whisper skill lets your OpenClaw agent transcribe audio and video files locally using OpenAI's Whisper model. It supports multiple languages, translation to English, and output formats including plain text, SRT subtitles, and VTT captions.
Does the Whisper skill require an API key?
No. The local Whisper CLI runs entirely on your machine with no API key needed. Models are downloaded once to ~/.cache/whisper and run offline after that. For cloud-based transcription, see the Whisper API skill. Compare costs in our self-hosting cost guide.
How is this different from the Whisper API skill?
The Whisper (local) skill runs the model directly on your machine -- no internet connection or API key needed. The Whisper API skill sends audio to OpenAI's servers for transcription, which is faster but requires an API key and internet access.
Is the OpenAI Whisper skill safe to use?
The Whisper skill has been security-vetted by KiwiClaw. It processes audio entirely locally -- no data leaves your machine. The model weights are downloaded from OpenAI's official source on first use.