Video to Subtitles (Speech to Text)

Generate SRT, VTT, or plain-text subtitles from any video or audio by transcribing the speech locally in your browser with Whisper — no uploads.


Turn a video's spoken audio into ready-to-use subtitles without uploading anything. Drop in an MP4, MOV, WebM, MKV or audio file, and this tool extracts the sound, runs OpenAI's Whisper speech-recognition model inside your browser, and produces timestamped SRT, WebVTT, or plain-text captions you can edit and download. The audio never leaves your device; only the open-source model weights are fetched once from a public CDN, so your recordings stay private.

What is Video to Subtitles (Speech to Text)?

A free, private video-to-subtitles generator that transcribes speech into timestamped captions entirely in your browser. It uses ffmpeg compiled to WebAssembly to pull 16 kHz audio out of your video, then runs the multilingual Whisper model (via WebGPU when available, otherwise the CPU) to recognise the speech and place it on a timeline. Creators, editors, students and accessibility teams use it to caption interviews, lectures, tutorials and social clips in 90+ languages — including Korean, Japanese and Chinese — and to export SRT for video editors, WebVTT for the web, or a clean text transcript. Pick a model size to trade speed for accuracy, auto-detect the language or set it, optionally translate the speech to English, then fix any line in the built-in editor before downloading.
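
As a rough illustration of the audio-extraction step, the sketch below builds the kind of argument list an ffmpeg build could be given to pull 16 kHz audio out of a video. The flags are standard ffmpeg options, but the helper name, the file names, the mono downmix, and the WAV output are assumptions for illustration; the tool's actual invocation is not documented here.

```typescript
// Hypothetical helper: builds ffmpeg arguments for extracting speech-ready audio.
// File names are placeholders, not the tool's real internal names.
function buildExtractArgs(inputName: string, outputName: string): string[] {
  return [
    "-i", inputName, // source video or audio file
    "-vn",           // ignore any video stream
    "-ac", "1",      // downmix to mono (assumption: the page only states 16 kHz)
    "-ar", "16000",  // resample to 16 kHz
    "-f", "wav",     // write a WAV container (assumption)
    outputName,
  ];
}

const extractArgs = buildExtractArgs("input.mp4", "audio.wav");
console.log(extractArgs.join(" "));
```

The same list works with a command-line ffmpeg, which makes it easy to check the extraction outside the browser.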

How to use Video to Subtitles (Speech to Text)

  1. Drop a video or audio file onto the dropzone, or click to pick one. Nothing is uploaded — the file is read locally.
  2. Choose a model: Tiny for speed, Small for a balance (recommended), or Turbo for the best accuracy. Larger models download more data the first time.
  3. Leave the language on Auto-detect, or pick the spoken language to improve accuracy. Turn on Translate to English if you want English subtitles from other-language speech.
  4. Press Generate subtitles. On the first run the model downloads once (then it is cached); the audio is extracted and transcribed in your browser.
  5. Pick SRT, VTT or Text, edit any line to fix wording, preview the captions on the video, then download the subtitle file.
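
The three export formats in step 5 differ mainly in how cues are framed. The sketch below shows one way to turn timestamped segments into SRT, WebVTT, and plain text. The `Segment` shape and the function names are hypothetical and are not taken from the tool's source.

```typescript
// Hypothetical segment shape: start and end are in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Formats seconds as HH:MM:SS plus milliseconds.
// SRT separates milliseconds with a comma, WebVTT with a dot.
function formatTimestamp(seconds: number, msSeparator: "," | "."): string {
  const pad = (n: number, width: number): string => {
    let s = String(n);
    while (s.length < width) s = "0" + s;
    return s;
  };
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + msSeparator + pad(ms, 3);
}

// SRT: numbered cues separated by a blank line.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${formatTimestamp(seg.start, ",")} --> ${formatTimestamp(seg.end, ",")}\n${seg.text.trim()}\n`)
    .join("\n");
}

// WebVTT: a "WEBVTT" header line, then cues separated by a blank line.
function toVtt(segments: Segment[]): string {
  const cues = segments.map((seg) =>
    `${formatTimestamp(seg.start, ".")} --> ${formatTimestamp(seg.end, ".")}\n${seg.text.trim()}\n`);
  return "WEBVTT\n\n" + cues.join("\n");
}

// Plain text: one line per segment, no timing information.
function toPlainText(segments: Segment[]): string {
  return segments.map((seg) => seg.text.trim()).join("\n");
}

const sample: Segment[] = [{ start: 0, end: 2.5, text: "Hello and welcome." }];
console.log(toSrt(sample));
```

Keeping the segments as data and rendering each format on demand means an edit to one line's text carries over to all three exports.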

Examples

Caption a Korean interview as an SRT file

Drop the clip, keep the model on Small (or Turbo for cleaner Korean), leave language on Auto-detect, and export a timestamped .srt to load into your video editor.

Make WebVTT captions for a web video

Generate subtitles, switch the format to VTT, and download a .vtt file you can attach to an HTML5 <video> with a <track> element for accessible playback.
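
For reference, attaching the downloaded file might look like the markup produced by this small sketch. The `<track>` attributes (`kind`, `src`, `srclang`, `label`, `default`) are standard HTML; the helper names and file names are hypothetical.

```typescript
// Escapes characters that are unsafe inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: returns HTML for a video with one caption track.
function videoWithCaptions(videoSrc: string, vttSrc: string, lang: string, label: string): string {
  return [
    `<video controls src="${escapeAttr(videoSrc)}">`,
    `  <track kind="captions" src="${escapeAttr(vttSrc)}" srclang="${escapeAttr(lang)}" label="${escapeAttr(label)}" default>`,
    `</video>`,
  ].join("\n");
}

console.log(videoWithCaptions("talk.mp4", "talk.en.vtt", "en", "English"));
```

Browsers generally load track files only over HTTP(S) and from the same origin unless CORS is configured, so the .vtt file usually needs to be served alongside the video.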

Translate a Japanese lecture into English subtitles

Turn on Translate to English before generating, and Whisper outputs English captions timed to the original speech — handy for sharing talks with a wider audience.

Frequently asked questions

Is my video uploaded to a server?
No. Both steps run 100% in your browser: ffmpeg (WebAssembly) extracts the audio and Whisper transcribes it on your device. Your media never leaves your computer. The only network request is a one-time download of the open-source model weights from a public CDN.
Which languages and formats are supported?
Whisper is multilingual and handles 90+ languages, including Korean, English, Japanese, Chinese, Spanish and more, with auto-detection. You can export SRT, WebVTT, or a plain-text transcript, and optionally translate non-English speech to English subtitles.
Which model should I choose?
Small is the recommended default and the practical minimum for good results in Korean and other CJK languages. Tiny is the fastest and lightest but less accurate. Turbo (large-v3-turbo) is the most accurate, but it downloads several hundred megabytes and runs best with WebGPU. All models are downloaded once and cached.
Why is the first run slow?
The first time you use a model, its weights download once (tens of MB for Tiny/Small, more for Turbo) and are then cached for next time. Transcription itself is much faster in WebGPU-capable browsers; without WebGPU it falls back to the CPU, and long videos can take a while.
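
The WebGPU-or-CPU fallback described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: `pickDevice` and the `NavigatorLike` type are hypothetical, and only `navigator.gpu.requestAdapter()` is part of the real WebGPU API.

```typescript
// Minimal structural types for the part of the WebGPU API used here.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Device = "webgpu" | "cpu";

// Returns "webgpu" only when the API exists and an adapter is actually available.
async function pickDevice(nav: NavigatorLike): Promise<Device> {
  if (!nav.gpu) return "cpu"; // browser does not expose WebGPU
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "cpu"; // null means no usable GPU
  } catch {
    return "cpu"; // treat any failure as "no GPU"
  }
}

// In a browser this would be called as: pickDevice(navigator)
pickDevice({}).then((device) => console.log(device));
```

Checking for an adapter, rather than only for the presence of `navigator.gpu`, matters because some browsers expose the API on machines without a usable GPU.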
Are the captions accurate enough to publish?
Auto-generated captions are a strong first draft but not perfect — they can mishear names or add stray text on music or silence. That is why every line is editable here: review and fix the transcript before you download it, especially for accessibility.
Is there a file size limit?
Everything runs in your browser's memory, so very large or very long files can be slow or run out of memory. Files over about 500 MB show a warning and files over 2 GB are blocked. For long recordings, a shorter clip or a smaller model helps.
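
The size limits above amount to a two-threshold check, sketched here. The function name is hypothetical, and treating MB and GB as binary units (1024-based) is an assumption, since the text says only "about 500 MB" and "2 GB".

```typescript
const MB = 1024 * 1024;            // assumption: binary megabytes
const WARN_BYTES = 500 * MB;       // above this, show a warning
const BLOCK_BYTES = 2 * 1024 * MB; // above this, refuse the file

type SizeVerdict = "ok" | "warn" | "blocked";

// Classifies a file by size before any processing starts.
function checkFileSize(bytes: number): SizeVerdict {
  if (bytes > BLOCK_BYTES) return "blocked";
  if (bytes > WARN_BYTES) return "warn";
  return "ok";
}

console.log(checkFileSize(750 * MB));
```

In a browser, the `size` property of a `File` object gives the byte count to pass in.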
