Word-level timestamps
Every transcript ships ready-to-embed subtitles — SRT and WebVTT — alongside plain text and JSON.
Drop an audio or video file, get a clean transcript with timestamps in seconds. Powered by an open-source AI model running on our own servers. Free. No signup needed.
Audio + video · up to 200 MB · 99 languages · files purged in 24h
Auto-detect by default. Switch to translate mode and get any source language → English in one pass.
No OpenAI, no Google, no third-party API. Files are deleted within 24 hours. No model training.
How it works
01 · Audio or video, up to 200 MB. mp3, wav, m4a, mp4, mov, webm: all accepted.
02 · Choose a language (or auto-detect), a model size, and whether to translate to English.
03 · Plain text, SRT, WebVTT, or full JSON with timestamps, all generated in one pass.
Self-hosted AI
SpeedScribe runs an open-source speech-recognition model on our own infrastructure — no calls to OpenAI, Google, or any other third-party API. We delete uploaded files automatically and never use them for model training.
interview.m4a · 22:14 · 21.8 MB
[00:00:04] “So we ran the experiment three times — every run gave us the same outcome.”
[00:00:11] “That’s the part that surprised everyone on the team.”
…
Languages
Auto-detect on by default. Pin a language only if the engine guesses wrong. Translate mode forces output to English from any source.
Made for
Generate show-notes and chapter timestamps in minutes.
Drop a Zoom recording, get searchable notes.
Export VTT and upload — no third-party caption service.
Transcribe research interviews while keeping files private.
Turn class recordings into study material.
Keep dictated notes searchable and exportable.
FAQ
Yes. Guests get a few free transcriptions and ~30 minutes of audio total. Registered users get 30 jobs/day and ~2 hours of audio/day, no credit card.
On clear English audio it is very accurate: the larger models reach a low single-digit word error rate (WER). Noisy recordings, strong accents, and overlapping speakers are harder, so pick a bigger model for those.
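If WER is unfamiliar: it is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference text, divided by the number of reference words. The sketch below computes that standard definition with a word-level edit distance. It is an illustration, not a description of how SpeedScribe measures accuracy; the function name is hypothetical, and lowercasing plus whitespace splitting is a simplifying assumption (real evaluations usually normalize punctuation too).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a six-word sentence gives a WER of 1/6, about 17%, so "low single-digit" means fewer than one error in every ten words.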
Common audio (mp3, wav, m4a, ogg, opus, flac) and video (mp4, mov, webm) up to 200 MB. We extract the audio internally; you don't have to convert anything beforehand.
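If you are scripting uploads, a quick local check can catch files that would be rejected. This is a minimal sketch built only from the formats and size cap listed above. The function name is hypothetical, the extension list may not be exhaustive (the text says "common" formats), and reading "200 MB" as 200 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Extensions named in the text above; the service may accept others.
ACCEPTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".ogg", ".opus", ".flac",  # audio
    ".mp4", ".mov", ".webm",                           # video
}
# Assumption: "200 MB" means 200 * 1024 * 1024 bytes.
MAX_BYTES = 200 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        size_mb = size_bytes / (1024 * 1024)
        problems.append(f"file is {size_mb:.1f} MB, over the 200 MB limit")
    return problems
```

For a local file, pass `Path(p).name` and `Path(p).stat().st_size`. The check looks only at the name and size, not at the actual container or codec.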
Yes — 99 languages. Auto-detect is on by default. There's also a translate mode that forces output to English from any source language.
Every job emits four formats: plain text (.txt), SRT subtitles (.srt), WebVTT (.vtt), and a full JSON with segment-level timestamps. You can download any of them.
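To show how the JSON and SRT outputs relate, here is a minimal sketch that turns timestamped segments into SRT cues. The JSON schema is an assumption: the page does not document it, so the `start`, `end`, and `text` keys (times in seconds) and both function names are hypothetical. The SRT layout itself (cue number, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard SubRip form.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from segments shaped like
    {"start": 4.0, "end": 10.5, "text": "..."} (assumed schema)."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Joining with a newline leaves the blank line SRT expects between cues.
    return "\n".join(cues)
```

WebVTT differs mainly in starting with a `WEBVTT` header line and using a period rather than a comma before the milliseconds. Since every job already ships `.srt` and `.vtt`, a conversion like this matters only if you post-process the JSON yourself.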
Guest uploads are deleted within 24 hours. Logged-in users keep their history for 30 days, or can delete it manually from the dashboard.
No. Your audio is used only to fulfil your transcription job. We don't repurpose it.
Roughly 0.3–0.6× the audio length on the default model. A 30-minute file takes about 9 to 18 minutes; a 5-minute file about 1.5 to 3 minutes.
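The arithmetic behind that ratio can be sketched as follows. The function name is hypothetical, and the 0.3 and 0.6 defaults are simply the bounds quoted above for the default model. Treat the result as a rough guide: other model sizes and server load are not covered by the figure.

```python
def estimate_processing_minutes(
    audio_minutes: float,
    low_ratio: float = 0.3,   # lower bound quoted for the default model
    high_ratio: float = 0.6,  # upper bound quoted for the default model
) -> tuple[float, float]:
    """Return (fastest, slowest) expected processing time in minutes."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    return audio_minutes * low_ratio, audio_minutes * high_ratio


fastest, slowest = estimate_processing_minutes(30)
print(f"30 min of audio: about {fastest:.0f} to {slowest:.0f} min of processing")
```

For the 22-minute interview shown earlier, the same calculation gives a window of roughly 7 to 13 minutes.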
Free forever. Email is the only thing we ask for.