Getting Started
Fast speech-to-text transcription library for Node.js with local and cloud backends
cuttledoc transcribes audio and video files to text from Node.js. It supports multiple backends (local and cloud) and optional LLM enhancement for formatting transcripts.
Installation
pnpm add cuttledoc
Requirements
- Node.js 24+
- ~2 GB disk space for models
CLI Usage
# Basic transcription (uses local Parakeet)
npx cuttledoc video.mp4
# With LLM enhancement (adds formatting, TLDR, corrections)
npx cuttledoc podcast.mp3 --enhance -o transcript.md
# Use specific backend and language
npx cuttledoc meeting.m4a -b parakeet -l de
# Use OpenAI cloud API (best quality)
export OPENAI_API_KEY=sk-...
npx cuttledoc meeting.m4a -b openai
# Show processing statistics
npx cuttledoc audio.wav --stats
API Usage
import { transcribe } from 'cuttledoc'
// Local transcription (offline)
const result = await transcribe('audio.mp3', {
  language: 'en',
  backend: 'auto' // auto, whisper, parakeet, openai
})
console.log(result.text)
console.log(`Duration: ${result.durationSeconds}s`)
// Cloud transcription (OpenAI)
const cloudResult = await transcribe('audio.mp3', {
  backend: 'openai',
  apiKey: process.env.OPENAI_API_KEY
})
With LLM Enhancement
import { transcribe } from 'cuttledoc'
import { enhanceTranscript } from 'cuttledoc/llm'
const result = await transcribe('podcast.mp3')
const enhanced = await enhanceTranscript(result.text, {
  model: 'gemma3n:e4b',
  mode: 'enhance' // or 'correct' for minimal changes
})
console.log(enhanced.markdown)
Quality Benchmark
Word Error Rate (WER) on FLEURS native speaker recordings:
| Backend | 🇬🇧 EN | 🇪🇸 ES | 🇩🇪 DE | 🇫🇷 FR | 🇧🇷 PT | Avg WER | RTF |
|---|---|---|---|---|---|---|---|
| gpt-4o-mini-transcribe | 5.7% | 1.3% | 3.4% | 7.3% | 6.0% | 4.8% | 0.10 |
| gpt-4o-transcribe | 9.9% | 2.1% | 2.8% | 6.3% | 4.6% | 5.1% | 0.16 |
| Whisper large-v3 | 4.9% | 2.1% | 2.8% | 10.6% | 5.2% | 5.1% | 2.2 |
| Parakeet v3 | 4.6% | 3.6% | 4.5% | 10.1% | 9.0% | 6.4% | 0.24 |
RTF = Real-Time Factor: processing time divided by audio duration (lower = faster). All values measured on Apple M1 Pro.
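For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration, not the scoring code used for this benchmark. `tokenize` and `wordErrorRate` are hypothetical helpers, and the text normalization (lowercasing, stripping punctuation) is an assumption that may differ from the benchmark's.

```javascript
// Assumed normalization: lowercase, replace punctuation with spaces, split on whitespace.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, ' ')
    .split(/\s+/)
    .filter(Boolean)
}

// WER = word-level Levenshtein distance / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference)
  const hyp = tokenize(hypothesis)
  if (ref.length === 0) {
    throw new Error('reference must contain at least one word')
  }
  // prev[j] holds the distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j)
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i]
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)
      const deletion = prev[j] + 1
      const insertion = curr[j - 1] + 1
      curr[j] = Math.min(substitution, deletion, insertion)
    }
    prev = curr
  }
  return prev[hyp.length] / ref.length
}

// One substituted word out of five reference words.
const wer = wordErrorRate('the quick brown fox jumps', 'the quick brown box jumps')
console.log(`${(wer * 100).toFixed(1)}%`) // prints "20.0%"
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.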
🏆 Ranking by Accuracy
| Rank | Backend | Avg WER | Best for |
|---|---|---|---|
| 🥇 | gpt-4o-mini-transcribe | 4.8% | Cloud, best overall + cheapest |
| 🥈 | gpt-4o-transcribe | 5.1% | Cloud, best for FR + PT, ties for DE |
| 🥈 | Whisper large-v3 | 5.1% | Offline, broadest language support |
| 4 | Parakeet v3 | 6.4% | Fast + accurate, 25 European langs |
⚡ Ranking by Speed
| Rank | Backend | RTF | Best for |
|---|---|---|---|
| 🥇 | gpt-4o-mini-transcribe | 0.10 | Cloud, fastest + cheapest |
| 🥈 | gpt-4o-transcribe | 0.16 | Cloud, premium quality |
| 🥉 | Parakeet v3 | 0.24 | Real-time, batch processing |
| 4 | Whisper large-v3 | 2.2 | Quality-focused, offline |
RTF = Real-Time Factor. An RTF of 0.10 means 10 s of audio is transcribed in 1.0 s.
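As a rough planning aid, the RTF values above can be turned into expected processing times by multiplying them by the audio duration. This is a sketch that assumes RTF stays roughly constant with file length and that your hardware is comparable to the benchmark machine. `estimateProcessingSeconds` is a hypothetical helper, not part of cuttledoc, and the object keys are the display names from the table rather than backend identifiers.

```javascript
// RTF = processing time / audio duration, so processing time = duration * RTF.
function estimateProcessingSeconds(audioSeconds, rtf) {
  if (audioSeconds < 0 || rtf <= 0) {
    throw new RangeError('audioSeconds must be >= 0 and rtf must be > 0')
  }
  return audioSeconds * rtf
}

// RTF values copied from the speed ranking table (measured on Apple M1 Pro).
const RTF_BY_BACKEND = {
  'gpt-4o-mini-transcribe': 0.10,
  'gpt-4o-transcribe': 0.16,
  'Parakeet v3': 0.24,
  'Whisper large-v3': 2.2
}

// Estimate how long one hour of audio would take on each backend.
const oneHourSeconds = 60 * 60
for (const [backend, rtf] of Object.entries(RTF_BY_BACKEND)) {
  const minutes = estimateProcessingSeconds(oneHourSeconds, rtf) / 60
  console.log(`${backend}: ~${minutes.toFixed(1)} min for 1 h of audio`)
}
```

An RTF above 1.0, as with Whisper large-v3 here, means transcription takes longer than the audio itself, which rules it out for live use on this hardware.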
Available Backends
Local Backends (Offline, No API Key)
| Backend | RTF | Avg WER | Languages | Size |
|---|---|---|---|---|
| Parakeet v3 (default) | 0.24 | 6.4% | 25 | 160 MB |
| Whisper large-v3 | 2.2 | 5.1% | 99 | 1.6 GB |
Cloud Backends (Requires API Key)
| Backend | RTF | Avg WER | Languages | Cost |
|---|---|---|---|---|
| gpt-4o-mini-transcribe | 0.10 | 4.8% | 50+ | ~$0.003/min |
| gpt-4o-transcribe | 0.16 | 5.1% | 50+ | ~$0.006/min |
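To compare the two cloud backends on price, the approximate per-minute costs in the table can be multiplied by the audio length. The sketch below is only an estimate under the assumption that billing is linear in audio duration. `estimateCloudCostUsd` is a hypothetical helper, not a cuttledoc API, and the prices are the approximate figures from the table, which may change.

```javascript
// Approximate prices (USD per minute of audio) copied from the cloud backends table.
const COST_PER_MINUTE_USD = new Map([
  ['gpt-4o-mini-transcribe', 0.003],
  ['gpt-4o-transcribe', 0.006]
])

// Estimated cost = audio minutes * price per minute (assumes linear billing).
function estimateCloudCostUsd(backend, audioSeconds) {
  const rate = COST_PER_MINUTE_USD.get(backend)
  if (rate === undefined) {
    throw new Error(`no price listed for backend: ${backend}`)
  }
  if (audioSeconds < 0) {
    throw new RangeError('audioSeconds must be >= 0')
  }
  return (audioSeconds / 60) * rate
}

// Example: a 90-minute recording on each cloud backend.
const podcastSeconds = 90 * 60
for (const backend of COST_PER_MINUTE_USD.keys()) {
  const cost = estimateCloudCostUsd(backend, podcastSeconds)
  console.log(`${backend}: ~$${cost.toFixed(2)}`)
}
```

The local backends have no per-minute charge, so this comparison only matters when choosing between the two cloud options.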
Model Management
# List available models
cuttledoc models list
# Download speech models
cuttledoc models download parakeet-tdt-0.6b-v3 # 160 MB, 25 languages
cuttledoc models download whisper-large-v3 # 1.6 GB, 99 languages
# Download LLM model (for --enhance)
cuttledoc models download gemma3n:e4b