Getting Started

Fast speech-to-text transcription library for Node.js with local and cloud backends

cuttledoc transcribes audio and video files to text in Node.js. It can run fully offline with local models (Parakeet, Whisper) or call the OpenAI cloud API, and it can optionally pass the transcript through an LLM to add formatting, a TLDR, and corrections.

Installation

pnpm add cuttledoc

Requirements

Node.js 24+
~2 GB disk space for models

CLI Usage

# Basic transcription (uses local Parakeet)
npx cuttledoc video.mp4

# With LLM enhancement (adds formatting, TLDR, corrections)
npx cuttledoc podcast.mp3 --enhance -o transcript.md

# Use specific backend and language
npx cuttledoc meeting.m4a -b parakeet -l de

# Use OpenAI cloud API (best quality)
export OPENAI_API_KEY=sk-...
npx cuttledoc meeting.m4a -b openai

# Show processing statistics
npx cuttledoc audio.wav --stats

API Usage

import { transcribe } from 'cuttledoc'

// Local transcription (offline)
const result = await transcribe('audio.mp3', {
  language: 'en',
  backend: 'auto' // auto, whisper, parakeet, openai
})

console.log(result.text)
console.log(`Duration: ${result.durationSeconds}s`)

// Cloud transcription (OpenAI)
const cloudResult = await transcribe('audio.mp3', {
  backend: 'openai',
  apiKey: process.env.OPENAI_API_KEY
})
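
One way to switch between the two calls above is to derive the options from whether an API key is available. The sketch below is only an illustration: `chooseTranscribeOptions` and the `TranscribeOptions` interface are hypothetical names, not part of cuttledoc, and the interface lists just the `backend` and `apiKey` fields that appear in the examples above.

```typescript
// Hypothetical option shape: only the fields used in the examples above.
interface TranscribeOptions {
  backend: "auto" | "whisper" | "parakeet" | "openai";
  apiKey?: string;
}

// Prefer the OpenAI backend when a non-empty key is supplied; otherwise
// return backend 'auto', which the local example above uses.
function chooseTranscribeOptions(apiKey?: string): TranscribeOptions {
  const key = apiKey?.trim();
  if (key) {
    return { backend: "openai", apiKey: key };
  }
  return { backend: "auto" };
}

// Usage (assumes cuttledoc is installed and `transcribe` is imported as above):
//   const result = await transcribe("audio.mp3",
//     chooseTranscribeOptions(process.env.OPENAI_API_KEY))
```

Keeping the decision in a small pure function makes it easy to test without touching audio files or the network.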

With LLM Enhancement

import { transcribe } from 'cuttledoc'
import { enhanceTranscript } from 'cuttledoc/llm'

const result = await transcribe('podcast.mp3')

const enhanced = await enhanceTranscript(result.text, {
  model: 'gemma3n:e4b',
  mode: 'enhance' // or 'correct' for minimal changes
})

console.log(enhanced.markdown)

Quality Benchmark

Word Error Rate (WER, lower is better) on FLEURS native-speaker recordings:

Backend	🇬🇧 EN	🇪🇸 ES	🇩🇪 DE	🇫🇷 FR	🇧🇷 PT	Avg WER	RTF
gpt-4o-mini-transcribe	5.7%	1.3%	3.4%	7.3%	6.0%	4.8%	0.10
gpt-4o-transcribe	9.9%	2.1%	2.8%	6.3%	4.6%	5.1%	0.16
Whisper large-v3	4.9%	2.1%	2.8%	10.6%	5.2%	5.1%	2.2
Parakeet v3	4.6%	3.6%	4.5%	10.1%	9.0%	6.4%	0.24

RTF = Real-Time Factor, i.e. processing time divided by audio duration (lower = faster). All values measured on Apple M1 Pro.
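
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognized text, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function names are hypothetical, and the text normalization (lowercasing, whitespace splitting) is an assumption that may differ from what the benchmark above used.

```typescript
// Assumed normalization: lowercase and split on whitespace.
function splitWords(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = splitWords(reference);
  const hyp = splitWords(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] = distance between the reference words seen so far
  // (initially none) and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

With this definition, one wrong word in a four-word reference gives a WER of 25%. Because insertions count as errors, WER can exceed 100%.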

🏆 Ranking by Accuracy

Rank	Backend	Avg WER	Best for
🥇	gpt-4o-mini-transcribe	4.8%	Cloud, best overall + cheapest
🥈	gpt-4o-transcribe	5.1%	Cloud, best for DE
🥈	Whisper large-v3	5.1%	Offline, broadest language support
4	Parakeet v3	6.4%	Fast + accurate, 25 European langs

⚡ Ranking by Speed

Rank	Backend	RTF	Best for
🥇	gpt-4o-mini-transcribe	0.10	Cloud, fastest + cheapest
🥈	gpt-4o-transcribe	0.16	Cloud, premium quality
🥉	Parakeet v3	0.24	Real-time, batch processing
4	Whisper large-v3	2.2	Quality-focused, offline

RTF = Real-Time Factor. An RTF of 0.10 means 10 s of audio is transcribed in 1.0 s.
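
The RTF relationship can be written as two small helpers: one that derives RTF from a measured run and one that estimates processing time for a given audio length. This is a sketch with hypothetical function names. The estimate assumes the RTF values from the table carry over to your hardware and network, which they may not.

```typescript
// RTF = processing time / audio duration.
function realTimeFactor(processingSeconds: number, audioSeconds: number): number {
  if (audioSeconds <= 0) {
    throw new RangeError("audioSeconds must be positive");
  }
  return processingSeconds / audioSeconds;
}

// Estimated processing time = audio duration * RTF.
function estimateProcessingSeconds(audioSeconds: number, rtf: number): number {
  if (audioSeconds < 0 || rtf < 0) {
    throw new RangeError("audioSeconds and rtf must be non-negative");
  }
  return audioSeconds * rtf;
}

// Example: estimate a one-hour recording with the RTF values listed above.
const oneHour = 3600;
for (const [backend, rtf] of [
  ["gpt-4o-mini-transcribe", 0.10],
  ["Parakeet v3", 0.24],
  ["Whisper large-v3", 2.2],
] as const) {
  const minutes = estimateProcessingSeconds(oneHour, rtf) / 60;
  console.log(`${backend}: ~${minutes.toFixed(1)} min`);
}
```

At the listed values, a one-hour recording would take roughly 6 minutes with gpt-4o-mini-transcribe, about 14 minutes with Parakeet v3, and about 2 hours 12 minutes with Whisper large-v3.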

Available Backends

Local Backends (Offline, No API Key)

Backend	RTF	Avg WER	Languages	Size
Parakeet v3 (default)	0.24	6.4%	25	160 MB
Whisper large-v3	2.2	5.1%	99	1.6 GB

Cloud Backends (Requires API Key)

Backend	RTF	Avg WER	Languages	Cost
gpt-4o-mini-transcribe	0.10	4.8%	50+	~$0.003/min
gpt-4o-transcribe	0.16	5.1%	50+	~$0.006/min
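
To get a feel for the cloud pricing above, the per-minute rates can be turned into a rough cost estimate. This is a sketch only: `estimateCostUsd` is a hypothetical helper, and the rates are the approximate figures from the table, so actual billing may differ.

```typescript
// Approximate per-minute rates (USD) copied from the table above.
const COST_PER_MINUTE_USD = {
  "gpt-4o-mini-transcribe": 0.003,
  "gpt-4o-transcribe": 0.006,
} as const;

type CloudBackend = keyof typeof COST_PER_MINUTE_USD;

// Rough cost = audio minutes * per-minute rate.
function estimateCostUsd(backend: CloudBackend, audioSeconds: number): number {
  if (audioSeconds < 0) {
    throw new RangeError("audioSeconds must be non-negative");
  }
  return (audioSeconds / 60) * COST_PER_MINUTE_USD[backend];
}

// Example: a one-hour recording on each cloud backend.
console.log(estimateCostUsd("gpt-4o-mini-transcribe", 3600).toFixed(2));
console.log(estimateCostUsd("gpt-4o-transcribe", 3600).toFixed(2));
```

At these rates, an hour of audio costs roughly $0.18 with gpt-4o-mini-transcribe and $0.36 with gpt-4o-transcribe.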

Model Management

# List available models
cuttledoc models list

# Download speech models
cuttledoc models download parakeet-tdt-0.6b-v3   # 160 MB, 25 languages
cuttledoc models download whisper-large-v3       # 1.6 GB, 99 languages

# Download LLM model (for --enhance)
cuttledoc models download gemma3n:e4b