cuttledoc

Getting Started

Fast speech-to-text transcription library for Node.js with local and cloud backends

cuttledoc is a fast speech-to-text transcription library for Node.js. It supports multiple backends (local and cloud) and optional LLM enhancement for formatting transcripts.

Installation

pnpm add cuttledoc

Requirements

  • Node.js 24+
  • ~2GB disk space for models

CLI Usage

# Basic transcription (uses local Parakeet)
npx cuttledoc video.mp4

# With LLM enhancement (adds formatting, TLDR, corrections)
npx cuttledoc podcast.mp3 --enhance -o transcript.md

# Use specific backend and language
npx cuttledoc meeting.m4a -b parakeet -l de

# Use OpenAI cloud API (best quality)
export OPENAI_API_KEY=sk-...
npx cuttledoc meeting.m4a -b openai

# Show processing statistics
npx cuttledoc audio.wav --stats

API Usage

import { transcribe } from 'cuttledoc'

// Local transcription (offline)
const result = await transcribe('audio.mp3', {
  language: 'en',
  backend: 'auto' // auto, whisper, parakeet, openai
})

console.log(result.text)
console.log(`Duration: ${result.durationSeconds}s`)

// Cloud transcription (OpenAI)
const cloudResult = await transcribe('audio.mp3', {
  backend: 'openai',
  apiKey: process.env.OPENAI_API_KEY
})
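The two calls above differ only in their options, so choosing between local and cloud transcription can be reduced to building the options object. The following is a minimal sketch, assuming the `backend`, `language`, and `apiKey` options shown above; `pickOptions` and the `TranscribeOptions` type are hypothetical helpers for illustration, not part of cuttledoc.

```typescript
// Hypothetical helper, not part of cuttledoc: builds the options object
// for transcribe() depending on whether an OpenAI API key is available.
type Backend = 'auto' | 'whisper' | 'parakeet' | 'openai'

interface TranscribeOptions {
  backend: Backend
  language?: string
  apiKey?: string
}

function pickOptions(
  env: Record<string, string | undefined>,
  language?: string
): TranscribeOptions {
  const apiKey = env.OPENAI_API_KEY
  // Only set `language` when the caller provided one.
  const base = language ? { language } : {}
  if (apiKey) {
    // A key is present: use the cloud backend.
    return { ...base, backend: 'openai', apiKey }
  }
  // No key: stay offline and let the library choose a local backend.
  return { ...base, backend: 'auto' }
}

// Usage (assumes cuttledoc is installed):
//   import { transcribe } from 'cuttledoc'
//   const result = await transcribe('audio.mp3', pickOptions(process.env, 'en'))
```

Keeping the decision in a pure function makes it easy to test without audio files or network access.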

With LLM Enhancement

import { transcribe } from 'cuttledoc'
import { enhanceTranscript } from 'cuttledoc/llm'

const result = await transcribe('podcast.mp3')

const enhanced = await enhanceTranscript(result.text, {
  model: 'gemma3n:e4b',
  mode: 'enhance' // or 'correct' for minimal changes
})

console.log(enhanced.markdown)

Quality Benchmark

Word Error Rate (WER) on FLEURS native speaker recordings:

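| Backend | 🇬🇧 EN | 🇪🇸 ES | 🇩🇪 DE | 🇫🇷 FR | 🇧🇷 PT | Avg WER | RTF |
|---|---|---|---|---|---|---|---|
| gpt-4o-mini-transcribe | 5.7% | 1.3% | 3.4% | 7.3% | 6.0% | 4.8% | 0.10 |
| gpt-4o-transcribe | 9.9% | 2.1% | 2.8% | 6.3% | 4.6% | 5.1% | 0.16 |
| Whisper large-v3 | 4.9% | 2.1% | 2.8% | 10.6% | 5.2% | 5.1% | 2.2 |
| Parakeet v3 | 4.6% | 3.6% | 4.5% | 10.1% | 9.0% | 6.4% | 0.24 |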

RTF = Real-Time Factor (lower = faster). All values measured on Apple M1 Pro.
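For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognized text, divided by the number of reference words. The sketch below illustrates that standard definition. It is an assumption that the benchmark follows it; the text normalization here (lowercasing, whitespace splitting) is chosen for illustration and may differ from what the benchmark used.

```typescript
// Illustrative WER: word-level Levenshtein distance / reference word count.
// The normalization (lowercase, split on whitespace) is an assumption.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(Boolean)
}

function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference)
  const hyp = tokenize(hypothesis)
  if (ref.length === 0) {
    throw new Error('reference must contain at least one word')
  }
  // prev[j] holds the distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j)
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)
      const deletion = prev[j] + 1
      const insertion = curr[j - 1] + 1
      curr.push(Math.min(substitution, deletion, insertion))
    }
    prev = curr
  }
  return prev[hyp.length] / ref.length
}

// One dropped word out of four reference words gives a WER of 0.25.
console.log(wordErrorRate('the quick brown fox', 'the quick fox'))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.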

🏆 Ranking by Accuracy

| Rank | Backend | Avg WER | Best for |
|---|---|---|---|
| 🥇 | gpt-4o-mini-transcribe | 4.8% | Cloud, best overall + cheapest |
| 🥈 | gpt-4o-transcribe | 5.1% | Cloud, best for DE |
| 🥈 | Whisper large-v3 | 5.1% | Offline, broadest language support |
| 4 | Parakeet v3 | 6.4% | Fast + accurate, 25 European langs |

⚡ Ranking by Speed

| Rank | Backend | RTF | Best for |
|---|---|---|---|
| 🥇 | gpt-4o-mini-transcribe | 0.10 | Cloud, fastest + cheapest |
| 🥈 | gpt-4o-transcribe | 0.16 | Cloud, premium quality |
| 🥉 | Parakeet v3 | 0.24 | Real-time, batch processing |
| 4 | Whisper large-v3 | 2.2 | Quality-focused, offline |

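RTF = Real-Time Factor: processing time divided by audio duration. An RTF of 0.10 means 10 s of audio is transcribed in 1.0 s; an RTF above 1 means transcription takes longer than the audio itself.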
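To turn an RTF into an expected wait, multiply it by the audio duration. The sketch below applies that arithmetic to the RTF values from the table above; `estimateProcessingSeconds` is a hypothetical helper, and actual times depend on hardware and, for cloud backends, on network conditions.

```typescript
// RTF values copied from the speed ranking above (Apple M1 Pro).
const rtfByBackend: Record<string, number> = {
  'gpt-4o-mini-transcribe': 0.10,
  'gpt-4o-transcribe': 0.16,
  'Parakeet v3': 0.24,
  'Whisper large-v3': 2.2
}

// Hypothetical helper: estimated processing time = audio duration x RTF.
function estimateProcessingSeconds(audioSeconds: number, rtf: number): number {
  if (audioSeconds < 0 || rtf < 0) {
    throw new Error('audioSeconds and rtf must be non-negative')
  }
  return audioSeconds * rtf
}

// Estimate the wait for a one-hour recording on each backend.
const oneHour = 3600
for (const [backend, rtf] of Object.entries(rtfByBackend)) {
  const minutes = estimateProcessingSeconds(oneHour, rtf) / 60
  console.log(`${backend}: ~${minutes.toFixed(1)} min`)
}
```

By this estimate, a one-hour recording takes roughly a quarter of an hour with Parakeet v3 and over two hours with Whisper large-v3 on the benchmark hardware.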

Available Backends

Local Backends (Offline, No API Key)

| Backend | RTF | Avg WER | Languages | Size |
|---|---|---|---|---|
| Parakeet v3 (default) | 0.24 | 6.4% | 25 | 160 MB |
| Whisper large-v3 | 2.2 | 5.1% | 99 | 1.6 GB |

Cloud Backends (Requires API Key)

| Backend | RTF | Avg WER | Languages | Cost |
|---|---|---|---|---|
| gpt-4o-mini-transcribe | 0.10 | 4.8% | 50+ | ~$0.003/min |
| gpt-4o-transcribe | 0.16 | 5.1% | 50+ | ~$0.006/min |

Model Management

# List available models
cuttledoc models list

# Download speech models
cuttledoc models download parakeet-tdt-0.6b-v3   # 160 MB, 25 languages
cuttledoc models download whisper-large-v3       # 1.6 GB, 99 languages

# Download LLM model (for --enhance)
cuttledoc models download gemma3n:e4b
