A local CPU-only batch audio transcription tool optimized for RAGflow integration. Uses OpenAI Whisper models running entirely on your machine - no data leaves your system.
- 100% Local Processing - No external API calls, all processing on CPU
- Automatic Model Selection - Detects available RAM and suggests optimal Whisper model
- Batch Processing - Process entire directories of audio files
- RAGflow Optimized - Markdown output with metadata and chunking options
- Progress Tracking - Rich progress bars for individual files and batches
- Multi-format Support - Handles audio (MP3, WAV, M4A, FLAC, OGG, WMA, AAC) and video (MP4, AVI, MOV, MKV, WMV, FLV, WebM, M4V)
- Memory Safe - Monitors system resources and prevents OOM errors
Using GitHub Container Registry (fastest):
```bash
# Use pre-built images - no build required!
docker run --rm -v ./input:/input -v ./output:/output -v ./models:/models \
  ghcr.io/norandom/direct-transriberr:latest \
  direct-transcriber batch /input --output-dir /output --yes

# Or with Docker Compose
curl -O https://raw.githubusercontent.com/norandom/direct-transriberr/main/docker-compose.ghcr.yml
docker-compose -f docker-compose.ghcr.yml up
```
```bash
# Clone the repository
git clone https://github.com/norandom/direct-transriberr
cd direct-transriberr

# Build and run with Docker Compose
docker-compose up --build
```
Requirements:
- Python 3.9+
- FFmpeg (for audio processing)
- uv (for dependency management)
Install with uv:
```bash
# Clone the repository
git clone https://github.com/norandom/direct-transriberr
cd direct-transriberr

# Install with uv
uv pip install -e .
```
System Dependencies:
Ubuntu/Debian:
```bash
sudo apt update
sudo apt install ffmpeg
```
macOS:
```bash
brew install ffmpeg
```
CentOS/RHEL:
```bash
sudo yum install ffmpeg
```
Using the helper script for external files:
```bash
# Batch process an external directory
./scripts/transcribe-external.sh -m /path/to/your/media -o /path/to/output

# Single external file
./scripts/transcribe-external.sh -f /path/to/video.mp4 -o /path/to/output

# With a specific model
./scripts/transcribe-external.sh -m /path/to/media -o /path/to/output --model large-v3
```
Direct Docker Compose usage:
```bash
# Copy files to the input directory
cp /path/to/your/media/* ./input/

# Run transcription (models are downloaded to ./models on first run)
docker-compose up --build

# For external directories, set environment variables
MEDIA_DIR=/path/to/your/media OUTPUT_DIR=/path/to/output docker-compose --profile external up --build transcriber-external
```
Prerequisites:
```bash
# Activate the virtual environment
source .venv/bin/activate
```
Batch Processing:
```bash
# Process all audio and video files in a directory
direct-transcriber batch /path/to/media/files/

# With a custom output directory
direct-transcriber batch /audio/ --output-dir /transcriptions/

# Force a specific model
direct-transcriber batch /audio/ --model large-v3

# Include timestamps for reference
direct-transcriber batch /audio/ --timestamps

# Chunk output for better RAG performance
direct-transcriber batch /audio/ --chunk-size 500

# Save both markdown and JSON
direct-transcriber batch /audio/ --format both
```
Single File:
```bash
# Transcribe a single audio file
direct-transcriber single audio.mp3

# Transcribe a single video file (extracts audio automatically)
direct-transcriber single video.mp4

# With custom output
direct-transcriber single media.mp4 --output transcript.md

# JSON output
direct-transcriber single media.mp4 --format json

# RAG-optimized output with intelligent chunking
direct-transcriber single media.mp4 --rag-optimized --chunking-strategy semantic

# Fixed-size chunking for consistent chunk sizes
direct-transcriber single media.mp4 --rag-optimized --chunking-strategy fixed --chunk-size 1000

# Example: process a video segment with the medium model and RAG optimization
direct-transcriber single test_5min.mp4 --model medium --rag-optimized --chunking-strategy semantic --output test_5min_medium.md
```
- `--model, -m`: Whisper model (`tiny`, `base`, `small`, `medium`, `large-v3`)
- `--output-dir, -o`: Output directory for batch processing
- `--format, -f`: Output format (`md`, `json`, `both`)
- `--timestamps`: Include timestamps in the markdown output
- `--chunk-size`: Chunk size for RAG optimization, in characters
- `--rag-optimized`: Enable RAG-optimized output with intelligent chunking
- `--chunking-strategy`: Chunking strategy (`semantic`, `sentence`, `fixed`)
- `--workers, -w`: Number of parallel workers (auto-detected by default)
- `--yes, -y`: Skip confirmation prompts
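These flags compose the same way when driving the CLI from scripts. A minimal sketch using Python's standard `subprocess` module (the paths are placeholders):

```python
import subprocess

# Invoke the documented CLI programmatically; flags mirror the option
# list above. Paths here are placeholders, not real directories.
subprocess.run(
    [
        "direct-transcriber", "batch", "/audio/",
        "--output-dir", "/transcriptions/",
        "--model", "medium",
        "--format", "both",
        "--chunk-size", "500",
        "--yes",  # skip confirmation prompts in unattended runs
    ],
    check=True,  # raise if transcription exits with an error
)
```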
The tool automatically selects the best Whisper model based on available RAM:
| Model    | RAM Required | Quality | Speed    |
|----------|--------------|---------|----------|
| tiny     | 1 GB         | Lowest  | Fastest  |
| base     | 1.5 GB       | Good    | Fast     |
| small    | 2 GB         | Better  | Moderate |
| medium   | 4 GB         | High    | Slower   |
| large-v3 | 6 GB         | Best    | Slowest  |
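For intuition, RAM-based selection along these lines can be sketched with `psutil`. This is a minimal sketch that mirrors the table above; the tool's actual thresholds and logic may differ:

```python
import psutil

# Models ordered from largest to smallest, with the RAM each needs
# (values taken from the table above).
MODEL_RAM_GB = [
    ("large-v3", 6.0),
    ("medium", 4.0),
    ("small", 2.0),
    ("base", 1.5),
    ("tiny", 1.0),
]

def suggest_model() -> str:
    # Available (not total) RAM, converted from bytes to GiB.
    available_gb = psutil.virtual_memory().available / 1024**3
    for model, required in MODEL_RAM_GB:
        if available_gb >= required:
            return model
    return "tiny"  # fall back to the smallest model

print(suggest_model())
```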
Default markdown output:

```markdown
# Video Transcription

**File:** `/full/path/to/video/meeting.mp4`
**Duration:** 15:32 | **Model:** large-v3 | **Language:** en
**Source:** video | **Transcribed:** 2024-01-15 14:30:22

---

The speaker discusses the importance of machine learning in modern applications. They explain how neural networks can be trained to recognize patterns in data.

Another topic covered is the implementation of transformer models for natural language processing tasks.
```
With `--timestamps`:

```markdown
# Audio Transcription

**File:** `/full/path/to/audio/meeting.mp3`

## [00:00] - [02:15]

The speaker discusses the importance of machine learning in modern applications.

## [02:15] - [04:30]

They explain how neural networks can be trained to recognize patterns in data.
```
With chunking enabled (`--chunk-size`):

```markdown
# Audio Transcription

**File:** `/full/path/to/audio/meeting.mp3`

## Segment 1 (00:00-05:00)

[Content chunk optimized for semantic search]

## Segment 2 (05:00-10:00)

[Next semantic chunk]
```
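Because these headings follow a regular pattern, downstream tooling can split a timestamped transcript back into chunks before indexing. A minimal sketch, assuming the exact `## [MM:SS] - [MM:SS]` heading format shown above (`meeting.md` is a placeholder path):

```python
import re
from pathlib import Path

# Matches headings like "## [00:00] - [02:15]" from the example above.
HEADING = re.compile(r"^## \[(\d{2}:\d{2})\] - \[(\d{2}:\d{2})\]$")

def parse_chunks(path: str) -> list[dict]:
    chunks: list[dict] = []
    current = None
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        m = HEADING.match(line.strip())
        if m:
            # Start a new chunk at each timestamped heading.
            current = {"start": m.group(1), "end": m.group(2), "text": []}
            chunks.append(current)
        elif current is not None and line.strip():
            current["text"].append(line.strip())
    for c in chunks:
        c["text"] = " ".join(c["text"])
    return chunks

print(parse_chunks("meeting.md"))
```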
The Docker image uses a multi-stage build process to minimize size while maintaining full functionality:
- Base: Python 3.11-slim for minimal footprint
- Size: ~800MB (compared to 2GB+ for standard PyTorch images)
- Security: Runs as non-root user
- Optimization: Removes test files, caches, and unnecessary components
- `ghcr.io/norandom/direct-transriberr:latest` - Latest stable release
- `ghcr.io/norandom/direct-transriberr:main` - Latest development build
- `ghcr.io/norandom/direct-transriberr:v1.0.0` - Specific version tags
Create a `.env` file based on `.env.example`:
```bash
# External media directory (absolute path)
MEDIA_DIR=/path/to/your/media/files

# Output directory for transcriptions (absolute path)
OUTPUT_DIR=/path/to/your/transcriptions

# Model to use (optional, auto-detected if not specified)
WHISPER_MODEL=large-v3

# Memory limit for container (optional)
MEMORY_LIMIT=8G
```
The Docker setup supports several volume mounting options:
- Default directories: `./input`, `./output`, and `./models`
- Environment variables: use `MEDIA_DIR` and `OUTPUT_DIR` for external paths
- Single file mounting: mount specific files for transcription
- External script: use `scripts/transcribe-external.sh` for easy external file processing
- Persistent model storage: models are downloaded once to `./models` and reused
- CPU Optimization: Uses all available CPU cores minus one, leaving a core free for the system (see the sketch after this list)
- Memory Management: Monitors RAM usage and prevents OOM
- Batch Processing: Processes multiple files with progress tracking
- Format Support: Automatic audio format conversion via FFmpeg
- Docker Isolation: Containerized processing with resource limits
- Model Persistence: Whisper models are downloaded once and cached in `./models`
- Enhanced Progress: Detailed progress tracking with processing times
- CPU Optimized: FP16 warnings suppressed, CPU-specific optimizations
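The "cores minus one" heuristic mentioned above amounts to something like this one-liner (illustrative; the tool's actual auto-detection may differ):

```python
import os

# Leave one core free for the system; never drop below one worker.
# os.cpu_count() can return None, hence the fallback to 1.
workers = max(1, (os.cpu_count() or 1) - 1)
print(f"Using {workers} parallel workers")
```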
Direct Transcriber now includes advanced RAG (Retrieval-Augmented Generation) optimizations:
Semantic Chunking (Recommended):
- Breaks content at natural topic boundaries
- Detects discourse markers and transitions
- Preserves context and meaning
- Ideal for complex discussions and lectures
Sentence Chunking:
- Groups sentences into coherent chunks
- Respects sentence boundaries
- Good for clear, structured speech
- Maintains readability
Fixed-Size Chunking (sketched below):
- Consistent chunk sizes with smart overlap
- Predictable for downstream processing
- Good for batch processing workflows
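To make the strategies concrete, here is a minimal sketch of fixed-size chunking with overlap. It is illustrative only, not the tool's actual implementation:

```python
def fixed_chunks(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into ~size-character chunks that overlap by `overlap`
    characters, so context carries across chunk boundaries."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        # Prefer to break on a space so words stay intact.
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step back by `overlap` to duplicate a little trailing context.
        start = max(end - overlap, start + 1)
    return chunks
```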
- Keyword Extraction: Automatic identification of key terms (a toy illustration follows this list)
- Entity Recognition: Names, numbers, times, and proper nouns
- Topic Classification: Domain-specific topic identification
- Quality Scoring: Confidence-based chunk quality assessment
- Context Linking: Inter-chunk relationships and context
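As a toy illustration of the kind of pattern-based extraction involved (not the tool's actual pipeline, which is not shown here):

```python
import re

def extract_entities(text: str) -> dict:
    """Naive pattern-based extraction of the entity types listed above."""
    return {
        "numbers": re.findall(r"\b\d+(?:\.\d+)?\b", text),
        "times": re.findall(r"\b\d{1,2}:\d{2}\b", text),
        # Runs of consecutive capitalized words as candidate proper nouns.
        "proper_nouns": re.findall(r"\b(?:[A-Z][a-z]+\s?){2,}", text),
    }

print(extract_entities("At 14:30 we discussed Neural Networks and 3 datasets."))
```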
Usage:

```bash
# Enable RAG optimization
direct-transcriber batch /audio --rag-optimized

# Choose a chunking strategy
direct-transcriber batch /audio --rag-optimized --chunking-strategy semantic

# Custom chunk size
direct-transcriber batch /audio --rag-optimized --chunk-size 1500
```
Output Features:
- Structured markdown with semantic sections
- JSON sidecar files for programmatic access
- Cross-chunk context preservation
- Quality metrics and confidence scores
- Entity and keyword extraction
- Topic classification
Example RAG Output:
```markdown
# Audio Transcription (RAG Optimized)

**Chunks:** 15 | **Strategy:** SemanticChunking

## Segment 1 (00:00 - 02:30)

**ID:** `lecture_001` | **Quality:** 0.92 | **Topics:** technology, AI

The speaker discusses machine learning fundamentals...

🏷️ **Entities:** Neural Networks, Deep Learning, PyTorch
```
- Better Retrieval: Semantic chunks improve search relevance
- Context Preservation: Overlapping chunks maintain continuity
- Quality Filtering: Low-confidence segments are flagged
- Structured Data: JSON output enables programmatic processing (see the sketch below)
- Metadata Rich: Enhanced information for better indexing
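The JSON sidecar files make that programmatic processing straightforward. A minimal consumption sketch; the field names (`chunks`, `quality`, `text`) and the `0.8` threshold are assumptions for illustration, so inspect a real sidecar file for the actual schema:

```python
import json
from pathlib import Path

# Load a sidecar file and keep only high-confidence chunks.
# NOTE: "chunks", "quality", and "text" are assumed field names;
# "meeting.json" is a placeholder path.
data = json.loads(Path("meeting.json").read_text(encoding="utf-8"))
good = [c for c in data.get("chunks", []) if c.get("quality", 0) >= 0.8]
for chunk in good:
    print(chunk["text"][:80])  # preview the first 80 characters
```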
MIT License - see LICENSE file for details.