VideoCaptioner: AI-Powered Video Subtitling Assistant
Complete setup guide for VideoCaptioner - an LLM-powered intelligent subtitle assistant for video. Includes CLI and GUI installation, ASR engines (Bijian, Jianying, Whisper), LLM-based subtitle optimization, translation services, TTS dubbing, and video synthesis with customizable subtitle styling.
- Step 1
About VideoCaptioner
VideoCaptioner is an AI-powered video subtitling tool that handles the complete subtitle workflow: speech recognition (ASR), subtitle segmentation, LLM-based optimization, translation, and video synthesis. It supports multiple ASR engines including free cloud-based options (Bijian, Jianying) and local Whisper models. The tool features a modern PyQt5 GUI and a full-featured CLI for automation and scripting.
Key Features:
- Speech-to-text transcription via multiple ASR engines
- Semantic-based subtitle segmentation using LLM
- Context-aware translation with reflection mechanism
- Voice dubbing (TTS) with Edge, SiliconFlow, Gemini
- Video synthesis with customizable subtitle styling
- Free cloud services for ASR and translation (no API key needed)
- Cross-platform: Windows, macOS, Linux

- Step 2
System Requirements
VideoCaptioner requires Python 3.10+ and FFmpeg for video processing. The free cloud-based features work with minimal requirements. Local Whisper models require additional disk space (1-3GB) and benefit from GPU acceleration. The PyQt5 GUI runs on all major platforms, and pre-compiled binaries are available for Windows.
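As a minimal sketch of the Python 3.10+ requirement, the helper below compares a version string against that floor. The function name `python_version_ok` is hypothetical and is not part of VideoCaptioner; it only illustrates how a setup script might check the requirement before installing.

```shell
# Succeeds (exit status 0) when the given version string is 3.10 or newer.
# Hypothetical helper for illustration; not part of VideoCaptioner.
python_version_ok() {
    version=$1
    major=${version%%.*}   # text before the first dot
    rest=${version#*.}     # text after the first dot
    minor=${rest%%.*}      # text before the next dot
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
}

# Example: check the interpreter on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
    detected=$(python3 -c 'import platform; print(platform.python_version())')
    if python_version_ok "$detected"; then
        echo "Python $detected meets the 3.10+ requirement"
    else
        echo "Python $detected is too old; 3.10+ is required"
    fi
fi
```

The comparison is done on the major and minor numbers rather than on the raw string, because a plain string comparison would rank 3.9 above 3.10.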
# Check Python version (3.10 or higher required)
python --version
python3 --version

# Check FFmpeg installation (required for video processing)
ffmpeg -version
ffprobe -version

# If FFmpeg is not installed:
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
# macOS (with Homebrew):
brew install ffmpeg
# Windows (with Chocolatey):
choco install ffmpeg
# Windows (with Scoop):
scoop install ffmpeg

# Verify installation
ffmpeg -version | head -1
ffprobe -version | head -1

⚠ Heads up: FFmpeg is a hard requirement. Without it, video processing features will fail. The 'doctor' command can diagnose missing dependencies: `videocaptioner doctor`

- Step 3
Installation via pip (Recommended)
The easiest way to install VideoCaptioner is via pip. This installs both the CLI and GUI components. The package handles all Python dependencies automatically. No additional configuration is needed for free features (Bijian ASR, Bing/Google translation).
# Install VideoCaptioner (CLI + GUI)
pip install videocaptioner

# Verify installation
videocaptioner --version
videocaptioner --help

# Launch GUI (three equivalent ways)
videocaptioner
videocaptioner gui
videocaptioner-gui

# Run CLI help to see all commands
videocaptioner --help

- Step 4
Installation via Windows Package
Windows users can download pre-built installers from GitHub Releases. This avoids the need to install Python separately and provides a traditional Windows installation experience with Start menu integration.
1. Visit: https://github.com/WEIFENG2333/VideoCaptioner/releases
2. Download the latest Windows installer (e.g., videocaptioner-{version}-windows-x64.exe)
3. Run the installer and follow the setup wizard
4. Launch from Start Menu or desktop shortcut
5. The application includes Python runtime and all dependencies

⚠ Heads up: The Windows installer may trigger a SmartScreen warning, which is common for unsigned or newly published installers. Download only from the official GitHub releases page.

- Step 5
Installation via macOS Script
macOS users can use the one-line installation script which handles dependencies and configuration automatically. This is the recommended approach for macOS users who want a quick setup.
# One-line macOS installation
curl -fsSL https://raw.githubusercontent.com/WEIFENG2333/VideoCaptioner/master/scripts/run.sh | bash

# Alternative: Install via pip with platform-specific optimizations
pip install videocaptioner

# macOS security: Remove quarantine attribute if app is blocked
# (the path is quoted so it still works if it contains spaces)
xattr -d com.apple.quarantine "$(which videocaptioner-gui)" 2>/dev/null || true

# Launch the GUI
videocaptioner-gui

⚠ Heads up: macOS may block the application due to security settings. If the app doesn't launch, go to System Settings → Privacy & Security → click 'Open Anyway' next to the VideoCaptioner warning.

- Step 6
Development Setup
For development, clone the repository and use uv (Python package manager) for dependency management. The project uses uv for fast, reproducible builds with proper virtual environment isolation.
# Clone the repository
git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies and create virtual environment
uv sync

# Run GUI
uv run videocaptioner

# Run CLI help
uv run videocaptioner --help

# Type checking
uv run pyright

# Run tests
uv run pytest tests/test_cli/ -q

# Lint with ruff
uv run ruff check videocaptioner

- Step 7
Configuration Fundamentals
VideoCaptioner uses a layered configuration system with priority: CLI arguments > Environment variables > Config file > Defaults. The configuration file is located at ~/.config/videocaptioner/config.toml on Linux/macOS. Run 'config init' for interactive setup or 'doctor' to diagnose issues.
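The priority order can be pictured as "first non-empty value wins". The sketch below is only an illustration of that rule: `resolve_setting` is a hypothetical helper, the sample values are made up, and VideoCaptioner's actual resolver is not shown in this guide.

```shell
# Print the first non-empty argument, mirroring the documented priority:
# CLI argument > environment variable > config file value > default.
# Hypothetical helper; not VideoCaptioner's real implementation.
resolve_setting() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # every layer was empty
}

# Example with made-up values: no CLI flag given, so the environment
# variable (if set) wins, then the config file value, then the default.
cli_model=""
file_model="model-from-config-file"
resolve_setting "$cli_model" "${OPENAI_MODEL:-}" "$file_model" "built-in-default"
```

Passing the layers in priority order keeps the rule in one place: adding another layer only means adding another argument.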
# Interactive configuration setup
videocaptioner config init

# Non-interactive setup with profile (for CI/automated environments)
videocaptioner config init --non-interactive --profile dubbing

# View current configuration
videocaptioner config show

# Get specific config value
videocaptioner config get llm.api_key

# Set configuration values
videocaptioner config set llm.api_key sk-your-api-key-here
videocaptioner config set llm.api_base https://api.openai.com/v1
videocaptioner config set llm.model gpt-4o-mini

# Find config file location
videocaptioner config path

# Diagnose environment and dependencies
videocaptioner doctor
videocaptioner doctor --json  # JSON output for scripting

⚠ Heads up: API keys should never be committed to version control. Use environment variables or secure config storage. The --json flag for doctor is useful for CI/CD diagnostics.

- Step 8
Configuration File Format
The configuration file uses TOML format with sections for llm, transcribe, subtitle, translate, and dubbing. Each section contains provider-specific settings. Edit the config file directly or use the config CLI commands to modify settings.
# Config file: ~/.config/videocaptioner/config.toml

# LLM settings (for subtitle optimization and translation)
[llm]
api_key = "sk-your-api-key"
api_base = "https://api.openai.com/v1"
model = "gpt-4o-mini"

# Transcription settings
[transcribe]
asr = "bijian"       # Options: bijian, jianying, whisper-api, whisper-cpp, faster-whisper
language = "auto"    # Auto-detect or specify: zh, en, ja, etc.

# Subtitle processing
[subtitle]
optimize = true      # LLM-based text correction
split = true         # Semantic-based segmentation

# Translation settings
[translate]
service = "bing"     # Options: bing (free), google (free), llm (paid)

# Dubbing/TTS settings
[dubbing]
preset = "edge-cn-female"   # Edge TTS preset
api_key = ""                # For SiliconFlow/Gemini
voice = "xiaoxiao"
timing = "balanced"         # Options: strict, balanced, natural, none
audio_mode = "replace"      # Options: replace, mix, duck
tts_workers = 5             # Concurrent TTS workers

- Step 9
Quick Start: Speech-to-Text Transcription
The free Bijian ASR engine (requires network, no API key) provides high-quality transcription for Chinese and English audio. Simply run the transcribe command with your video or audio file. The output is an SRT subtitle file by default.
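Because the free engines in this step cover only Chinese and English, a script that handles mixed-language material needs to choose an engine per file. The sketch below shows one way to do that; `pick_asr` is a hypothetical helper, and falling back to `whisper-cpp` for every other language (including `auto`) is an assumption made here for illustration. The commands are printed rather than executed.

```shell
# Choose an ASR engine from a language code.
# Hypothetical helper: bijian for Chinese/English, whisper-cpp otherwise.
pick_asr() {
    case $1 in
        zh|zh-*|en|en-*) echo bijian ;;
        *)               echo whisper-cpp ;;   # assumed fallback for other languages
    esac
}

# Dry run: print the transcribe command that would be used per language.
for code in zh en ja; do
    printf 'videocaptioner transcribe video.mp4 --asr %s --language %s\n' \
        "$(pick_asr "$code")" "$code"
done
```

Swap `whisper-cpp` for `whisper-api` in the fallback branch if a hosted model is preferred over a local one.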
# Transcribe video using free Bijian ASR (Chinese/English)
videocaptioner transcribe video.mp4 --asr bijian

# Transcribe with explicit language
videocaptioner transcribe video.mp4 --asr bijian --language zh
videocaptioner transcribe video.mp4 --asr bijian --language en

# Transcribe audio files directly
videocaptioner transcribe audio.mp3 --asr bijian

# Output to custom path
videocaptioner transcribe video.mp4 --asr bijian -o output_dir/

# Generate word-level timestamps (for advanced processing)
videocaptioner transcribe video.mp4 --asr bijian --word-timestamps

# Output in different formats
videocaptioner transcribe video.mp4 --asr bijian --format json
videocaptioner transcribe video.mp4 --asr bijian --format ass  # Advanced SubStation Alpha (ASS)

⚠ Heads up: Bijian and Jianying ASR engines only support Chinese and English. For other languages, use whisper-api (requires API key) or whisper-cpp (local model).

- Step 10
Subtitle Translation (Free Services)
VideoCaptioner provides free translation via Bing and Google (no API key required). The subtitle command can optimize existing subtitles, segment them semantically, and translate them into a target language. More than 100 languages are supported, identified by BCP 47 codes.
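When one subtitle file has to be translated into several languages, the commands differ only in the language code. The sketch below prints one command per target language as a dry run; `translate_commands` is a hypothetical helper, and only the `--translator` and `--target-language` flags used in this step are assumed.

```shell
# Print one translate command per target language (dry run, nothing is executed).
# Hypothetical helper; file names containing spaces would need extra quoting.
translate_commands() {
    input=$1
    shift
    for target in "$@"; do
        printf 'videocaptioner subtitle %s --translator bing --target-language %s\n' \
            "$input" "$target"
    done
}

# Example: English, Japanese and Korean from the same source file.
translate_commands input.srt en ja ko
```

Printing the commands first makes it easy to review them before running anything against a rate-limited free service.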
# Translate subtitle to English using free Bing translation
videocaptioner subtitle input.srt --translator bing --target-language en

# Translate to Japanese
videocaptioner subtitle input.srt --translator bing --target-language ja

# Translate to Korean
videocaptioner subtitle input.srt --translator bing --target-language ko

# Translate to Spanish
videocaptioner subtitle input.srt --translator bing --target-language es

# Use Google translation (also free)
videocaptioner subtitle input.srt --translator google --target-language fr

# Create bilingual subtitles (Chinese source + English target)
videocaptioner subtitle input.srt --translator bing --target-language en --layout target-above

# Optimize without translation (LLM-based correction)
videocaptioner subtitle input.srt --no-translate --api-key $OPENAI_API_KEY

# Skip optimization, only translate
videocaptioner subtitle input.srt --translator bing --target-language en --no-optimize

# List of common language codes:
# zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese)
# en (English), ja (Japanese), ko (Korean)
# fr (French), de (German), es (Spanish)
# ru (Russian), ar (Arabic), pt (Portuguese)

- Step 11
LLM-Based Subtitle Translation
For higher quality results, use LLM-based optimization to correct ASR errors, improve punctuation, and enhance readability. LLM translation adds an optional reflection pass that makes better use of context. Both require an API key for an OpenAI-compatible endpoint.
# Configure LLM API (OpenAI or compatible)
videocaptioner config set llm.api_key $OPENAI_API_KEY
videocaptioner config set llm.api_base https://api.openai.com/v1
videocaptioner config set llm.model gpt-4o-mini

# Or use environment variables
export OPENAI_API_KEY="sk-your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_MODEL="gpt-4o-mini"

# Optimize subtitles with LLM (correct ASR errors)
videocaptioner subtitle input.srt --api-key $OPENAI_API_KEY

# Translate with LLM (higher quality, context-aware)
videocaptioner subtitle input.srt --translator llm --target-language en --api-key $OPENAI_API_KEY

# Enable reflection-based translation (higher quality, slower)
videocaptioner subtitle input.srt --translator llm --target-language en --reflect --api-key $OPENAI_API_KEY

# Custom prompt for optimization
videocaptioner subtitle input.srt --prompt "Make subtitles more formal and professional" --api-key $OPENAI_API_KEY

# Alternative LLM providers (OpenAI-compatible):
# - SiliconCloud: https://cloud.siliconflow.cn
# - DeepSeek: https://platform.deepseek.com
# - Local: Ollama with OpenAI compatibility layer

⚠ Heads up: LLM-based features incur API costs. gpt-4o-mini is cost-effective. Reflection mode uses more tokens for better quality but is slower and more expensive.

- Step 12
Video Synthesis: Burn Subtitles into Video
The synthesize command adds subtitles to a video either as hard subtitles (burned permanently into the frames) or as soft subtitles (a separate track that can be switched off in the player). VideoCaptioner supports two rendering modes: ASS (traditional with outline/shadow) and rounded (modern with rounded background). Custom subtitle styles are available via the style command.
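Style overrides are passed as a JSON string, which is awkward to quote by hand in a script. The sketch below builds that string from two values; `style_override_json` is a hypothetical helper, and the only keys it assumes are `outline_color` and `font_size`, both taken from the examples in this step.

```shell
# Build a JSON fragment for --style-override from a color and a font size.
# Hypothetical helper; it does not validate the color value.
style_override_json() {
    printf '{"outline_color": "%s", "font_size": %d}\n' "$1" "$2"
}

# Example: print the fragment for a red outline at size 48.
style_override_json '#ff0000' 48

# Possible use (commented out so the sketch runs without VideoCaptioner):
# videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard \
#   --style-override "$(style_override_json '#ff0000' 48)"
```

Generating the JSON in one place avoids mismatched quotes when the values come from variables.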
# Burn subtitles as hard-coded (visible in video)
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard

# Add removable soft subtitles (track embedded)
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode soft

# High quality output
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard --quality ultra

# Use preset style (anime style)
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard --style anime

# Custom subtitle styling via JSON override
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard \
  --style-override '{"outline_color": "#ff0000", "font_size": 48}'

# Rounded background mode (modern look)
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard --render-mode rounded

# Custom rounded style with white text and semi-transparent background
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard \
  --render-mode rounded \
  --style-override '{"text_color": "#ffffff", "bg_color": "#ff000099", "corner_radius": 12}'

# View all available styles
videocaptioner style

# Output to custom path
videocaptioner synthesize video.mp4 -s subtitle.srt --subtitle-mode hard -o output.mp4

- Step 13
Voice Dubbing (TTS)
Generate dubbed audio or video from subtitles using various TTS services. Edge TTS is free and works without API key. SiliconFlow CosyVoice and Gemini TTS offer higher quality with voice cloning capabilities. Supports multi-speaker attribution and voice mapping.
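Multi-speaker dubbing relies on speaker tags in the subtitle text, written either as `[Alice] ...` or as `Bob: ...`. The sketch below extracts the speaker name from a single subtitle line so a script can list the speakers that need a voice mapping. `speaker_of` is a hypothetical helper, and its matching rules are an assumption; VideoCaptioner's own parser may accept more or fewer forms.

```shell
# Print the speaker name from a subtitle line, or nothing if it has no tag.
# Hypothetical helper; the patterns are assumptions based on the two
# documented forms "[Name] text" and "Name: text".
speaker_of() {
    printf '%s\n' "$1" | sed -n \
        -e 's/^\[\([^]]*\)\].*/\1/p' \
        -e t \
        -e 's/^\([A-Za-z][A-Za-z0-9_]*\): .*/\1/p'
}

# Example with the two documented forms.
speaker_of '[Alice] Hello, this is Alice speaking.'
speaker_of 'Bob: This line uses another voice.'
```

Each name found this way would then need its own `--speaker-voice Name=voice` argument on the dub command.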
# Generate audio using Edge TTS (free, no API key)
videocaptioner dub subtitle.srt --preset edge-cn-female -o dub.wav

# Chinese female voice (Edge TTS)
videocaptioner dub input.srt --preset edge-cn-female -o output.wav

# English friendly voice (Gemini, requires API key)
videocaptioner dub input.srt --preset gemini-en-friendly \
  --tts-api-key $VIDEOCAPTIONER_TTS_API_KEY -o output.wav

# SiliconFlow CosyVoice2
videocaptioner dub input.srt --preset siliconflow-cn-female \
  --tts-api-key $VIDEOCAPTIONER_TTS_API_KEY -o output.wav

# Multi-speaker with voice mapping
videocaptioner dub input.srt --video video.mp4 \
  --speaker-voice Alice=anna \
  --speaker-voice Bob=benjamin \
  -o video_dubbed.mp4

# Speaker syntax in subtitle file:
# [Alice] Hello, this is Alice speaking.
# Bob: This line uses another voice.

# Voice cloning with SiliconFlow
# (the argument is quoted: an unquoted | would start a shell pipeline)
videocaptioner dub input.srt \
  --speaker-clone 'Alice=reference_audio.mp3|This is the reference text' \
  --tts-api-key $VIDEOCAPTIONER_TTS_API_KEY \
  -o output.mp4

# Timing strategies
videocaptioner dub input.srt --preset edge-cn-female --timing strict -o output.wav   # Match subtitle timing
videocaptioner dub input.srt --preset edge-cn-female --timing natural -o output.wav  # Natural speech speed

# Audio output modes when embedding in video
videocaptioner dub input.srt --video video.mp4 --audio-mode replace -o output.mp4  # Replace original audio
videocaptioner dub input.srt --video video.mp4 --audio-mode mix -o output.mp4      # Mix with original
videocaptioner dub input.srt --video video.mp4 --audio-mode duck -o output.mp4     # Lower original as background

⚠ Heads up: Edge TTS requires network access and may have regional restrictions. Voice cloning with SiliconFlow requires API key and reference audio samples.

- Step 14
Full Pipeline: End-to-End Processing
The process command automates the complete workflow: video → transcription → optimization → translation → dubbing → synthesis. This is the most powerful command for automated video localization. All intermediate files are generated in the output directory.
# Full pipeline with free services (ASR + Translation)
videocaptioner process video.mp4 --asr bijian --translator bing --target-language en

# Full pipeline including dubbing (generate dubbed video)
videocaptioner process video.mp4 --asr bijian --translator bing --target-language ja --dub-only

# Full pipeline with Edge TTS dubbing
videocaptioner process video.mp4 \
  --asr bijian \
  --translator bing \
  --target-language zh-Hans \
  --dub-only \
  --timing strict

# Chinese video → English dubbed video with Gemini TTS
videocaptioner process video.mp4 \
  --translator bing \
  --target-language en \
  --dub-only \
  --preset gemini-en-friendly \
  --tts-api-key $VIDEOCAPTIONER_TTS_API_KEY

# Full pipeline with LLM optimization
videocaptioner process video.mp4 \
  --asr whisper-api \
  --whisper-api-key $WHISPER_API_KEY \
  --translator llm \
  --target-language fr \
  --api-key $OPENAI_API_KEY \
  -v  # Verbose output

# Output to custom directory
videocaptioner process video.mp4 --asr bijian --translator bing --to en -o ./output/

# Quiet mode for scripting
videocaptioner process video.mp4 --asr bijian --translator bing --to en -q

- Step 15
Download Online Videos
VideoCaptioner integrates yt-dlp for downloading videos from YouTube, Bilibili, and many other platforms. This is useful for processing online content directly without manual downloading.
# Download from YouTube
videocaptioner download "https://youtube.com/watch?v=xxx"

# Download to specific directory
videocaptioner download "https://youtube.com/watch?v=xxx" -o ./downloads/

# Download from Bilibili
videocaptioner download "https://bilibili.com/video/BVxxxxx"

# Then process the downloaded video
videocaptioner process downloaded_video.mp4 --asr bijian --translator bing --to en

- Step 16
GUI Desktop Application
The PyQt5-based GUI provides a user-friendly interface for all VideoCaptioner features. It includes visual progress tracking, drag-and-drop file upload, and integrated settings management. The GUI automatically opens when running videocaptioner without arguments.
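For scripted use outside the GUI, it can help to sort input files by type before handing them to a command. The sketch below classifies a file name using the supported formats listed in this step; `media_kind` is a hypothetical helper and looks only at the file extension, not at the file contents.

```shell
# Classify a file by extension using the formats listed in this step.
# Hypothetical helper; prints video, audio, subtitle or unsupported.
media_kind() {
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case $ext in
        mp4|mkv|avi|mov|webm) echo video ;;
        mp3|wav|flac|m4a)     echo audio ;;
        srt|vtt|ass|sub)      echo subtitle ;;
        *)                    echo unsupported ;;
    esac
}

# Example: mixed-case extensions are handled the same way.
for name in talk.MP4 voice.flac captions.srt notes.txt; do
    printf '%s: %s\n' "$name" "$(media_kind "$name")"
done
```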
Launch Methods:
- videocaptioner (opens GUI by default)
- videocaptioner gui (explicit GUI launch)
- videocaptioner-gui (GUI-only command)

GUI Features:
- Drag-and-drop video files
- Visual progress bars for each processing stage
- Configurable ASR/translation/TTS settings
- Built-in subtitle editor
- Preview rendered subtitles
- Batch processing queue
- Project management

Settings Location in GUI:
- File: Settings → API Configuration
- Or use CLI: videocaptioner config init

Supported File Formats:
- Video: MP4, MKV, AVI, MOV, WEBM
- Audio: MP3, WAV, FLAC, M4A
- Subtitle: SRT, VTT, ASS, SUB

- Step 17
Troubleshooting Common Issues
The doctor command diagnoses most issues. Common problems include missing FFmpeg, API key misconfiguration, and macOS security blocking. Check the application logs and use verbose mode for detailed diagnostics.
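The exit codes documented in this step (0 through 5) are easier to act on in scripts when they are turned into messages. The sketch below does that mapping; `describe_exit_code` is a hypothetical helper, and the meanings are copied from the exit-code note in this step rather than from the tool's source.

```shell
# Translate a VideoCaptioner exit code into the meaning documented in this step.
# Hypothetical helper for use in wrapper scripts.
describe_exit_code() {
    case $1 in
        0) echo "success" ;;
        1) echo "general error" ;;
        2) echo "config error" ;;
        3) echo "file not found" ;;
        4) echo "missing dependency" ;;
        5) echo "runtime error" ;;
        *) echo "unknown exit code $1" ;;
    esac
}

# Possible use (commented out so the sketch runs without VideoCaptioner):
# videocaptioner process video.mp4 --asr bijian --translator bing --to en -q
# status=$?
# echo "videocaptioner finished: $(describe_exit_code "$status")"
```

A wrapper could, for example, run `videocaptioner doctor` automatically when it sees code 4.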
# Run diagnostics
videocaptioner doctor
videocaptioner doctor --json  # For scripting/CI

# Check if FFmpeg is properly installed
which ffmpeg
ffmpeg -version

# Test API connectivity (OpenAI)
curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

# View verbose logs
videocaptioner transcribe video.mp4 --asr bijian -v

# macOS security issues
xattr -d com.apple.quarantine /path/to/videocaptioner

# Port already in use (GUI)
videocaptioner config set server.port 8889

# Disk space for Whisper models
df -h  # Check available space

# Clear cache if models are corrupted
rm -rf ~/.cache/whisper

# Python version issues
python --version      # Must be 3.10-3.12
python3.11 --version  # Try specific version

# Virtual environment conflicts
which python
which pip
python -m pip list | grep videocaptioner

⚠ Heads up: Exit codes indicate error types: 0=success, 1=general error, 2=config error, 3=file not found, 4=missing dependency, 5=runtime error.

- Step 18
Performance Optimization
Optimize processing speed by choosing appropriate ASR engines, enabling GPU acceleration for local Whisper models, and adjusting TTS worker concurrency. Use smaller LLM models for faster turnaround when quality requirements allow.
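For unattended batch runs, it is useful to keep going after a failure and report which files need attention. The sketch below wraps any per-file command in such a loop; `run_batch` and `process_one` are hypothetical names, and the flags inside `process_one` are the ones used for batch processing in this step.

```shell
# run_batch COMMAND FILE...
# Runs COMMAND once per file, keeps going on failure, and returns non-zero
# if any file failed. Hypothetical helper for illustration.
run_batch() {
    cmd=$1
    shift
    failed=0
    for f in "$@"; do
        if ! "$cmd" "$f"; then
            echo "failed: $f" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $# file(s) failed" >&2
    [ "$failed" -eq 0 ]
}

# Per-file command; only defined here, not executed.
process_one() {
    videocaptioner process "$1" --asr bijian --translator bing --to en -q
}

# Possible use: run_batch process_one *.mp4
```

Keeping the per-file command separate from the loop means the same loop can drive `transcribe`, `subtitle`, or `dub` without changes.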
# ASR speed comparison (fastest to slowest):
# Bijian (cloud) - Fast, free, Chinese/English only
# Jianying (cloud) - Fast, free, Chinese/English only
# Whisper API - Requires payment, all languages
# Faster-Whisper (local) - GPU accelerated, requires VRAM
# Whisper-CPP (local) - CPU friendly, slower

# GPU acceleration for local Whisper
# NVIDIA GPU users:
nvidia-smi  # Verify GPU detection
# Install CUDA-enabled pytorch for faster processing

# Adjust TTS concurrency for faster dubbing
videocaptioner config set dubbing.tts_workers 10  # Default is 5

# Use cost-effective LLM models
videocaptioner config set llm.model gpt-4o-mini  # Cheaper than gpt-4

# Batch processing (process multiple videos)
for video in *.mp4; do
  videocaptioner process "$video" --asr bijian --translator bing --to en -q
done

# Parallel processing with xargs
# (NUL-delimited input so file names with quotes or backslashes are passed intact)
printf '%s\0' *.mp4 | xargs -0 -P 4 -I {} videocaptioner process {} --asr bijian --translator bing --to en -q

- Step 19
Environment Variables Reference
Environment variables provide secure configuration for API keys and override default settings. This is the recommended approach for CI/CD pipelines and automated workflows. Variables are checked before config file values.
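In CI it is worth failing early when a required variable is missing instead of partway through a long run. The sketch below lists which of the named variables are unset or empty; `missing_vars` is a hypothetical helper, and which variables count as required depends on the services in use.

```shell
# Print the names of the given environment variables that are unset or empty.
# Hypothetical helper; pass only trusted variable names, since eval is used
# for the indirect lookup in plain POSIX sh.
missing_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Example: report what an LLM-plus-paid-TTS run would still need.
missing=$(missing_vars OPENAI_API_KEY VIDEOCAPTIONER_TTS_API_KEY)
if [ -n "$missing" ]; then
    echo "Missing environment variables:"
    echo "$missing"
else
    echo "All required environment variables are set"
fi
```

The helper prints only names, never values, so its output is safe to show in CI logs.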
# LLM Configuration
export OPENAI_API_KEY="sk-your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_MODEL="gpt-4o-mini"

# TTS/Dubbing Configuration
export VIDEOCAPTIONER_DUB_PRESET="edge-cn-female"
export VIDEOCAPTIONER_TTS_API_KEY="your-tts-api-key"
export VIDEOCAPTIONER_TTS_API_BASE="https://api.example.com"
export VIDEOCAPTIONER_TTS_MODEL="cosyvoice2"
export VIDEOCAPTIONER_TTS_VOICE="xiaoxiao"
export VIDEOCAPTIONER_TTS_WORKERS=5
export VIDEOCAPTIONER_DUB_TIMING="balanced"
export VIDEOCAPTIONER_DUB_AUDIO_MODE="replace"
export VIDEOCAPTIONER_TTS_MAX_SPEED=1.5
export VIDEOCAPTIONER_TTS_REWRITE_TOO_LONG=true

# Example: Run with all env vars
export OPENAI_API_KEY="sk-xxx"
export VIDEOCAPTIONER_TTS_API_KEY="your-key"
videocaptioner process video.mp4 --asr bijian --translator bing --to en

- Step 20
API Provider Options
VideoCaptioner supports multiple providers for each service stage. Cloud-based services require network access but no local compute. LLM services must be OpenAI API-compatible. TTS services vary in quality and pricing.
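A script that mixes free and paid providers can check up front whether a given choice needs an API key. The sketch below encodes the free/paid split described in this guide; `needs_api_key` is a hypothetical helper, and matching TTS presets by their `edge-`, `gemini-` and `siliconflow-` prefixes is an assumption based on the preset names used in the dubbing examples.

```shell
# Exit status 0 if the provider needs an API key, 1 if it does not,
# 2 if the name is not recognised. Hypothetical helper.
needs_api_key() {
    case $1 in
        bijian|jianying|whisper-cpp|faster-whisper|bing|google|edge-*) return 1 ;;
        whisper-api|llm|gemini-*|siliconflow-*) return 0 ;;
        *) echo "unknown provider: $1" >&2; return 2 ;;
    esac
}

# Example: summarise a planned pipeline.
for provider in bijian llm edge-cn-female; do
    if needs_api_key "$provider"; then
        echo "$provider: API key required"
    else
        echo "$provider: no API key needed"
    fi
done
```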
ASR (Speech Recognition) Providers:
- Bijian (必剪): FREE, cloud-based, Chinese/English, no API key
- Jianying (剪映): FREE, cloud-based, Chinese/English, no API key
- Whisper API: Paid (OpenAI), all languages, highest accuracy
- Faster-Whisper: Local models, offline, requires GPU for speed
- Whisper-CPP: Local models, CPU-friendly, slower

Translation Providers:
- Bing: FREE, good quality, all languages
- Google: FREE, good quality, all languages
- LLM: Paid (OpenAI-compatible), contextual, highest quality

TTS (Text-to-Speech) Providers:
- Edge TTS: FREE, Microsoft Azure, good quality voices
- SiliconFlow CosyVoice2: Paid, Chinese-focused, voice cloning
- Gemini TTS: Paid, Google, natural sounding
- Coqui TTS: Self-hosted, open-source alternatives

LLM Providers (for optimization/translation):
- OpenAI: gpt-4, gpt-4o-mini, gpt-3.5-turbo
- DeepSeek: deepseek-chat (cost-effective)
- SiliconCloud: Multiple model options
- Local: Ollama, LM Studio with OpenAI compatibility

- Step 21
Project Structure and Resources
VideoCaptioner is built with Python using modern tooling. The project structure separates CLI, UI, and core processing modules. Resources include subtitle styles, translations, and assets bundled with the package.
Tech Stack:
- Core: Python 3.10+ with type hints
- CLI: Click-based command-line interface
- GUI: PyQt5 + PyQt-Fluent-Widgets
- Audio Processing: pydub (FFmpeg wrapper)
- Video Processing: FFmpeg/FFprobe
- TTS: edge-tts, SiliconFlow API, Gemini API
- LLM: openai SDK (100% compatible with OpenAI API)
- Download: yt-dlp (supports 1000+ sites)
- Config: TOML files with platformdirs
- Package Manager: uv (modern, fast)
- Build: hatchling with hatch-vcs
- Testing: pytest with markers for integration/LLM tests
- Linting: Ruff + Pyright

Project Structure:
videocaptioner/
├── cli/          # Command-line interface
├── ui/           # PyQt5 GUI
├── core/         # Processing logic
├── asr/          # Speech recognition
├── subtitle/     # Subtitle processing
├── translate/    # Translation services
├── dubbing/      # TTS and voice generation
├── resource/     # Styles, translations, assets
└── tests/        # Unit and integration tests

Key Dependencies:
- requests: HTTP client
- openai: LLM integration
- yt-dlp: Video downloading
- pydub: Audio manipulation
- PyQt5: GUI framework
- platformdirs: Cross-platform paths
- tenacity: Retry logic
- pillow: Image processing
- fonttools: Font manipulation

- Step 22
Links and Resources
Official documentation, GitHub repository, online documentation, and community resources are available for further reference and support.
Official Resources:
- GitHub: https://github.com/WEIFENG2333/VideoCaptioner
- Documentation: https://weifeng2333.github.io/VideoCaptioner/
- Online Demo: https://www.videocaptioner.cn
- Releases: https://github.com/WEIFENG2333/VideoCaptioner/releases

Related Technologies:
- Whisper (OpenAI): https://github.com/openai/whisper
- Faster-Whisper: https://github.com/guillaumekln/faster-whisper
- Edge TTS: Microsoft Azure TTS
- yt-dlp: https://github.com/yt-dlp/yt-dlp
- PyQt5: https://www.riverbankcomputing.com/software/pyqt
- FFmpeg: https://ffmpeg.org

Community:
- Issues: https://github.com/WEIFENG2333/VideoCaptioner/issues
- Chinese Community: QQ Group (see GitHub README)
- License: GPL-3.0