AI-powered instruction-driven multimodal video clip extraction
Audio • Visual • Speech • LLM Reasoning
⚡ Quick Start • 🤔 Why Clipz? • ✨ Features • 📚 API • 🚀 Roadmap
Many existing clip tools already do a great job with:
- audio-based excitement detection
- visual motion & scene analysis
- even basic LLM-assisted highlight detection
Clipz goes one step further — it’s instruction-driven.
Instead of passively finding “hot moments,” you tell the system what you want:
- “Extract the funniest moments”
- “Only clip Speaker B”
- “Find emotionally intense reactions”
An LLM interprets your intent and grounds it using:
✔ audio cues & prosody
✔ visual signals & scene context
✔ sentence-aware transcription
So clips aren’t just popular — they’re exactly aligned with your instruction.
👉 New to this project? Check out the Quick Start Guide for installation and first run in 5 minutes!
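If you just want to see it run, the short version looks like this (assuming you have cloned the repository and added an `OPENROUTER_API_KEY` to a `.env` file, as described in Troubleshooting below):

```bash
# Install dependencies, then run on any local video with the default query
pip install -r requirements.txt
python main.py path/to/video.mp4 --query "give me 10 interesting clips"
```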
- 🎵 Multi-Scale Audio Analysis: Detects excitement through loudness, spectral novelty, rhythm, prosody, and semantic events (laughter, applause)
- 🎬 Advanced Visual Analysis: Tracks motion, semantic surprise (CLIP), composition quality, shot boundaries, and face detection
- 🗣️ Speech Transcription: Word-level timestamps using Whisper, respects sentence boundaries for natural clips
- 🤖 LLM-Powered Intelligence: Instruction-driven clip extraction with semantic merging and context-aware ranking
- ⚡ Parallel Processing: Multi-threaded feature extraction with automatic caching
- 🎯 Smart Clip Boundaries: Never cuts mid-sentence, aligns to natural speech segments
```python
from main import ViralClipExtractor

# Initialize with custom weights
extractor = ViralClipExtractor(
    audio_weight=0.5,     # 0-1, weight for audio excitement
    video_weight=0.5,     # 0-1, weight for visual excitement
    use_cache=True,       # Cache features for faster re-runs
    output_dir="output"   # Output directory
)

# Process video end-to-end
results = extractor.process(
    video_path="video.mp4",
    user_query="give me 10 interesting clips",  # Natural language query
    target_fps=2,      # Video analysis FPS (lower = faster)
    min_duration=5,    # Minimum clip length (seconds)
    max_duration=60,   # Maximum clip length (seconds)
    export=True        # Export video files
)

# Access results
for clip in results["clips"]:
    print(f"Time: {clip['start']:.1f}s - {clip['end']:.1f}s")
    print(f"Transcript: {clip['transcript']}")
    print(f"Score: {clip['llm_interest_score']}/10")
    print(f"Reason: {clip['reason']}")
    print(f"Tags: {clip['tags']}")
```

Features Extracted:
- Multi-scale loudness (short/long-term RMS)
- Spectral novelty via MFCC delta
- Rhythm variance and onset strength
- Silence contrast and dramatic pauses
- Structural boundaries (change-point detection)
- Semantic events (laughter, applause, cheering) via YAMNet
```python
from Audio.audio import ClipAudio

detector = ClipAudio(sr=16000)  # 16 kHz, optimized for speed
timestamps, scores = detector.compute_audio_scores(
    audio_path="audio.wav",
    use_cache=True
)
```

Features Extracted:
- Optical flow motion magnitude
- CLIP semantic surprise detection
- Composition scoring (rule of thirds)
- Shot boundary detection
- Face detection and tracking
- Temporal rhythm analysis
```python
from video.video import ClipVideo

detector = ClipVideo()
timestamps, scores = detector.compute_visual_scores(
    video_path="video.mp4",
    target_fps=2,
    use_cache=True
)
```

Returns: List of sentence segments with timestamps
```python
from Transcription.transcribe import Transcriber

segments = Transcriber.transcribe_with_timestamps(
    audio_path="audio.wav",
    model_size="base",  # tiny/base/small/medium/large
    verbose=False
)
# [{"start": 0.0, "end": 3.5, "text": "Hello world"}, ...]
```

Uses: OpenRouter API for GPT-4o-mini
```python
from LLM.llm import LLM

llm = LLM()
response = llm.generate_text(
    prompt="Your prompt here",
    model="openai/gpt-4o-mini",
    max_tokens=2000,
    temperature=0.3
)
```

```
usage: main.py [-h] [--query QUERY] [--audio-weight AUDIO_WEIGHT]
               [--video-weight VIDEO_WEIGHT] [--fps FPS]
               [--min-duration MIN_DURATION] [--max-duration MAX_DURATION]
               [--output-dir OUTPUT_DIR] [--no-export]
               video_path

positional arguments:
  video_path            Path to input video file

optional arguments:
  -h, --help            show this help message and exit
  --query QUERY         Query for clip selection (default: "give me 10 interesting clips")
  --audio-weight AUDIO_WEIGHT
                        Weight for audio scores 0-1 (default: 0.5)
  --video-weight VIDEO_WEIGHT
                        Weight for video scores 0-1 (default: 0.5)
  --fps FPS             Target FPS for video analysis (default: 2)
  --min-duration MIN_DURATION
                        Minimum clip duration in seconds (default: 5)
  --max-duration MAX_DURATION
                        Maximum clip duration in seconds (default: 60)
  --output-dir OUTPUT_DIR
                        Output directory for clips (default: "output")
  --no-export           Skip exporting video files
```
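For example, a typical run with custom limits might look like this (the file name and query are illustrative):

```bash
python main.py interview.mp4 \
    --query "clip every moment where the guest tells a personal story" \
    --fps 2 --min-duration 10 --max-duration 45
```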
The system generates:
- Location: `output/clips_<timestamp>/`
- Format: Individual MP4 files (`clip_001.mp4`, `clip_002.mp4`, etc.)
- Content: Extracted video segments ready to use
- Clip Metadata: `.cache/metadata/` - Individual JSON files for each clip
- Analysis Report: `.cache/analysis/` - Complete analysis metadata
- Feature Cache: `.cache/audio/` and `.cache/video/` - Cached features for faster re-runs
- Transcription Cache: `.cache/transcription/` - Cached transcripts
Example clip metadata:
```json
{
  "clip_number": 1,
  "video_file": "output/clips_20260110_192101/clip_001.mp4",
  "start_time": 45.2,
  "end_time": 58.7,
  "duration": 13.5,
  "transcript": "...",
  "interest_score": 9.5,
  "reason": "Emotional storytelling with dramatic pause",
  "tags": ["emotional", "dramatic"]
}
```

Clean Output: Your `output/` folder only contains the video clips - all metadata and cache files are organized in `.cache/` to keep things tidy!
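If you want to post-process results programmatically, here is a minimal sketch for loading the per-clip metadata (assuming the JSON files in `.cache/metadata/` follow the schema above; exact file names may differ):

```python
import glob
import json

# Load every per-clip metadata file and rank clips by interest score
clips = []
for path in sorted(glob.glob(".cache/metadata/*.json")):
    with open(path, "r", encoding="utf-8") as f:
        clips.append(json.load(f))

for clip in sorted(clips, key=lambda c: c["interest_score"], reverse=True):
    print(f"{clip['video_file']}: {clip['interest_score']}/10 - {clip['reason']}")
```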
Test audio analysis:

```bash
python audio.py path/to/audio.wav
```

Test video analysis:

```bash
python video.py path/to/video.mp4
```

Test transcription:

```bash
python transcribe.py path/to/audio.wav
```

The system automatically caches expensive computations in the `.cache/` directory:

- Audio features: `.cache/audio/audio_cache_<hash>.npz`
- Video features: `.cache/video/visual_cache_<hash>.npz`
- Transcriptions: `.cache/transcription/transcript_<hash>.json`
- Metadata: `.cache/metadata/` and `.cache/analysis/`
This makes subsequent runs much faster! To disable caching:

```python
extractor = ViralClipExtractor(use_cache=False)
```

- GPU Acceleration: Install CUDA-enabled PyTorch for faster processing (see the quick check below)
- Lower FPS: Use `--fps 1` for faster video analysis (less accurate)
- Smaller Models: Whisper uses the "base" model by default (a good balance of speed and accuracy)
- Cache Results: Re-runs on the same video are much faster with caching
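To confirm that a CUDA-enabled PyTorch build actually sees your GPU, a quick check (plain PyTorch, not part of the Clipz API):

```python
import torch

# True only if a CUDA-enabled build and a compatible GPU are available
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```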
Key dependencies:
- ultralytics: YOLOv8 object detection
- transformers: CLIP model for semantic analysis
- whisper: Speech transcription
- librosa: Audio analysis
- opencv-python: Video processing
- torch: Deep learning framework
- dlib: Face detection
- praat-parselmouth: Prosody analysis
See `requirements.txt` for the complete list.
The YOLOv8 model (yolov8n.pt) is automatically downloaded on first run by the Ultralytics package. You don't need to manually download it.
If you encounter issues:
- Ensure you have an internet connection on the first run
- The model (~6MB) downloads to Ultralytics cache
- Check firewall settings if download fails
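If you prefer to trigger the download manually (for example, before moving to a machine with restricted outbound access), the standard Ultralytics call is enough:

```python
from ultralytics import YOLO

# Downloads yolov8n.pt into the Ultralytics cache if it is not already present
YOLO("yolov8n.pt")
```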
Install FFmpeg:
- Windows: Download from https://ffmpeg.org/download.html
- macOS: `brew install ffmpeg`
- Linux: `sudo apt-get install ffmpeg`
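After installing, verify that FFmpeg is on your PATH:

```bash
ffmpeg -version
```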
Check that your `.env` file contains a valid `OPENROUTER_API_KEY`.
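The file should contain a line like the following (the value is a placeholder, not a real key):

```
OPENROUTER_API_KEY=your_openrouter_key_here
```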
Try:
- Lowering `target_fps` (default is 2)
- Processing shorter videos
- Closing other applications
Videos containing multiple languages may produce unexpected clips because Whisper translates everything into English by default. This can result in:
- Loss of context from non-English speech
- Incorrect clip boundaries due to translation timing differences
- Mixed language content being merged incorrectly
Workaround: For better results with multi-language content, process each language segment separately or use the `task="transcribe"` parameter to keep the original language (see the example below).
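If you call Whisper directly (outside the project's Transcriber wrapper, which may not expose this option), keeping the original language looks like this:

```python
import whisper

model = whisper.load_model("base")
# task="transcribe" keeps the spoken language instead of translating to English
result = model.transcribe("audio.wav", task="transcribe")
for segment in result["segments"]:
    print(f"{segment['start']:.1f}-{segment['end']:.1f}: {segment['text']}")
```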
Videos longer than 1 hour may consume significant processing time (30-60+ minutes depending on hardware):
- Audio feature extraction scales with video duration
- Video analysis requires processing thousands of frames
- LLM analysis has context window limits for very long videos
Tips for long videos:
- Use a lower `target_fps` (1 instead of 2) for faster processing
- Enable caching to avoid re-processing if you need to re-run
- Consider splitting very long videos into smaller segments (see the FFmpeg example after this list)
- Ensure sufficient RAM (16GB+ recommended for 1-hour videos)
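One way to split a long recording before processing is FFmpeg's segment muxer (the 20-minute chunk length is just an example):

```bash
# Copy streams without re-encoding, cutting into roughly 20-minute parts
ffmpeg -i long_video.mp4 -c copy -map 0 -f segment -segment_time 1200 -reset_timestamps 1 part_%03d.mp4
```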
- Speaker Diarization: Integrate `pyannote.audio` or `SpeechBrain` to segment clips per speaker
- Speaker Queries: Enable queries like "Give me all clips where speaker X is explaining something" or "Combine all funny reactions of speaker Y"
- Speaker Embeddings: Integrate speaker identity into LLM scoring for intelligent semantic merges based on who's speaking
- Multi-speaker Analysis: Track speaker transitions and dialogue patterns for better clip boundaries
- Emotion Recognition: Train or integrate pre-trained models for emotion/intensity detection beyond generic audio peaks
- Engagement Scoring: Guide LLM to rank clips not only by volume/motion but by perceived emotional engagement
- Sentiment Analysis: Combine audio emotion with transcript sentiment for deeper understanding
- Facial Expression Analysis: Detect smiles, laughter, surprise in video frames to enhance excitement scoring
- Platform Presets: User-specified clip length preferences (short for TikTok/Reels, longer for podcasts/YouTube)
- Intelligent Merging: LLM can merge multiple peaks while respecting target duration constraints
- Dynamic Segmentation: Automatically adjust clip boundaries based on content density and pacing
- Custom Templates: Save and reuse clip duration strategies for different content types
- Auto-Classification: Detect content type (comedy, sports, interview, tutorial, etc.) and adjust fusion weights accordingly (see the sketch after this list):
  - Stand-up comedy → Audio-heavy (0.7 audio, 0.3 video)
  - Sports/Gaming → Video-heavy (0.3 audio, 0.7 video)
  - Interviews/Podcasts → Balanced (0.5 audio, 0.5 video)
- Genre-Specific Models: Fine-tune excitement scoring for different video genres
- Context-Aware Features: Enable/disable specific features based on content type
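One possible shape for such presets (purely illustrative; nothing like this exists in the codebase yet):

```python
# Hypothetical fusion-weight presets keyed by detected content type
FUSION_PRESETS = {
    "standup_comedy": {"audio_weight": 0.7, "video_weight": 0.3},
    "sports_gaming": {"audio_weight": 0.3, "video_weight": 0.7},
    "interview_podcast": {"audio_weight": 0.5, "video_weight": 0.5},
}

def weights_for(content_type: str) -> dict:
    """Fall back to a balanced split for unknown content types."""
    return FUSION_PRESETS.get(content_type, {"audio_weight": 0.5, "video_weight": 0.5})
```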
- Live Stream Support: Extract highlights on-the-fly from live streams or ongoing recordings
- Streaming Inference: Fast scoring models optimized for real-time processing
- Incremental LLM Prompts: Streaming-friendly LLM prompt design for progressive clip selection
- Buffer Management: Smart windowing for continuous audio/video analysis
- Forced Alignment: Combine transcript with precise word-level timestamps
- Subtitle Generation: Auto-generate SRT/VTT files for each extracted clip
- Multi-Language Support: Transcribe and caption in multiple languages
- Styling Options: Customizable subtitle appearance for different platforms (TikTok, YouTube Shorts, Instagram)
- Accessibility: Ensure all clips are accessible with proper closed captions
We're excited about these features and welcome contributions! If you're interested in implementing any of these enhancements, please:
- Open an issue to discuss your approach
- Fork the repository and create a feature branch
- Submit a pull request with comprehensive tests
MIT License - see the LICENSE file for details.
Contributions welcome! Please read our Contributing Guidelines for details on how to submit pull requests, report issues, and contribute to the project.
- YOLOv8 by Ultralytics
- CLIP by OpenAI
- Whisper by OpenAI
- OpenRouter for LLM API access
