A deep technical tutorial exploring FlexClip's auto clipping pipeline: scene detection with PySceneDetect, transcript alignment via Whisper, highlight extraction NLP with BERT, and multimodal fusion. Build your own video clipper step by step with real code.
What You Will Build
You will build a local video clipper that mirrors what FlexClip's auto clipping pipeline does under the hood. Given any long video file (a podcast episode, a product demo, or a recorded presentation), your pipeline will automatically identify the most shareable moments and splice them into short clips. You will use PySceneDetect for scene boundaries, OpenAI Whisper for speech to text for video, BERT for highlight extraction NLP, and FFmpeg for the final trim. By the end you will understand the trade offs FlexClip's engineers made and know exactly how to customize each stage for your own projects.
Prerequisites
- Python 3.10 or newer installed on your machine
- FFmpeg installed and available on your PATH
- A GPU is helpful for Whisper but not required (CPU works, just slower)
- A short test video (30 seconds to 5 minutes works best for initial debugging)
- Basic familiarity with Python and running terminal commands
1. Overview of FlexClip's Auto Clipping Pipeline
FlexClip's product lets a user upload a raw video and receive polished short clips with zero editing. The magic is a multi stage pipeline that combines computer vision, natural language processing, and audio analysis. Understanding this architecture is the first step toward building your own version.
The pipeline works in five stages. First, it ingests the video and splits it into chunks based on visual scene changes. Second, it transcribes the audio into text with word level timestamps. Third, it analyzes that text to score each sentence or phrase by importance. Fourth, it aligns the high scoring text segments with the scene boundaries from stage one. Fifth, it trims the original video at those aligned boundaries and outputs the clips.
A critical design choice here is modularity. FlexClip's team could swap models as better ones emerged. You should do the same. The scene detector could be a simple histogram comparison or a deep learning model like TransNetV2. The transcription engine could be Whisper or a smaller model like Wav2Vec2. The NLP scorer could be BERT or a lightweight TF IDF approach. The pipeline architecture stays the same; only the model weights change.
Here is the skeleton of the pipeline in Python.
import subprocess
import json
from pathlib import Path
def auto_clip_pipeline(video_path: str, output_dir: str = "clips"):
"""Run the full auto clipping pipeline."""
# Stage 1: Scene detection
scene_list = detect_scenes(video_path) # returns list of (start_sec, end_sec)
# Stage 2: Transcription
transcript = transcribe_audio(video_path) # returns list of dicts with 'text', 'start', 'end'
# Stage 3: Highlight scoring
scored_sentences = score_highlights(transcript) # returns list with 'text', 'score'
# Stage 4: Multimodal fusion
clip_segments = fuse_signals(scene_list, scored_sentences, top_k=3)
# Stage 5: Trim video
for idx, (start, end) in enumerate(clip_segments):
trim_video(video_path, output_dir, idx, start, end)
print(f"Created {len(clip_segments)} clips in {output_dir}/")
Each function will be filled in the sections below. The pipeline is intentionally sequential: each stage feeds the next. Later you can parallelize the first three stages for production, but the sequential version is easier to debug and understand.
This kind of AI automation workflow is exactly what we explore in our guide on building an AI agent workflow for research, but here we go deeper into the code.
2. Scene Detection: Finding Boundaries with Temporal Analysis
The first job of the auto clipping pipeline is to find where one scene ends and another begins. FlexClip uses video scene detection to split a video into coherent visual chunks. A scene boundary is a point where the content changes significantly: a cut between two camera angles, a transition from one location to another, or a change in lighting or background.
The simplest and most reliable approach is histogram based detection. Every frame is converted to a color histogram, typically in the HSV color space. If the histogram of frame N differs from frame N minus 1 beyond a threshold, a scene change is declared. This works well for hard cuts but misses gradual transitions like dissolves or fades.
A more sophisticated approach uses a pretrained convolutional neural network like TransNetV2, which is trained specifically on shot boundary detection datasets. It catches both hard cuts and gradual transitions with higher accuracy, but it requires GPU inference and adds latency.
FlexClip likely uses a hybrid approach for production: fast threshold based detection for the first pass, then a CNN verification on borderline candidates. For your pipeline, start with PySceneDetect, a Python library that wraps both approaches.
Here is how to run scene detection with the content detector (histogram based).
from scenedetect import open_video, SceneManager
from scenedetect.detectors import ContentDetector
def detect_scenes(video_path: str, threshold: float = 27.0):
"""Return list of (start_sec, end_sec) for each detected scene."""
video = open_video(video_path)
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=threshold))
scene_manager.detect_scenes(video)
scene_list = scene_manager.get_scene_list()
# Convert to seconds
scenes_sec = []
for scene in scene_list:
start = scene[0].get_seconds()
end = scene[1].get_seconds()
scenes_sec.append((start, end))
return scenes_sec
Threshold tuning matters. A threshold of 27.0 is a good starting point for typical videos. Lower values detect more subtle changes (including camera movement or lighting shifts that are not true scene changes). Higher values miss some cuts. Test on your own content. If your video is a talking head with minimal background changes, you may need a lower threshold. If it is fast paced with many cuts, a higher threshold prevents over segmentation.
The output looks like this.
# Example output for a 10 minute podcast:
# [(0.0, 45.2), (45.2, 187.8), (187.8, 340.1), (340.1, 600.0)]
# Four scenes found, each with a start and end time in seconds.
One common pitfall: if your video has black frames between segments (common in compiled recordings), the detector splits at every black frame. Add a minimum scene length filter (say 10 seconds) to discard pointless micro scenes. You will implement that filter in the fusion stage.
3. Speech to Text: Transcribing Audio with Whisper
With scene boundaries in hand, you now need to know what people are saying. FlexClip uses speech to text for video to generate a full transcript with timestamps. This is the backbone of the highlight extraction step.
OpenAI Whisper is the obvious choice. It is open source, supports 99 languages, and handles noisy audio surprisingly well thanks to its transformer architecture trained on 680,000 hours of multilingual data. The trade off is size: the large model is almost 3 GB and runs slowly on CPU. The base or small model runs much faster and still produces solid results for English content.
To get word level timestamps (crucial for aligning highlights to exact video positions), use whisper timestamped. Standard Whisper outputs segment level timestamps only. Whisper timestamped runs a dynamic time warping pass after transcription to assign each word a precise start and end time.
Install it first.
pip install whisper-timestamped
Then run the transcription.
import whisper_timestamped as wt
def transcribe_audio(video_path: str, model_size: str = "base"):
"""Transcribe video audio and return list of word dicts with 'text', 'start', 'end'."""
audio = wt.load_audio(video_path)
model = wt.load_model(model_size, device="cpu") # use "cuda" if GPU available
result = wt.transcribe(model, audio, language="en")
words = []
for segment in result["segments"]:
for word in segment["words"]:
words.append({
"text": word["text"],
"start": word["start"],
"end": word["end"]
})
return words
Why not just use the python speech recognition library? Because it does not give you word level timestamps, and it fails on background music or multiple speakers. Whisper handles both gracefully. The transformer architecture uses attention to focus on the speech signal even when the audio is messy.
The output is a list of words, each with start and end in seconds.
# Example output:
# [
# {"text": "Today", "start": 12.3, "end": 12.6},
# {"text": "we", "start": 12.6, "end": 12.8},
# {"text": "are", "start": 12.8, "end": 12.9},
# {"text": "launching", "start": 12.9, "end": 13.4},
# ...
# ]
Group these words into sentences for the NLP stage. A simple heuristic: group by punctuation or by gaps longer than 0.3 seconds. We will do that in the next section.
If you are building this for a client facing application, you might want to explore how to connect Claude AI to your apps to automate the transcript review step without manual checks.
4. NLP Highlight Extraction: Identifying Key Moments
You now have a transcript grouped into sentences. The next job is highlight extraction NLP: scoring each sentence by how important or engaging it is. FlexClip likely uses a fine tuned BERT model trained on engagement metrics from millions of clips. You can approximate this with a pretrained BERT model plus a simple scoring heuristic.
Why BERT? BERT (Bidirectional Encoder Representations from Transformers) captures context from both sides of a word. A sentence like "This is a game changer" scores higher for importance than "Let me check my notes" because BERT understands the semantic weight of "game changer" in context. A bag of words approach would treat both sentences equally if they contain the same keywords.
Here is the approach. For each sentence in the transcript, compute a BERT embedding and compare it to a set of "high importance" reference sentences (for example, sentences containing launch announcements, key statistics, or opinionated statements). Use cosine similarity as the importance score. This is few shot: you provide a few examples of what a highlight looks like, and BERT finds similar sentences.
Here is a simpler rule based alternative that works well without any training data. It combines three signals.
- Keyword presence: Sentences containing words like "announce", "launch", "first ever", "important", "breakthrough", "revenue", "growth", "new feature" get a boost.
- Sentence position: The first and last 20 percent of a video are statistically more likely to contain key messages.
- Sentence length: Very short sentences are often filler. Very long sentences are often rambling. The sweet spot is 8 to 25 words.
HIGHLIGHT_KEYWORDS = [
"announce", "launch", "first ever", "important", "breakthrough",
"revenue", "growth", "new feature", "excited to", "proud to",
"milestone", "record", "exclusive", "game changer"
]
def score_sentence(sentence: dict, total_duration: float, sentence_index: int, total_sentences: int):
"""Return a score between 0 and 1 for a given sentence."""
text = sentence["text"].lower()
score = 0.0
# Keyword boost
for kw in HIGHLIGHT_KEYWORDS:
if kw in text:
score += 0.15
# Position boost: first 20% and last 20%
rel_pos = sentence_index / total_sentences
if rel_pos < 0.2 or rel_pos > 0.8:
score += 0.1
# Length penalty
word_count = len(text.split())
if 8 <= word_count <= 25:
score += 0.1
elif word_count > 40:
score -= 0.05
# Clamp to [0, 1]
return max(0.0, min(1.0, score))
The trade off. The rule based method is fast, interpretable, and requires no GPU. But it misses nuance. A sentence like "This is a huge day for our company" scores lower than it should because it contains none of the exact keywords. BERT solves this by understanding synonyms and context. If you have the resources, use a small BERT model (distilbert) for scoring. If you need speed, the rule based approach works.
For a deeper look at how AI can surface important content from your own business data, read our guide on how to build a no code second brain with Claude.
5. Multimodal Fusion: Combining Visual and Textual Signals
You now have scene boundaries from the visual analysis and scored sentences from the text analysis. The multimodal video analysis stage fuses these two signals to pick the exact start and end times for each clip.
This is where the pipeline gets its intelligence. A scene might look visually interesting but contain filler dialogue. A sentence might score high on importance but occur during a visually static shot. The fusion step balances both.
Here is a simple weighted scoring approach. For each candidate clip window defined by a scene boundary, compute a visual interest score and a textual importance score, then combine them linearly.
def fuse_signals(scene_list, scored_sentences, top_k=3, visual_weight=0.4, text_weight=0.6):
"""Return the top_k clip segments ranked by combined score."""
candidates = []
for scene_start, scene_end in scene_list:
# Gather sentences that fall within this scene
scene_sentences = [s for s in scored_sentences
if s["start"] >= scene_start and s["end"] <= scene_end]
if not scene_sentences:
continue
# Text score: average of sentence scores in this scene
avg_text_score = sum(s["score"] for s in scene_sentences) / len(scene_sentences)
# Visual score: simple motion metric
# (You can replace this with face detection or object presence)
# For now we use scene duration: shorter scenes tend to be more dynamic
visual_score = 1.0 / (1.0 + (scene_end - scene_start) / 60.0)
# Combined score
combined = visual_weight * visual_score + text_weight * avg_text_score
candidates.append({
"start": scene_start,
"end": scene_end,
"score": combined,
"text": " ".join(s["text"] for s in scene_sentences)
})
# Sort by score descending and return top_k
candidates.sort(key=lambda x: x["score"], reverse=True)
return [(c["start"], c["end"]) for c in candidates[:top_k]]
Why linear combination works. FlexClip's production system probably uses a learned fusion model (an attention based network that learns which modality to trust per video). But for your pipeline, a linear combination with tuned weights is surprisingly effective. The key insight: textual importance is usually more predictive of shareability than visual dynamism. Set text weight higher than visual weight (0.6 vs 0.4) unless your video is heavy on action and light on dialogue.
The visual score can be improved with face detection. Scenes with faces consistently receive higher engagement. You can add a face detection model like MTCNN or RetinaFace and boost the visual score for scenes that contain faces. This is especially useful for interview or vlog content.
6. Building the Clipper: Step by Step Implementation
Now assemble everything into a complete build video clipper pipeline. This section ties the previous stages together with FFmpeg for the final trim.
Step 1: Extract frames and audio
You need the audio as a separate file for Whisper. Use ffmpeg python.
import ffmpeg
def extract_audio(video_path: str, audio_path: str = "/tmp/temp_audio.wav"):
"""Extract audio from video file as 16kHz mono WAV."""
ffmpeg.input(video_path).output(
audio_path,
acodec="pcm_s16le",
ac=1,
ar=16000
).run(overwrite_output=True, quiet=True)
return audio_path
Step 2: Run scene detection
Use the detect_scenes function from section 2.
scene_list = detect_scenes("my_video.mp4", threshold=27.0)
Step 3: Transcribe and align timestamps
Use the transcribe_audio function from section 3, but pass the extracted audio file.
audio_path = extract_audio("my_video.mp4")
words = transcribe_audio(audio_path, model_size="base")
sentences = group_words_into_sentences(words) # a simple helper you'll write
Step 4: Score each sentence
Use the scoring function from section 4.
total_sentences = len(sentences)
scored = []
for i, sent in enumerate(sentences):
sent["score"] = score_sentence(sent, total_duration, i, total_sentences)
scored.append(sent)
Step 5: Fuse and trim
Use the fusion function from section 5, then trim with FFmpeg.
def trim_video(video_path: str, output_dir: str, idx: int, start: float, end: float):
"""Trim a segment from the video and save as a clip."""
output_path = f"{output_dir}/clip_{idx:03d}.mp4"
ffmpeg.input(video_path, ss=start, to=end).output(
output_path,
vcodec="libx264",
acodec="aac"
).run(overwrite_output=True, quiet=True)
print(f"Saved clip {idx} at {output_path} ({end-start:.1f}s)")
clip_segments = fuse_signals(scene_list, scored, top_k=3)
for idx, (start, end) in enumerate(clip_segments):
trim_video("my_video.mp4", "output_clips", idx, start, end)
Run the script and check the output_clips folder. You should have three short MP4 files representing the most interesting moments from your video. Each clip should start and end at a clean scene boundary and contain high importance dialogue.
7. Optimization, Pitfalls, and Next Steps
Your pipeline works. Now let me help you make it fast, robust, and ready for real world use. Video clipping optimization is about handling scale and edge cases.
Performance Bottlenecks
Transcription is the slowest stage. Whisper large takes roughly 10x real time on CPU. A 10 minute video takes 100 minutes. Use the base or small model for production. If you have a GPU, pass device="cuda" to cut that to near real time. Model quantization (using the int8 version of Whisper) reduces memory and speeds inference with minimal accuracy loss.
Scene detection is fast but can become a bottleneck if you process many videos in parallel. The histogram based detector runs at 50x real time. The TransNetV2 deep learning alternative runs at about 2x real time on GPU. Use the histogram detector as a first pass and run TransNetV2 only on videos where the threshold based result looks suspicious.
Common Pitfalls
Missed scene boundaries in subtle transitions. A slow crossfade or a fade to black is invisible to the histogram detector. PySceneDetect's AdaptiveDetector handles these better. Switch to it if your content uses many dissolves.
Incorrect word alignment with background music. Whisper sometimes hallucinates words during music segments or assigns incorrect timestamps. A workaround: compute the audio energy per frame and mask out segments where music dominates. Only run transcription on frames where speech energy exceeds music energy. This is called voice activity detection (VAD). The silero vad library does this in one line.
Output clips that are too short or too long. Set a minimum clip duration (8 seconds) and a maximum (90 seconds). Discard candidates outside this range in the fusion step. Short clips feel rushed. Long clips lose the audience.
Next Steps for Your Pipeline
Here are three extensions that move you closer to FlexClip's production quality.
- Add face detection for speaker emphasis. Use RetinaFace to detect faces throughout the video. Scenes where a face is centered and looking at the camera score higher. This is especially valuable for interview and talking head content.
- Integrate an LLM for summary based clipping. Instead of scoring sentences individually, pass the full transcript to an LLM (Claude or ChatGPT) and ask it to suggest clip boundaries with a reason. This adds context awareness. An LLM can identify a question answer pair as a highlight even if the individual sentences score low.
- Build a web UI. Wrap the pipeline in a simple Flask or FastAPI app with a drag and drop uploader. Connect it to a storage service like S3. Let your users upload a video and receive clips via email. This is exactly the kind of automated customer support with AI agents workflow that saves hours for content teams.
If you are building this for a team that also needs automated reporting, check out our guide on how to automate weekly reports with Claude and Google Sheets to extend your pipeline with a summary delivery mechanism.
The modular pipeline you built today is not a toy. It is the same architectural pattern that powers FlexClip and similar tools. The models will improve. The weights will change. But the pipeline of scene detection, transcription, highlight scoring, and fusion will remain the blueprint. You now know that blueprint end to end.
Cover photo by Luke Jones on Unsplash.
Frequently Asked Questions
Does FlexClip use Whisper for its speech to text, or do they have a custom model? +
FlexClip likely uses a proprietary fine tuned model similar to Whisper but optimized for their specific video types and latency requirements. For your own pipeline, Whisper is the best open source alternative that gives you word level timestamps needed for precise clip boundaries.
How do I handle videos with multiple speakers for highlight extraction? +
Speaker diarization separates who spoke when. Add pyannote audio to your pipeline to label each sentence with a speaker ID. Then you can prioritize highlights from the main speaker (the host or presenter) over secondary speakers (guests or audience members).
Can I run this pipeline on a server with no GPU? +
Yes. Use Whisper's tiny or base model with CPU. A 10 minute video takes about 10 to 15 minutes to transcribe on a modern CPU. For scene detection and scoring, no GPU is needed. Consider quantizing Whisper to int8 for a 2x speedup on CPU with negligible accuracy loss.
Lucas Oliveira