2026-07-02 · engineering
How Video2Any decides what counts as a slide
Every video-to-slides tool faces the same question: out of thousands of frames, which few are actually new slides? Most tools answer with a fixed rule, either one frame every N seconds or a single similarity cutoff, and the result is either 400 near-duplicates or a deck missing half its pages. Video2Any answers by reading the video first. Here is the whole pipeline, with real numbers.
Step 1: measure change, cheaply
We sample the video at your chosen interval and downscale each frame to 160 by 90 pixels. Two consecutive samples are compared block by block: for each 8 pixel block we take the mean absolute RGB difference, and the block counts as "changed" when that mean crosses a noise floor. The score for a frame pair is simply the fraction of blocks that changed, so it runs from 0 (nothing changed) to 1 (every block changed). Downscaling plus block averaging suppresses compression noise for free, and the whole comparison costs microseconds. No GPU, no model, no upload.
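As a rough illustration, the comparison could be sketched in Python as below. This is not Video2Any's actual code: the function name `change_score`, the packed-RGB input layout, the `noise_floor` default of 12, the reading of "8 pixel block" as an 8 by 8 square, and the handling of the partial bottom row of blocks (90 is not a multiple of 8) are all assumptions made for the example.

```python
from typing import Sequence


def change_score(a: Sequence[int], b: Sequence[int],
                 width: int = 160, height: int = 90,
                 block: int = 8, noise_floor: float = 12.0) -> float:
    """Fraction of blocks whose mean absolute RGB difference exceeds noise_floor.

    a and b are packed RGB frames (R, G, B, R, G, B, ...) of width * height pixels.
    noise_floor is a hypothetical value on the 0..255 scale, not the tool's setting.
    """
    if len(a) != width * height * 3 or len(b) != len(a):
        raise ValueError("frames must be packed RGB of size width * height * 3")

    changed = 0
    total = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            # Assumption: blocks on the bottom or right edge may be partial.
            y_end = min(by + block, height)
            x_end = min(bx + block, width)
            diff_sum = 0
            for y in range(by, y_end):
                start = (y * width + bx) * 3
                end = (y * width + x_end) * 3
                diff_sum += sum(abs(p - q) for p, q in zip(a[start:end], b[start:end]))
            samples = (y_end - by) * (x_end - bx) * 3
            if diff_sum / samples > noise_floor:
                changed += 1
            total += 1
    return changed / total
```

Because each block is averaged before it is compared with the noise floor, a handful of noisy pixels cannot flip a block on their own, which is the effect the paragraph above relies on.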
Step 2: classify the video before cutting it
The interesting problem is choosing the cutoff for "this is a new slide." A fixed number cannot work, because there are three different kinds of video:
- Slide decks. Most frame pairs are nearly identical and a few jump sharply. The score distribution is bimodal, so we find the split with a 1-D Otsu pass and set the threshold inside the gap. In our tests a white slide where only one line of text changes scores about 0.014, and calibration lands the threshold at 0.007, so the edit is caught.
- Static video. One title card and an hour of audio. Every score hovers near zero, nothing is bimodal, and the right behavior is to keep exactly one frame. The noise floor already guarantees that.
- Camera footage. A talking head moves in every single frame. On an 80 second test clip, the median frame pair scored 0.26, meaning a quarter of the image "changed" every second just from ordinary motion. Any fixed threshold tuned for slides keeps nearly everything: our first build kept 77 frames out of 80.
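For the slide-deck case, the Otsu step might look like the sketch below. It is an exact, sample-based variant of Otsu's method (it tries every gap between sorted scores rather than building a histogram), and placing the threshold at the midpoint of the winning gap is an assumption; the article only says the threshold sits "inside the gap". The name `otsu_threshold` is made up for the example.

```python
from typing import Iterable


def otsu_threshold(scores: Iterable[float]) -> float:
    """Split 1-D scores into two classes by maximizing between-class variance.

    Returns the midpoint of the gap between the two classes (an assumption).
    """
    xs = sorted(scores)
    n = len(xs)
    if n < 2 or xs[0] == xs[-1]:
        raise ValueError("need at least two distinct scores")

    total = sum(xs)
    best_var = -1.0
    best_cut = xs[0]
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no gap between equal values, so no valid split here
        w0 = i / n                          # weight of the low class
        w1 = 1.0 - w0                       # weight of the high class
        m0 = left_sum / i                   # mean of the low class
        m1 = (total - left_sum) / (n - i)   # mean of the high class
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_var = between_var
            best_cut = (xs[i - 1] + xs[i]) / 2
    return best_cut


# Illustrative scores only: nine still pairs and three small text edits.
print(otsu_threshold([0.0] * 9 + [0.014] * 3))
```

With these made-up scores the only gap lies between 0.0 and 0.014, so the midpoint rule puts the cut halfway between them.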
The regime decider turned out to be one robust statistic: the share of frame pairs that are visually still. If nearly all pairs are still, it is a static video, so keep one frame. If more than 60 percent are still, treat it as a deck and run Otsu. Otherwise it is motion footage, and the threshold becomes the median score plus two median absolute deviations: high enough that ordinary motion never triggers, low enough that a real scene change always does.
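A minimal sketch of that decision, under stated assumptions: the 60 percent cutoff and the "median plus two MADs" rule come from the text, while `still_eps` (what counts as "visually still"), `static_share` (what counts as "nearly all"), the use of an unscaled MAD, and both function names are hypothetical choices for the example.

```python
from statistics import median
from typing import Sequence


def classify_regime(scores: Sequence[float], still_eps: float = 0.005,
                    deck_share: float = 0.60, static_share: float = 0.98) -> str:
    """Return "static", "deck", or "motion" from the share of still frame pairs."""
    if not scores:
        raise ValueError("need at least one frame-pair score")
    still = sum(1 for s in scores if s <= still_eps) / len(scores)
    if still >= static_share:   # hypothetical cutoff for "nearly all still"
        return "static"
    if still > deck_share:      # 60 percent, as stated in the text
        return "deck"
    return "motion"


def motion_threshold(scores: Sequence[float], k: float = 2.0) -> float:
    """Median score plus k median absolute deviations (unscaled MAD assumed)."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    return med + k * mad
```

The static check runs first because a video that is nearly all still would also pass the 60 percent test. If the MAD is indeed unscaled, a median of 0.26 and the 0.514 threshold reported below for the same clip imply a MAD of about 0.127, since 0.26 + 2 × 0.127 = 0.514.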
The real-world result
On that 80 second, 1080p test clip, the motion rule set the threshold at 0.514 and extracted 10 slides, one per actual scene change, from the same footage where the naive fixed threshold had kept 77 frames. A 12 second synthetic deck still produces exactly its 4 slides, and an 8 second noise-heavy still produces exactly 1. All three cases run as automated tests against ffmpeg-synthesized fixtures on every change.
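The slide counts above are consistent with a very simple selection step once the threshold is known. The sketch below assumes the first sampled frame is always kept and that a later frame is kept whenever the score against its predecessor exceeds the threshold; the article does not spell this rule out, and `pick_slides` and the sample scores are invented for illustration.

```python
from typing import List, Sequence


def pick_slides(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of sampled frames to keep.

    scores[i] compares sampled frame i with frame i + 1.
    Assumption: frame 0 is always kept, so a video never yields zero slides.
    """
    kept = [0]
    for i, score in enumerate(scores):
        if score > threshold:
            kept.append(i + 1)
    return kept


# Made-up deck: 12 samples (11 pairs) with three slide changes.
deck_scores = [0.0, 0.0, 0.42, 0.0, 0.0, 0.38, 0.0, 0.0, 0.51, 0.0, 0.0]
print(pick_slides(deck_scores, threshold=0.2))  # [0, 3, 6, 9]
```

Under this rule a video whose scores never cross the threshold keeps only its first frame, which matches the single slide expected from a static clip.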
Why not machine learning?
Because the problem does not need it. Slide changes are a signal processing problem with clean structure, and a solution built from arithmetic runs in any browser tab at full speed, offline, on a five year old laptop. The one place we do use a model is optional: Whisper for the subtitle track, and even that runs locally. If a statistic does the job, ship the statistic.
Try it on one of your own recordings: the mode it picked is shown right in the interface after calibration.