By Dave DeFusco
In artificial intelligence, creating a flawless video edit, whether swapping skies in a sunset scene or generating a realistic animation from a still image, is no longer the main challenge. The real test is judging how well those edits were done. Until now, the industry has relied on tools that compare frames to captions or measure pixel changes, but those tools often fall short, especially at capturing how an edited video unfolds over time.
Enter SST-EM, a new evaluation framework developed by Katz School researchers who presented their work in February at the prestigious IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). SST-EM stands for Semantic, Spatial and Temporal Evaluation Metric. It’s not a video editing tool itself but a way to measure how well AI models perform the task of editing video, something no current method does effectively on its own.
“We realized that the tools currently used to evaluate video editing models are fragmented and often misleading,” said Varun Biyyala, a lead author of the study, a 2024 graduate of the M.S. in Artificial Intelligence and an industry professor at the Katz School. “You could have a video that looks great in still frames, but the moment you play it, the movement looks unnatural or the story doesn’t match the editing prompt. SST-EM fixes that.”
Most existing evaluation systems rely on a model called CLIP, which compares video frames to text prompts to measure how closely an edited video matches a desired concept. While useful, these tools fall short in a key area: time.
“CLIP scores are good for snapshots, not for full scenes,” said Bharat Chanderprakash Kathuria, an artificial intelligence master’s student and one of the lead researchers. “They don’t evaluate how the video flows, whether objects move naturally or if the main subjects stay consistent. And that’s exactly what human viewers care about.”
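For readers curious what that frame-by-frame scoring looks like in practice, the sketch below computes a CLIP-style similarity between each frame and an editing prompt and averages the results. It is an illustration rather than the researchers’ code: it assumes the open-source Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint, and the averaging step discards temporal order entirely, which is exactly the limitation described above.

```python
# Minimal sketch of frame-level CLIP scoring (illustrative, not the SST-EM code).
# Assumes: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_frame_score(frames: list[Image.Image], prompt: str) -> float:
    """Average cosine similarity between each video frame and the editing prompt."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize the projected embeddings, then take frame-vs-prompt cosine similarity.
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    # Averaging per-frame similarities ignores how the video flows over time.
    return (image_emb @ text_emb.T).squeeze(-1).mean().item()
```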
The team also pointed out that CLIP’s training data is often outdated or biased, limiting its usefulness in modern, dynamic contexts. To fix these shortcomings, they designed SST-EM as a four-part evaluation pipeline; a simplified sketch of how the parts combine into one score follows the list:
- Semantic Understanding: First, a Vision-Language Model (VLM) analyzes each frame to see if it matches the intended story or editing prompt.
- Object Detection: Using state-of-the-art object detection algorithms, the system tracks the most important objects across all frames to ensure continuity.
- Refinement by AI Agent: A large language model (LLM) helps refine which objects are the focus, similar to how a human might identify the main subject of a scene.
- Temporal Consistency: A Vision Transformer (ViT) checks that frames flow smoothly into each other, with no jerky movements, disappearing characters or warped backgrounds.
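In simplified form, each of those four checks yields a sub-score that is then folded into a single number. The sketch below shows that structure; the field names and the uniform weights are placeholders for illustration, not the published implementation, which derives its weights from human evaluations as described next.

```python
# Illustrative structure of an SST-EM-style aggregate score (placeholder names and weights).
from dataclasses import dataclass

@dataclass
class ComponentScores:
    semantic: float   # VLM: how well each frame matches the editing prompt, in [0, 1]
    spatial: float    # object detection: continuity of key objects across frames, in [0, 1]
    focus: float      # LLM agent: agreement on which objects are the main subjects, in [0, 1]
    temporal: float   # ViT: frame-to-frame smoothness, in [0, 1]

def aggregate_score(s: ComponentScores, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four sub-scores; the uniform weights are purely illustrative."""
    parts = (s.semantic, s.spatial, s.focus, s.temporal)
    return sum(w * p for w, p in zip(weights, parts))

# Example: a hypothetical edited video scored by the four components.
print(aggregate_score(ComponentScores(semantic=0.91, spatial=0.88, focus=0.95, temporal=0.84)))
```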
“All four parts feed into a single final score,” said Jialu Li, a co-author of the study and an artificial intelligence master’s student. “We didn’t just guess how to weight them. We used regression analysis on human evaluations to determine what mattered most.”
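As a rough illustration of that regression step, the sketch below fits non-negative weights that map four component scores to human ratings using scikit-learn. The arrays are invented stand-ins; the study’s actual human-evaluation data and fitted weights are not reproduced here.

```python
# Hedged sketch: fit component weights against human ratings (invented data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows are edited videos; columns are the semantic, spatial, focus and temporal sub-scores.
component_scores = np.array([
    [0.91, 0.88, 0.95, 0.84],
    [0.74, 0.69, 0.81, 0.62],
    [0.55, 0.61, 0.58, 0.47],
    [0.33, 0.42, 0.40, 0.29],
])
human_ratings = np.array([0.90, 0.70, 0.52, 0.31])  # e.g., normalized expert scores

reg = LinearRegression(positive=True).fit(component_scores, human_ratings)
weights = reg.coef_ / reg.coef_.sum()  # normalize so the learned weights sum to 1
print("fitted weights:", weights)
```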
That human-centered approach is what sets SST-EM apart. The team ran side-by-side comparisons between SST-EM and other popular metrics across multiple AI video editing models like VideoP2P, TokenFlow, Control-AVideo and FateZero. The results were clear: SST-EM came closest to human judgment every time.
The SST-EM score achieved near-perfect correlation with expert ratings, beating out every other metric on measures like imaging quality, object continuity and overall video coherence. On Pearson correlation, which measures linear similarity, SST-EM scored 0.962, higher than even metrics designed solely for image quality. In Spearman and Kendall rank correlation tests, which judge how closely the system’s rankings match human rankings, SST-EM scored a perfect 1.0.
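For context, all three agreement measures are standard statistics and can be computed with SciPy, as in the short sketch below; the score lists are illustrative placeholders rather than the paper’s data.

```python
# Pearson, Spearman and Kendall agreement between a metric and human ratings (toy data).
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.81, 0.67, 0.74, 0.59]   # hypothetical SST-EM scores for four models
human_scores = [0.85, 0.64, 0.78, 0.55]    # corresponding hypothetical human ratings

print("Pearson: ", pearsonr(metric_scores, human_scores)[0])    # linear agreement
print("Spearman:", spearmanr(metric_scores, human_scores)[0])   # rank agreement
print("Kendall: ", kendalltau(metric_scores, human_scores)[0])  # pairwise rank agreement
```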
“These numbers are not just good; they’re remarkable,” said Dr. Youshan Zhang, assistant professor of artificial intelligence and computer science. “They show that SST-EM evaluates video editing the way humans do: by considering not just what’s in a frame, but how frames connect to tell a coherent story.”
The team has made their code openly available on GitHub, inviting researchers and developers worldwide to test and refine the system. They’re also planning enhancements, including better handling of fast scene changes, cluttered backgrounds and subtle object movements.
“We see SST-EM as the foundation for the next generation of video evaluation tools,” said Dr. Zhang. “As video content grows more complex and ubiquitous, we need metrics that reflect how people actually watch and judge video. This is a step in that direction.”