Minkyu Choi*, S P Sharan*, Sahil Shah, Harsh Goel, Mohammad Omama, and Sandeep Chinchali
Computer Vision and Pattern Recognition (CVPR), 2025
Introduction
In recent years, generative models that create videos from text prompts—known as text-to-video (T2V) models—have advanced significantly. Models like Sora, Gen-3, MovieGen, and CogVideoX are now being used in diverse fields ranging from entertainment to critical applications like autonomous driving and robotics. However, while these models produce visually impressive results, they often fail to accurately capture the temporal sequences described in their prompts.
Why Temporal Fidelity Matters
For many applications, especially those involving safety-critical tasks, temporal fidelity—the correct timing and sequencing of events—is as important as visual realism. Imagine retraining an autonomous vehicle’s motion planning system using synthetic videos generated from a prompt like, “A truck appears in frame 10 and veers in front of me onto my lane within 2 seconds.” If the timing and sequence of the truck’s movements aren’t accurately represented, it could lead to dangerous decisions in real-world scenarios.
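Requirements like this can be stated precisely in Temporal Logic. As a hedged sketch (the proposition names and time bounds below are illustrative; NeuS-V's PULS module may produce a different formula), the truck prompt might translate to:

```latex
\varphi \;=\; \mathbf{F}\,\big(\text{truck\_in\_frame}\big)
\;\wedge\;
\mathbf{G}\,\Big(\text{truck\_in\_frame} \;\rightarrow\; \mathbf{F}_{[0,\,2\text{s}]}\,\text{truck\_in\_my\_lane}\Big)
```

Here $\mathbf{F}$ ("eventually") requires the truck to appear at some point, and the bounded $\mathbf{F}_{[0,2\text{s}]}$ requires the lane change to follow within two seconds of the truck's appearance.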
Unveiling NeuS-V: Neuro-Symbolic Verification

To address the limitations of existing evaluation methods, we introduce NeuS-V, a groundbreaking evaluation technique that employs neuro-symbolic formal verification. NeuS-V translates natural language prompts into formal Temporal Logic (TL) specifications and generates an automaton representing the video’s events and sequences. By formally verifying this automaton against the TL specification, NeuS-V rigorously evaluates text-to-video alignment, significantly outperforming existing metrics.
How NeuS-V Works
NeuS-V operates through several key steps:
- Prompt Translation: Text prompts are converted into Temporal Logic specifications using Prompt Understanding via temporal Logic Specification (PULS).
- Semantic Scoring: Vision-language models assign confidence scores to each atomic proposition extracted from the prompt.
- Automaton Construction: Videos are represented as automata, capturing the sequential and probabilistic nature of frame-to-frame transitions.
- Formal Verification: Using probabilistic model checking, NeuS-V computes the satisfaction probability of the automaton meeting the prompt’s TL specification.
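The steps above can be sketched for the simplest possible specification, "eventually p" (F p). The following is a minimal illustration under stated assumptions: the per-frame confidence scores for the atomic proposition p (which NeuS-V obtains from a vision-language model) are taken as given and treated as independent per-frame probabilities; the paper's actual automaton construction and probabilistic model checking are more general.

```python
# Minimal sketch: satisfaction probability of an "eventually p" (F p)
# specification over a video, given per-frame confidence scores for the
# atomic proposition p. Scores below are hypothetical, not from a real VLM.

def satisfaction_probability(frame_confidences):
    """Probability that p holds in at least one frame, treating each
    frame's confidence as an independent probability of p holding there."""
    p_never = 1.0
    for c in frame_confidences:
        p_never *= (1.0 - c)  # p fails to hold in this frame
    return 1.0 - p_never

# Example: per-frame confidences that "a truck is in my lane".
scores = [0.05, 0.10, 0.20, 0.85, 0.90]
print(round(satisfaction_probability(scores), 3))  # → 0.99
```

A real implementation would instead compose the per-proposition scores into a probabilistic automaton and hand it to a model checker, which also handles sequencing constraints (e.g., "p before q") that this one-liner cannot express.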

Superior Evaluation Results
NeuS-V’s formal approach achieves over five times higher correlation with human evaluations than traditional visual-quality metrics such as VBench. This indicates NeuS-V’s strength in capturing temporal fidelity, making it particularly effective for scenarios where timing and sequence accuracy are critical.
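Agreement with human judgment is typically measured with a rank correlation such as Spearman's ρ. A self-contained sketch (the scores below are made up for illustration, not NeuS-V's actual results):

```python
# Spearman rank correlation between a metric's scores and human ratings.
# Pure-Python implementation with average ranks for ties.

def rank(values):
    """Return 1-based average ranks of values (ties share their mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative: metric scores vs. human ratings for five generated videos.
metric_scores = [0.91, 0.40, 0.75, 0.20, 0.66]
human_ratings = [4.5, 2.0, 3.0, 1.5, 4.0]
print(round(spearman(metric_scores, human_ratings), 3))  # → 0.9
```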

Benchmark and Dataset Contributions
To further the development of reliable T2V models, NeuS-V includes a comprehensive benchmark suite with temporally complex prompts designed to rigorously test temporal coherence and event sequencing. These benchmarks, along with the NeuS-V framework and dataset, are publicly available to facilitate continued advancements in the field.

Conclusion and Future Directions
NeuS-V’s integration of neuro-symbolic verification and Temporal Logic addresses a crucial gap in text-to-video generation evaluation. It opens a path toward more reliable generative models, particularly in domains where precise temporal alignment is paramount. Future directions involve refining the NeuS-V framework to penalize unintended elements in generated videos and further exploring prompt optimization and model training methodologies.
Citation
@InProceedings{Choi_2025_CVPR,