Neuro Symbolic Video Search

The surge in video data necessitates advanced frame extraction tools. Foundation models like VideoLLaMA and ViCLIP falter in long-term reasoning, conflating frame perception with temporal analysis. We propose separating semantic understanding, using vision-language models for individual frames, from temporal reasoning, employing state machines and temporal logic. This approach significantly enhances complex event identification, improving F1 scores by 9-15% on self-driving datasets like Waymo and NuScenes compared to GPT4-based reasoning, showcasing the importance of decoupling these processes for effective scene identification.

Diffusion Based Adaptive Video Compression

Recent advances have enabled diffusion models to efficiently compress videos while maintaining high visual quality. By storing only keyframes and using these models to interpolate frames during playback, this method ensures high fidelity with minimal data. The process is adaptive, balancing detail retention and compression ratio, and can be conditioned on lightweight information like text descriptions or edge maps for improved results.

Distill Reinforcement Learning

Distilling Reinforcement Learning

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×