posts

writings on vision, generative models, and 3D

Unified Generative-Predictive Modeling for 4D Scene Understanding

A single diffusion transformer treats RGB video, depth, and camera rays as symmetric modalities, casting visual generation and geometric prediction as the same conditional-completion problem. The result is one model that handles camera-controlled video synthesis, depth estimation, and pose estimation, and whose predictions improve as more samples are drawn at inference time.

23 min read · 2026