Unified generative-predictive modeling for 4D scene understanding
A single diffusion transformer treats RGB video, depth, and camera rays as symmetric modalities, casting visual generation and geometric prediction as the same conditional-completion problem. The result is one model that performs camera-controlled video synthesis, depth estimation, and camera pose estimation, and that improves its own predictions by drawing more samples at inference time.
17 min read · 2025