Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Abstract

Text-to-3D generation has achieved remarkable success in digital object creation, attributed to the utilization of large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling the methodology up to urban scale. The complexity and vast scale of urban scenes, characterized by numerous elements and intricate arrangement relationships, make ambiguous textual descriptions insufficient for effective model optimization. In this work, we surmount these limitations via a paradigm shift in the current text-to-3D methodology, accomplished through a compositional 3D layout representation serving as an additional prior. The 3D layout comprises a set of semantic primitives with simple geometric structures (e.g., cuboids, ellipsoids, and planes) and explicit arrangement relationships. It complements textual descriptions while also enabling steerable generation. Based on the 3D layout representation, we propose two modifications to the current text-to-3D paradigm: (1) We introduce Layout-Guided Variational Score Distillation (LG-VSD) to address model optimization inadequacies. It incorporates the geometric and semantic constraints of the 3D layout into the score distillation sampling process by extending the formulation to a conditional manner. (2) To handle the unbounded nature of urban scenes, we represent the 3D scene with a Scalable Hash Grid structure, which incrementally adapts to the growing scale of urban scenes. Extensive experiments substantiate the robustness of our framework, showcasing its capability to scale text-to-3D generation to large-scale urban scenes covering over 1000 m of driving distance for the first time. We also present various scene editing demonstrations (e.g., style editing, object manipulation, etc.), showing the complementary powers of both the 3D layout prior and text-to-image diffusion models in our framework for steerable urban scene generation.
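As a rough illustration of the conditional extension described above (the exact formulation is given in the paper; the notation here follows the standard Variational Score Distillation gradient and the layout symbol $L$ is an assumption), LG-VSD can be read as conditioning both score networks on the 3D layout in addition to the text prompt $y$:

```latex
\nabla_{\theta} \mathcal{L}_{\text{LG-VSD}}
= \mathbb{E}_{t,\epsilon,c}\!\left[
\omega(t)\,
\big(
\epsilon_{\text{pretrain}}(\mathbf{x}_t;\, y,\, L,\, t)
-
\epsilon_{\phi}(\mathbf{x}_t;\, y,\, L,\, c,\, t)
\big)
\frac{\partial \mathbf{g}(\theta, c)}{\partial \theta}
\right]
```

Here $\epsilon_{\text{pretrain}}$ is the pretrained diffusion score, $\epsilon_{\phi}$ is the fine-tuned (e.g., LoRA) score of the rendered distribution, $c$ is the camera pose, $\omega(t)$ a timestep weighting, and $\mathbf{g}(\theta, c)$ the differentiable rendering of the scene parameters $\theta$; the only change relative to plain VSD is the additional layout condition $L$ on both networks.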

Publication
Preprint
Fan Lu
Ph.D. student

My research interests include 3D Computer Vision, Point Cloud Registration, and Neural Rendering.