Generating 3D object shapes is one of the most challenging and significant tasks in 3D vision and computer graphics. Recent works have introduced effective approaches that create high-quality and diverse 3D objects by leveraging advanced neural representations and generative models. However, these methods require extensive training time and resources, often taking at least two to three days on more than four high-end GPUs. In addition, no open large 3D model is available for efficient fine-tuning, owing to the scarcity and high computational cost of 3D data. In this thesis, I propose a method for directly fine-tuning a large 2D vision model for 3D mesh generation based on the triplane representation. Triplanes store encoded 3D geometric information as 2D feature maps, serving as a bridge between the 2D and 3D domains. Although the pretrained parameters of the large vision model are optimized for the 2D domain, they can be used to initialize the parameters of the 3D generative model. By overcoming the differing nature of 2D image latents and triplane latents for 3D shapes, this approach significantly reduces the time required to learn 3D data. I also provide experiments and analyses, including additional parameter-efficient fine-tuning methods. The proposed fine-tuning approaches achieve much faster convergence on a single GPU. Moreover, models adapted with parameter-efficient fine-tuning not only require far fewer trainable parameters to store but also enable task switching by simply swapping the adapted weight parameters, while incurring no additional latency during inference.
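To illustrate why adapter-based parameter-efficient fine-tuning can incur no extra inference latency, the following is a minimal sketch assuming a LoRA-style low-rank adapter; the variable names, dimensions, and NumPy implementation are illustrative assumptions, not code from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 8, 2  # illustrative layer width and adapter rank

# Frozen pretrained weight (e.g., from a large 2D vision model).
base_weight = rng.standard_normal((d, d))

# Small trainable adapter factors: the only per-task parameters to store.
A = rng.standard_normal((rank, d)) * 0.01
B = np.zeros((d, rank))  # zero-init so adaptation starts as a no-op

def adapted_forward(x, W, B, A):
    """Forward pass applying the low-rank update on the fly."""
    return x @ (W + B @ A).T

# Before inference, the adapter can be merged into the base weight once,
# so the adapted model runs exactly like the original (no added latency).
merged_weight = base_weight + B @ A

x = rng.standard_normal((1, d))
y_adapter = adapted_forward(x, base_weight, B, A)
y_merged = x @ merged_weight.T
assert np.allclose(y_adapter, y_merged)
```

Switching tasks then amounts to subtracting one merged update and adding another, since only the small factors A and B differ between tasks.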
Publisher
Ulsan National Institute of Science and Technology