Let's Make-It-3D! SJTU and Microsoft's latest open-source 2D-to-3D generation research tops 1k GitHub stars

  Machine Heart Column

  Machine Heart Editorial Department

  Given a few photos, can you guess what they would look like in the three-dimensional world?

  From only a single photo, we can easily infer an object's 3D geometry and its appearance from other viewpoints, thanks to the rich visual prior knowledge we have accumulated. This ability stems from our deep understanding of the visual world. Today, much like humans, some excellent image generation models, such as Stable Diffusion and Midjourney, also possess rich visual priors and demonstrate high-quality image generation. Based on this observation, the researchers hypothesized that a high-quality pre-trained image generation model has the same ability as humans: it can infer 3D content from a single real or AI-generated image.

  This task is very challenging: it requires not only estimating the underlying 3D geometry but also synthesizing unseen textures. Building on the hypothesis above, researchers from Shanghai Jiao Tong University, HKUST, and Microsoft Research proposed Make-It-3D, a method that creates high-fidelity 3D objects from a single image by using a 2D diffusion model as a 3D-aware prior. The framework requires no multi-view images for training and can be applied to any input image. The paper has been accepted to ICCV 2023.

  Paper link: https://arxiv.org/pdf/2303.14184.pdf

  Project link: https://make-it-3d.github.io/

  Github link: https://github.com/junshutang/Make-It-3D

  As soon as the paper was published, it sparked a lively discussion on Twitter, and the subsequently open-sourced code has accumulated more than 1.1k stars on GitHub.

  So what are the technical details behind the method?

  When optimizing the 3D representation, the method is built on two core objectives:

  1. The rendering from the reference viewpoint should be highly consistent with the input image;

  2. Renderings from novel viewpoints should be semantically consistent with the input. For this, the researchers use the BLIP-2 model to caption the input image, as sketched right after this list.
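  To make the captioning step concrete, here is a minimal sketch using the BLIP-2 interface in Hugging Face `transformers`; the checkpoint name and image path are assumptions for this example, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Hypothetical checkpoint choice; any BLIP-2 captioning checkpoint works.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("reference.png").convert("RGB")  # the single input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(ids[0], skip_special_tokens=True)
print(caption)  # used as the text condition for the diffusion prior
```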

  Based on these objectives, the first stage of the method randomly samples camera poses around the reference view. At the reference viewpoint, pixel-level constraints tie the rendered image to the reference image; at novel viewpoints, the prior from a pre-trained diffusion model measures the similarity between the rendered image and the text.
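  The novel-view term is in the spirit of score distillation sampling (SDS) as popularized by DreamFusion. Below is a minimal PyTorch sketch of such a first-stage objective; the function signature, loss weights, and the MSE-against-a-detached-target trick are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn.functional as F

def stage1_loss(ref_render, ref_image, novel_render, noise, noise_pred,
                w_ref=1.0, w_sds=1.0):
    """Sketch of the first-stage objective.

    ref_render, ref_image : (3, H, W) rendering / input at the reference pose
    novel_render          : (3, H, W) rendering at a randomly sampled pose
    noise, noise_pred     : noise added to the novel rendering and the
                            diffusion model's prediction of it, conditioned
                            on the BLIP-2 caption of the input image
    """
    # 1) The reference view must match the input image pixel-wise.
    loss_ref = F.mse_loss(ref_render, ref_image)

    # 2) SDS-style term: its gradient w.r.t. the rendering is
    #    (noise_pred - noise), realized here as an MSE against a
    #    detached, shifted copy of the rendering.
    target = (novel_render - (noise_pred - noise)).detach()
    loss_sds = F.mse_loss(novel_render, target)

    return w_ref * loss_ref + w_sds * loss_sds
```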

  However, text alone can hardly capture everything in an image, which makes it difficult to fully align the generated 3D model with the reference image. To strengthen the correlation between the generated geometry and the image, the paper therefore adds a constraint on the image similarity between the denoised image in the diffusion process and the reference image, namely the distance between their CLIP encodings. This further improves the similarity between the generated model and the input image.
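  Such an image-to-image constraint could be written as follows, using `CLIPVisionModelWithProjection` from `transformers`; the checkpoint and the cosine-distance formulation are assumptions for illustration, with gradients flowing only through the denoised image.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection

clip = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
clip.requires_grad_(False)  # the CLIP encoder itself stays frozen

def clip_image_loss(denoised, reference):
    """Cosine distance between CLIP embeddings of the denoised image and
    the reference image; both are (1, 3, 224, 224) tensors already
    normalized the way CLIP expects."""
    emb_d = clip(pixel_values=denoised).image_embeds
    with torch.no_grad():
        emb_r = clip(pixel_values=reference).image_embeds
    return 1.0 - F.cosine_similarity(emb_d, emb_r).mean()
```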

  In addition, the paper uses monocular depth estimated from the single image to avoid geometric ambiguities such as concave surfaces.
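  Because monocular depth is only defined up to an unknown scale and shift, a natural way to use it is a correlation-based regularizer. The sketch below uses a negative Pearson correlation between rendered and estimated depth, which is one common formulation and may differ from the paper's exact loss.

```python
import torch

def depth_correlation_loss(rendered_depth, mono_depth):
    """Negative Pearson correlation between the depth rendered from the
    3D representation and a monocular depth estimate of the input image.
    Scale- and shift-invariant, so it penalizes shape ambiguities (e.g.,
    concave surfaces) without requiring metric depth."""
    d_r = rendered_depth.flatten().float()
    d_m = mono_depth.flatten().float()
    d_r = d_r - d_r.mean()
    d_m = d_m - d_m.mean()
    corr = (d_r * d_m).sum() / (d_r.norm() * d_m.norm() + 1e-8)
    return 1.0 - corr
```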

  However, the researchers found that the implicit texture field obtained by this optimization struggles to fully reconstruct fine texture details of the image, such as the fur on a bear's surface and local color information, which are missing from the first-stage results. The method therefore introduces a second optimization stage focused on texture refinement.

  In the second stage, using the geometry obtained in the first stage, the method maps the high-quality texture of the reference image into 3D space, and then focuses on enhancing the texture of regions occluded from the reference viewpoint. To support this process, the method exports the first-stage implicit representation to an explicit one: a point cloud. Compared with the noisy mesh extracted by Marching Cubes, a point cloud provides cleaner geometric features and makes it easier to separate occluded regions from visible ones.
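  Mapping the reference texture into 3D amounts to back-projecting each reference-view pixel along its camera ray using depth; the sketch below illustrates this lifting step, assuming pinhole intrinsics `K` and a camera-to-world pose `c2w` (both hypothetical inputs for this example).

```python
import torch

def backproject_reference(ref_rgb, depth, K, c2w):
    """Lift reference-view pixels to colored 3D points.

    ref_rgb : (3, H, W) reference image
    depth   : (H, W) per-pixel depth at the reference view
    K       : (3, 3) pinhole intrinsics; c2w : (4, 4) camera-to-world pose
    Points visible here keep the image's texture; the remaining
    (occluded) points are the ones refined in the second stage.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    cam = (pix @ torch.linalg.inv(K).T) * depth[..., None]  # camera space
    world = cam @ c2w[:3, :3].T + c2w[:3, 3]                # world space
    return world.reshape(-1, 3), ref_rgb.permute(1, 2, 0).reshape(-1, 3)
```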

  The method then focuses on optimizing the texture of the occluded regions. A deferred renderer built on a UNet architecture renders the point cloud, and the prior from the pre-trained diffusion model is again used to refine the fine-grained texture of the occluded regions.
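  In spirit, a deferred point-cloud renderer first rasterizes per-point features into a screen-space feature map (e.g., with a point rasterizer such as PyTorch3D's) and then translates that map into RGB with an encoder-decoder. The toy module below sketches only the translation step; its layer sizes are invented for illustration and are far smaller than a real UNet.

```python
import torch
import torch.nn as nn

class DeferredRenderer(nn.Module):
    """Toy UNet-style translator: screen-space point features -> RGB."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(feat_dim, 32, 3, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(32 + 32, 3, 4, 2, 1)  # skip from down1

    def forward(self, feat_map):
        d1 = self.down1(feat_map)                    # (B, 32, H/2, W/2)
        d2 = self.down2(d1)                          # (B, 64, H/4, W/4)
        u1 = self.up1(d2)                            # (B, 32, H/2, W/2)
        return torch.sigmoid(self.up2(torch.cat([u1, d1], 1)))  # (B, 3, H, W)
```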

  From left to right: the reference image, the normal map and textured rendering obtained by the first-stage optimization, and the rendering after the second-stage texture refinement.

  The method also supports many interesting applications, including free editing and stylization of 3D textures, as well as text-driven generation of complex and diverse 3D content.


  As the first method to lift a 2D image into 3D space while maintaining rendering quality and realism on par with the reference image, Make-It-3D is dedicated to creating 3D content with the same visual quality as 2D images. The researchers hope that this work will draw more attention from academia and industry to 2D-to-3D conversion and accelerate the development of 3D content creation. For more experimental details and results, please refer to the paper and the project page.

  Now, let's Make-It-3D!
