3D generation is, at its core, an industrial problem. It is not enough for models to perform well visually; the output must also meet specific industrial standards, such as material representation, polygon topology, and structural soundness. Results that cannot align with these human-defined industrial standards require extensive adjustment and are difficult to apply in production.
Just as large language models (LLMs) need to align with human values, AI models for 3D generation need to align with complex 3D industrial standards.
### A more practical approach has emerged: 3D Native
CLAY, a work from the MARS Lab at ShanghaiTech University nominated for the Best Paper Award, has shown the industry a viable solution to the above problems, namely 3D Native.
Over the past two years, the technical routes for 3D generation have fallen roughly into two categories: 2D upscaling and native 3D.
2D upscaling performs 3D reconstruction through 2D diffusion models combined with methods like NeRF. Because they can be trained on large amounts of 2D image data, these models often generate diverse results. However, since 2D diffusion models lack 3D priors, they have a limited understanding of the 3D world and tend to produce geometrically implausible results (such as humans or animals with multiple heads).
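For context, a representative 2D-upscaling objective is the score distillation sampling (SDS) gradient from DreamFusion, which many of these pipelines build on: a differentiable 3D representation with parameters $\theta$ (e.g., a NeRF) is rendered into an image $x = g(\theta)$, and a frozen 2D diffusion model supplies the gradient

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right],
$$

where $x_t$ is the rendered image noised to timestep $t$, $y$ is the text prompt, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction, and $w(t)$ is a weighting term. Nothing in this objective observes 3D structure directly, which is exactly why artifacts like multiple heads can appear.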
A series of recent multi-view reconstruction works have alleviated this problem to some extent by adding multi-view 2D renderings of 3D assets to the training data of 2D diffusion models. The limitation is that these methods still start from 2D images, so they optimize for image quality rather than geometric fidelity; the generated geometry is often incomplete and lacks detail.
In other words, 2D data ultimately records only one aspect, a projection, of the real world. No matter how many viewing angles are captured, 2D images cannot fully describe three-dimensional content. What the model learns therefore still has large gaps, and the generated results need substantial correction, making it difficult to meet industrial standards.
Considering these limitations, the CLAY research team chose another path: 3D Native.
This approach directly trains generative models from 3D datasets, extracting rich 3D priors from various 3D geometric shapes. As a result, the model can better "understand" and preserve geometric features.
However, such models also need to be large enough for powerful generative capabilities to "emerge," and larger models require larger training datasets. High-quality 3D datasets are notoriously scarce and expensive, which is the first problem the native 3D approach must solve.
In the CLAY paper, researchers adopted a customized data processing pipeline to mine various 3D datasets and proposed effective techniques to scale up the generative model.
Specifically, their data processing pipeline starts with a custom remeshing algorithm that converts 3D data into watertight meshes, meticulously preserving important geometric features such as hard edges and flat surfaces. Additionally, they used GPT-4V to create detailed annotations highlighting important geometric characteristics.
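As a rough illustration of the watertight-conversion step, the sketch below uses a generic voxelize-fill-extract pass in Python. This is not CLAY's actual remeshing algorithm (which is feature-preserving, whereas plain voxelization smooths away hard edges); the function name `to_watertight` and the file `asset.obj` are illustrative assumptions.

```python
# Minimal sketch: convert an arbitrary mesh to a watertight one by
# voxelizing it, filling the interior, and extracting an isosurface.
# NOTE: unlike CLAY's custom remesher, this does NOT preserve hard edges.
import numpy as np
import trimesh
from skimage import measure

def to_watertight(mesh: trimesh.Trimesh, pitch: float = 0.01) -> trimesh.Trimesh:
    voxels = mesh.voxelized(pitch=pitch).fill()        # solid occupancy grid
    vol = np.pad(voxels.matrix, 1).astype(np.float32)  # pad so the surface closes
    verts, faces, _, _ = measure.marching_cubes(vol, level=0.5)
    verts -= 1.0                                       # undo the padding offset
    # Map voxel-index coordinates back into the mesh's world frame.
    verts = trimesh.transform_points(verts, voxels.transform)
    return trimesh.Trimesh(vertices=verts, faces=faces)

mesh = trimesh.load("asset.obj", force="mesh")         # hypothetical input asset
watertight = to_watertight(mesh)
print(watertight.is_watertight)                        # expected: True
```

A watertight surface is what makes inside/outside (occupancy or SDF) supervision well defined, which is why this conversion comes first in the pipeline.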
After passing through this pipeline, numerous datasets were merged into the very large 3D dataset used to train CLAY. Previously, because of differing formats and a lack of consistency, these datasets had never been used together to train 3D generative models. The processed, combined dataset maintains consistent representations and coherent annotations, which greatly improves the generalization of generative models.
CLAY, trained on this dataset, includes a 3D generative model with as many as 1.5 billion parameters. To minimize information loss in the conversion from the dataset to the implicit representation and then to the output, the team spent a long time screening and refining candidates, eventually arriving at a new, efficient 3D representation. Specifically, they adopted the neural field design from 3DShape2VecSet to describe continuous, complete surfaces, combined with a custom multi-resolution geometry VAE that processes point clouds at different resolutions, allowing it to adapt to different latent sizes.
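A minimal PyTorch sketch of a VecSet-style geometry VAE conveys the idea: a set of latent vectors cross-attends to surface points on the encoder side, and arbitrary query coordinates cross-attend to that latent set to predict occupancy on the decoder side. The class name, layer sizes, and fixed latent count below are illustrative assumptions, not CLAY's actual architecture (which varies the latent resolution).

```python
import torch
import torch.nn as nn

class GeometryVAE(nn.Module):
    """Toy VecSet-style VAE: point cloud -> latent set -> occupancy field."""
    def __init__(self, num_latents=256, dim=512, heads=8):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)                       # embed surface samples
        self.query = nn.Parameter(torch.randn(num_latents, dim))   # learnable latent set
        self.enc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_mu, self.to_logvar = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.coord_embed = nn.Linear(3, dim)                       # embed query coordinates
        self.dec_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.occ_head = nn.Linear(dim, 1)                          # occupancy logit

    def encode(self, points):                             # points: (B, N, 3)
        feats = self.point_embed(points)
        q = self.query.expand(points.shape[0], -1, -1)
        latents, _ = self.enc_attn(q, feats, feats)       # cross-attend to the cloud
        return self.to_mu(latents), self.to_logvar(latents)

    def decode(self, latents, coords):                    # coords: (B, M, 3)
        feats, _ = self.dec_attn(self.coord_embed(coords), latents, latents)
        return self.occ_head(feats).squeeze(-1)           # (B, M) occupancy logits

    def forward(self, points, coords):
        mu, logvar = self.encode(points)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decode(z, coords), mu, logvar

vae = GeometryVAE()
points = torch.rand(2, 2048, 3)             # surface samples from a watertight mesh
queries = torch.rand(2, 512, 3)             # 3D locations to query for occupancy
logits, mu, logvar = vae(points, queries)   # logits: (2, 512)
```

Because the latent is a set of vectors rather than a fixed grid, its size can change simply by using more or fewer queries, which is the property the multi-resolution design and the DiT below exploit.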
To facilitate model scaling, CLAY adopts a minimalist latent diffusion Transformer (DiT). Built from plain Transformer blocks, it can adapt to different latent sizes and therefore scales readily. CLAY also introduces a progressive training scheme that gradually increases both the latent size and the number of model parameters during training.
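The scaling recipe can be sketched as follows. `MiniDiT`, the stage schedule, and the cosine noising below are illustrative stand-ins, not CLAY's actual model or diffusion parameterization; the paper also grows the parameter count across stages, which is omitted here for brevity.

```python
import math
import torch
import torch.nn as nn

class MiniDiT(nn.Module):
    """Toy latent diffusion Transformer: plain attention blocks over a latent
    set, conditioned on the timestep, so it accepts any latent length."""
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, z, t):                     # z: (B, L, dim), t: (B,)
        cond = self.time_mlp(t[:, None])         # timestep embedding
        return self.out(self.blocks(z + cond[:, None, :]))

# Progressive training: the same weights are trained on ever-longer latents.
dit = MiniDiT()
opt = torch.optim.AdamW(dit.parameters(), lr=1e-4)
for num_latents, steps in [(64, 100), (256, 100), (1024, 100)]:  # toy schedule
    for _ in range(steps):
        z0 = torch.randn(8, num_latents, 512)    # stand-in for VAE latents
        t = torch.rand(8)
        noise = torch.randn_like(z0)
        a, s = torch.cos(t * math.pi / 2), torch.sin(t * math.pi / 2)
        zt = a[:, None, None] * z0 + s[:, None, None] * noise  # noised latents
        loss = nn.functional.mse_loss(dit(zt, t), noise)       # predict the noise
        opt.zero_grad(); loss.backward(); opt.step()
```

Starting with short latent sequences keeps early training cheap while the model learns coarse structure; later stages refine detail at higher latent resolution with the same Transformer weights.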
Ultimately, CLAY achieves precise control over geometry, allowing users to steer the complexity, style, and even specific characteristics of the generated geometry by adjusting prompts. Compared with previous methods, CLAY can quickly generate detailed geometry while preserving important geometric features such as flat surfaces and structural integrity.
Some results in the paper fully demonstrate the advantages of the native 3D approach. The figure below shows, for each generated result, the top three nearest samples the researchers retrieved from the dataset. The high-quality geometry generated by CLAY matches the prompts yet differs from the retrieved samples, demonstrating the richness and emergent capability characteristic of large models.
To make the generated digital assets directly usable in existing CG production pipelines, researchers have