LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

This paper presents LLaMA-MESH, a novel framework that enables large language models (LLMs) to generate 3D meshes by representing mesh data as plain text. This approach leverages the spatial knowledge LLMs have already absorbed from textual sources such as 3D tutorials, and enables conversational 3D generation and mesh understanding.

Challenges of Integrating 3D Mesh Generation into LLMs:
One primary challenge in integrating 3D mesh generation into LLMs is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. Existing methods often train a new tokenizer, such as a vector-quantized variational autoencoder (VQ-VAE), to encode the new modality into discrete tokens. However, this approach requires vocabulary expansion and introduces information loss during the auto-encoding process.

How LLaMA-MESH Works:
LLaMA-MESH addresses this challenge by representing the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. Specifically:
● It uses the OBJ file format, a widely adopted text-based standard for 3D models comprising vertex coordinates and face definitions.
● It treats the numerical values of vertex coordinates and face definitions as a sequence of text.
● It quantizes the vertex coordinates into a fixed number of bins (64 per axis) to avoid the long token sequences that raw floating-point coordinates would produce (a minimal sketch of this serialization follows the list).
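
To make this concrete, here is a minimal Python sketch of the OBJ-style serialization with per-axis quantization into 64 bins. The exact normalization, rounding, and function names are illustrative assumptions rather than the paper's actual preprocessing code.

```python
# Minimal sketch of serializing a mesh as OBJ-style plain text with vertex
# coordinates quantized into 64 bins per axis, as described above. The
# normalization and rounding details are illustrative assumptions, not the
# paper's exact preprocessing code.

def quantize_vertices(vertices, bins=64):
    """Map float (x, y, z) coordinates to integer bins in [0, bins - 1] per axis."""
    xs, ys, zs = zip(*vertices)
    mins, maxs = (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
    quantized = []
    for vertex in vertices:
        q = []
        for value, lo, hi in zip(vertex, mins, maxs):
            span = (hi - lo) or 1.0  # avoid division by zero on flat axes
            q.append(int(round((value - lo) / span * (bins - 1))))
        quantized.append(tuple(q))
    return quantized

def mesh_to_obj_text(vertices, faces, bins=64):
    """Serialize a mesh as OBJ-style text that an LLM can read and emit directly."""
    lines = [f"v {x} {y} {z}" for x, y, z in quantize_vertices(vertices, bins)]
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]  # OBJ is 1-based
    return "\n".join(lines)

# Example: a single triangle.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]
print(mesh_to_obj_text(verts, faces))
# v 0 0 0
# v 63 0 0
# v 0 63 0
# f 1 2 3
```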

This design capitalizes on the extensive knowledge embedded in pretrained LLMs, which demonstrate a native ability to represent 3D structures in text.
LLaMA-MESH fine-tunes a pretrained LLaMA-3.1-8B-Instruct model on a curated dataset that includes text-3D pairs and interleaved text-3D dialogues. The model is trained to:
1. Generate 3D meshes from text prompts
2. Produce interleaved outputs of text and 3D meshes in a conversational setup (a hypothetical training sample is sketched after this list)
3. Describe meshes in natural language
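
To illustrate what such a conversational sample might look like, below is a hypothetical chat-format training example. The message schema, prompt wording, and OBJ content are assumptions for illustration; only the overall pattern of a text request answered with OBJ text plus natural-language commentary follows the description above.

```python
# Hypothetical sketch of one interleaved text-3D training example in a chat
# format. The message schema, prompt wording, and the pyramid OBJ content are
# assumptions for illustration only.

training_example = {
    "messages": [
        {"role": "user",
         "content": "Create a 3D model of a simple square pyramid."},
        {"role": "assistant",
         "content": (
             "Here is a square pyramid:\n"
             "v 0 0 0\nv 63 0 0\nv 63 0 63\nv 0 0 63\nv 32 63 32\n"
             "f 1 2 3\nf 1 3 4\nf 1 2 5\nf 2 3 5\nf 3 4 5\nf 4 1 5\n"
             "It has a square base (vertices 1-4) and an apex (vertex 5)."
         )},
    ]
}
```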

The training dataset is constructed using a combination of:
● Rule-based approach: Simple patterns are designed to teach the LLM the correspondence between text and 3D representations (sketched below).
● LLM-based augmentation: Pretrained LLMs generate complex text-3D dialogues based on sample dialogues and textual descriptions of 3D objects.
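
For a sense of what a rule-based pattern could look like, here is a small, hypothetical sketch: instruction templates are paired with existing OBJ text to form supervised examples. The template strings and helper name are assumptions, not the paper's actual rules.

```python
# A rough illustration of the rule-based idea: pairing simple instruction
# templates with existing OBJ text to form supervised examples. The template
# strings and the helper name are assumptions; the paper's actual patterns
# are not reproduced here.

import random

TEMPLATES = [
    "Create a 3D model of {caption}.",
    "Can you build {caption} as an OBJ mesh?",
    "I need a 3D mesh of {caption}.",
]

def make_text_to_mesh_sample(caption, obj_text):
    """Turn one (caption, OBJ text) pair into a prompt/response training sample."""
    prompt = random.choice(TEMPLATES).format(caption=caption)
    return {"prompt": prompt, "response": obj_text}

# obj_text below is a truncated placeholder, not a complete mesh.
sample = make_text_to_mesh_sample("a low-poly chair", "v 0 0 0\nv 63 0 0\n...")
```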

To preserve the LLM’s language capabilities, the final dataset also includes general conversation data from UltraChat.

Results:
LLaMA-MESH successfully generates high-quality 3D meshes from text prompts, produces diverse outputs, and maintains the language understanding and reasoning capabilities of the base LLM. It achieves mesh generation quality comparable to models trained from scratch while being significantly more efficient in training.

Limitations and Future Work:
The paper acknowledges several limitations of LLaMA-MESH, including potential loss of geometric detail due to vertex quantization, constraints on mesh complexity imposed by context length limitations, and slight degradation in language ability after fine-tuning. Future work could focus on addressing these limitations by exploring more efficient encoding schemes, methods to handle longer context lengths, techniques to improve geometric precision, and incorporating more diverse datasets for training.

Conclusion:
LLaMA-MESH represents a significant step toward integrating multimodal content generation within a cohesive language model, paving the way for more intuitive and efficient workflows in 3D content creation driven by language-based instructions.
