In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas like film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences, and integrating these two aspects remains a persistent challenge. Additionally, while avatar and motion generation predominantly target humans, extending these techniques to animals is still difficult due to inadequate training data and methods. To bridge these gaps, our paper presents three key contributions. First, we propose a novel agent-based approach named Motion Avatar, which enables the automatic generation of high-quality, customizable human and animal avatars with motions from text queries, significantly advancing dynamic 3D character generation. Second, we introduce an LLM planner that coordinates both motion and avatar generation, recasting discriminative planning as a customizable Q&A interaction. Lastly, we present an animal motion dataset named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65 animal categories, together with its building pipeline ZooGen, which serves as a valuable resource for the community.
The figure illustrates various examples of animal motion generated by Motion Avatar, demonstrating its ability to produce high-quality motion and mesh for both human and animal characters.
Motion Avatar utilizes an LLM-agent-based approach to handle user queries and produce tailored prompts. These prompts drive both the generation of motion sequences and the creation of 3D meshes. Motion generation follows an autoregressive process, while mesh generation operates within an image-to-3D framework. The generated mesh then undergoes an automatic rigging process, allowing the motion to be retargeted to the rigged mesh.
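The orchestration described above can be sketched as follows. This is a minimal illustrative outline, not the actual implementation: `plan_prompts`, `generate_motion`, `generate_mesh`, `auto_rig`, and `retarget` are hypothetical placeholders standing in for the LLM planner, the autoregressive motion model, the image-to-3D mesh generator, and the rigging/retargeting stages.

```python
# Hypothetical sketch of the Motion Avatar pipeline. Every function body
# here is an illustrative placeholder for the corresponding real component.

def plan_prompts(query: str) -> tuple[str, str]:
    """LLM-planner stage: split a user query into motion and mesh prompts."""
    return (f"motion prompt for: {query}", f"mesh prompt for: {query}")

def generate_motion(prompt: str) -> dict:
    # Placeholder for the autoregressive text-to-motion model.
    return {"type": "motion", "prompt": prompt}

def generate_mesh(prompt: str) -> dict:
    # Placeholder for the image-to-3D mesh generator.
    return {"type": "mesh", "prompt": prompt}

def auto_rig(mesh: dict) -> dict:
    # Placeholder for automatic rigging of the generated mesh.
    return {**mesh, "rigged": True}

def retarget(motion: dict, rigged_mesh: dict) -> dict:
    # Placeholder: apply the motion sequence to the rigged mesh.
    return {"motion": motion, "mesh": rigged_mesh}

def motion_avatar(query: str) -> dict:
    motion_prompt, mesh_prompt = plan_prompts(query)
    motion = generate_motion(motion_prompt)
    rigged = auto_rig(generate_mesh(mesh_prompt))
    return retarget(motion, rigged)
```

The key design point is that a single text query fans out into two specialized prompts, so the motion and mesh stages can run independently before being joined at the retargeting step.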
The diagram illustrates the process of our proposed ZooGen. Initially, SinMDM is employed to edit and augment motions from Truebones Zoo. Subsequently, Video-LLaMA is used to describe each motion in a paragraph, which is then refined using LLaMA-70B. Finally, the motion captions undergo human review and are collected as the textual descriptions in the Zoo-300K dataset.
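The four stages above can be summarized as a simple data-processing loop. This is a hedged sketch only: `edit_motion`, `caption_motion`, `refine_caption`, and `human_review` are hypothetical stand-ins for the SinMDM, Video-LLaMA, LLaMA-70B, and manual-review stages, and the variant count is arbitrary.

```python
# Hypothetical sketch of the ZooGen pipeline. Each stage is a placeholder
# for the corresponding model or manual step named in the caption above.

def edit_motion(motion: str) -> list[str]:
    # SinMDM stage: produce edited/augmented variants of a source motion.
    return [f"{motion}_variant{i}" for i in range(3)]

def caption_motion(motion: str) -> str:
    # Video-LLaMA stage: describe the motion in a paragraph.
    return f"raw caption of {motion}"

def refine_caption(caption: str) -> str:
    # LLaMA-70B stage: polish the raw description.
    return caption.replace("raw caption", "refined caption")

def human_review(caption: str, motion: str) -> bool:
    # Manual stage: accept or reject a candidate text-motion pair.
    return True

def zoogen(source_motions: list[str]) -> list[tuple[str, str]]:
    """Build text-motion pairs from a list of source motion clips."""
    pairs = []
    for motion in source_motions:
        for variant in edit_motion(motion):
            caption = refine_caption(caption_motion(variant))
            if human_review(caption, variant):
                pairs.append((caption, variant))
    return pairs
```

Structuring the pipeline this way keeps the expensive captioning and refinement stages per-variant, while human review acts as a final filter before a pair enters the dataset.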
@article{zhang2024motionavatar,
title={Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion},
author={Zhang, Zeyu and Wang, Yiran and Wu, Biao and Chen, Shuo and Zhang, Zhiyuan and Huang, Shiya and Zhang, Wenbo and Fang, Meng and Chen, Ling and Zhao, Yang},
journal={arXiv preprint arXiv:2405.11286},
year={2024}
}