Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion

Zeyu Zhang¹*, Yiran Wang¹*, Biao Wu²*, Shuo Chen³, Zhiyuan Zhang⁴, Shiya Huang⁴, Wenbo Zhang⁴, Meng Fang⁵, Ling Chen², Yang Zhao⁶✉

¹ The Australian National University ² University of Technology Sydney
³ Monash University ⁴ The University of Adelaide
⁵ University of Liverpool ⁶ La Trobe University

* Equal contribution. ✉ Corresponding author.


News:

(05/23/2024) 🎉 Our paper has been promoted by AI Bites!
(05/22/2024) 🎉 Our paper has been promoted by Language Model Digest!
(05/21/2024) 🎉 Our paper has been promoted by CSVisionPapers!

Abstract

In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas such as film-making, video games, AR/VR, and human-robot interaction. However, current efforts concentrate on either generating the 3D avatar mesh alone or producing motion sequences alone, and integrating the two remains a persistent challenge. Additionally, avatar and motion generation predominantly target humans, and extending these techniques to animals remains difficult due to inadequate training data and methods. To bridge these gaps, our paper presents three key contributions. First, we propose a novel agent-based approach named Motion Avatar, which automatically generates high-quality, customizable human and animal avatars with motions from text queries, advancing dynamic 3D character generation. Second, we introduce an LLM planner that coordinates both motion and avatar generation, recasting discriminative planning as a customizable Q&A task. Third, we present an animal motion dataset named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65 animal categories, together with its building pipeline ZooGen, as a resource for the community.

Visualization

A demon girl with blue hair is running fast.

A demon girl with blue hair spins rapidly.

A demon girl with blue hair holds her left foot with her left hand.

Luigi is bending his leg.

Luigi is doing kung fu.

Luigi is rolling his hands.

Michelin Man is dancing zumba.

Michelin Man is looking around himself.

Michelin Man raises his hand and shakes.

A yellow humanoid Yacuruna does a baseball hit.

A yellow humanoid Yacuruna is crossing his arms.

A yellow humanoid Yacuruna is jumping.

A red robot is boxing.

A red robot is doing hip-hop dancing.

A red robot is saluting.

A bear runs then stands then runs again.

A bear walks then stands.

A bear runs then walks hesitantly.

A jaguar attacks then dies.

A jaguar gets up then walks.

A bear walks then attacks.

A wolf attacks then howls.

A wolf runs then attacks then runs.

A wolf walks then dies.

A horse attacks, then walks backward, and attacks again.

A horse walks, then rears up, and then walks again.

A horse rears up and then walks forward.

An anaconda first coils up and then attacks.

An anaconda first swings its tail and then rises.

An anaconda first attacks and then spins.

A raptor first walks, then attacks, and then walks again.

A raptor first walks then dies.

A raptor first walks, then roars, and then walks again.

More Visualization

The figure illustrates various examples of animal motion generated by Motion Avatar, demonstrating its ability to produce high-quality motion and mesh for both human and animal characters.

Methodology

Motion Avatar uses an LLM-agent-based approach to manage user queries and produce tailored prompts. These prompts drive both the generation of motion sequences and the creation of 3D meshes. Motion generation follows an autoregressive process, while mesh generation operates within an image-to-3D framework. The generated mesh then undergoes automatic rigging, which allows the motion to be retargeted to the rigged mesh.
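The flow described above can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the released implementation: the names `Plan`, `AnimatedAvatar`, `run_pipeline`, and `toy_planner` are hypothetical, and every stage is passed in as a plain callable so that toy stand-ins can replace the actual models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    """Prompts produced by the planner (hypothetical structure)."""
    motion_prompt: str  # describes what the character does
    avatar_prompt: str  # describes what the character looks like


@dataclass
class AnimatedAvatar:
    """Placeholder result: a rigged mesh plus the motion applied to it."""
    rigged_mesh: str
    motion: List[str]


def run_pipeline(
    query: str,
    planner: Callable[[str], Plan],
    generate_motion: Callable[[str], List[str]],
    generate_mesh: Callable[[str], str],
    auto_rig: Callable[[str], str],
    retarget: Callable[[List[str], str], AnimatedAvatar],
) -> AnimatedAvatar:
    # 1. The LLM agent turns the user query into tailored prompts.
    plan = planner(query)
    # 2. Motion and mesh are generated from their respective prompts.
    motion = generate_motion(plan.motion_prompt)
    mesh = generate_mesh(plan.avatar_prompt)
    # 3. The mesh is rigged, then the motion is retargeted onto it.
    rigged = auto_rig(mesh)
    return retarget(motion, rigged)


def toy_planner(query: str) -> Plan:
    # Assumption for illustration only: split the query at " is ".
    # A real planner would query an LLM instead.
    subject, _, action = query.partition(" is ")
    return Plan(motion_prompt=action.rstrip("."), avatar_prompt=subject)


# Toy stand-ins so the sketch runs end to end without any models.
result = run_pipeline(
    "A red robot is boxing.",
    planner=toy_planner,
    generate_motion=lambda p: [f"{p}:frame{i}" for i in range(3)],
    generate_mesh=lambda p: f"mesh({p})",
    auto_rig=lambda m: f"rigged({m})",
    retarget=lambda motion, mesh: AnimatedAvatar(rigged_mesh=mesh, motion=motion),
)
print(result)
```

Passing each stage as a callable keeps the ordering constraint visible: rigging must finish before retargeting, while motion and mesh generation do not depend on each other.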

Dataset

The diagram illustrates our proposed ZooGen pipeline. First, SinMDM is used to edit and enhance motions from Truebones Zoo. Next, Video-LLaMA describes each motion in a paragraph, which is then refined using LLaMA-70B. Finally, the motion captions undergo human review and are collected as the textual descriptions in the Zoo-300K dataset.
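The ZooGen stages can be read as a simple data pipeline. The sketch below is a hedged illustration only: `TextMotionPair` and `build_dataset` are hypothetical names, the four callables stand in for SinMDM, Video-LLaMA, LLaMA-70B, and the human reviewer, and the convention that a reviewer returns `None` to reject a caption is an assumption that the description above does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class TextMotionPair:
    motion_id: str
    caption: str


def build_dataset(
    source_motions: Iterable[str],
    edit_motion: Callable[[str], List[str]],      # stand-in for SinMDM editing
    describe: Callable[[str], str],               # stand-in for Video-LLaMA
    refine: Callable[[str], str],                 # stand-in for LLaMA-70B
    review: Callable[[str, str], Optional[str]],  # human review; None = reject (assumed)
) -> List[TextMotionPair]:
    pairs: List[TextMotionPair] = []
    for source in source_motions:
        # One source clip may yield several edited variants.
        for motion_id in edit_motion(source):
            caption = refine(describe(motion_id))
            approved = review(motion_id, caption)
            if approved is not None:
                pairs.append(TextMotionPair(motion_id, approved))
    return pairs


# Toy stand-ins so the sketch runs without any models or annotators.
pairs = build_dataset(
    ["bear_walk"],
    edit_motion=lambda m: [f"{m}_v{i}" for i in range(2)],
    describe=lambda m: f"  a clip of {m}  ",
    refine=lambda text: text.strip().capitalize() + ".",
    review=lambda m, c: c if m.endswith("_v0") else None,
)
print(pairs)
```

Keeping the review step as the last gate means only approved captions reach the collected text-motion pairs, which matches the order of the stages described above.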

BibTeX

@article{zhang2024motion,
  title={Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion},
  author={Zhang, Zeyu and Wang, Yiran and Wu, Biao and Chen, Shuo and Zhang, Zhiyuan and Huang, Shiya and Zhang, Wenbo and Fang, Meng and Chen, Ling and Zhao, Yang},
  journal={arXiv preprint arXiv:2405.11286},
  year={2024}
}