Abstract

Motion generation conditioned on inputs such as text and music has been extensively studied in computer vision. Specialized models exist for text-to-motion (T2M) and music-to-dance (M2D) generation, and some unified models handle multimodal conditioning, but they can process only one type of input at a time and cannot generate avatars or background music. To address these limitations, this paper makes the following contributions. (1) We propose Motion Anything, a method that handles multiple modalities simultaneously to generate 4D avatars with background music from text queries. (2) We design the Temporal Adaptive Transformer, which adaptively aligns the different condition modalities to control motion generation in a time-sensitive manner, and the Spatial Aligning Transformer, which maps action text to specific body-part movements and aligns music genres with corresponding dance styles. (3) We develop an attention-based spatial and temporal mask modeling approach for more effective autoregressive generation. (4) We introduce a Selective Rigging Mechanism for improved automatic rigging of 3D meshes with skeletons. (5) We create a new dataset, Text-Music-Dance (TMD), consisting of 2,153 paired samples of text, music, and dance, making it twice as large as AIST++. (6) We conduct extensive experiments on standard benchmarks across various motion generation tasks; our method achieves a 15% improvement in FID on HumanML3D and shows consistent performance gains on AIST++.

Text to Motion

Methods compared: BAD, BAMM, MoMask, Motion Anything.
Prompts:
"A man walk forward with both hands above head."
"A man walk clockwise in a circle."
"A man picks up something and throw it away."

Music to Dance

Methods compared: EDGE, Lodge, Bailando, Motion Anything.
Music tracks:
Marshall Jefferson - Move Your Body (Chicago House)
Stardust - Music Sounds Better With You (French House)
Paul Kalkbrenner - Sky and Sand (Tech House)

Text and Music to Dance

Methods compared: TM2D, MotionCraft, Motion Anything.
Prompts:
"A man is doing groove and swaying steps along with the beat with Daft Punk - Get Lucky (Disco)."
"A man is dancing along the beats while use both hands to touch legs and swing back and forth with Daft Punk - One More Time (French House)."

Text to Music and Dance

Methods compared: TM2D, MotionCraft, Motion Anything.
Prompts:
"A man is doing street dance, kick side step along the beats, with an energetic dance track with 120–130 BPM, vibrant synths, punchy beats, and uplifting melodies."
"A man alternates lifting his arms overhead, following the beats, with an energetic dance track with 120–130 BPM, dynamic bass lines, punchy beats, and modern electronic elements."