InfiniMotion: Mamba in Mamba for Long Motion Generation

 

1ANU 2Monash 3AIML 4AI Geeks 5UCAS 6MBZUAI 7PKU

 

Corresponding author.

Abstract

Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods largely target short motions and struggle to produce long motion sequences effectively: (1) handling a long motion sequence as a single input is prohibitively expensive computationally; (2) breaking the generation of a long sequence into shorter segments can produce inconsistent transitions and requires interpolation or inpainting, which does not model the sequence as a whole. To address these challenges, we introduce InfiniMotion, a pioneering Mamba-in-Mamba architecture that recurrently updates both the Transformer's memory and the Mamba's memory to enhance long-motion generation. To improve the effectiveness and robustness of human motion generation, we introduce a similarity-based masking strategy that selectively masks key frames in the temporal dimension and key joints in the spatial dimension. Additionally, we propose a hybrid architecture comprising the Mask Temporal Mamba and the Mask Spatial Transformer, designed to process temporal and spatial information independently. Notably, our method achieves more than a 15% improvement in FID on the BABEL dataset compared to previous state-of-the-art approaches, demonstrating substantial progress in long motion generation.

Visualization

A person walks forward, a person bends down looking like a chicken, a person running and pick up an object.
A person bend down with both hands and feet touch ground like a bridge, stand up and do jogging.
A person sit down and stand up for multiple time. A person doing a forward kick with. A person scratches their head, A person with arm up and to the side, lowers arms and rotates in circular motion.
A person sit down with both knees. A person is doing jumping jacks like work-out.
A man put both arms in parallel, raise up above head, touch head with both hands, greet someone for a while, walking forward carefully like touching the rail.
A man walks only one step forward. A man doing swimming in the pool.
A man is dancing along the music. A man make a pose like putting both hands in front of the chest.
A man leans forward to pick up an object slightly to his left, and places it down slightly to his right. A person is doing jumping jacks.
A person reach out both arms away from legs, use left hand to pick up some thing from the ground, sit down and stand up again.
A person stretches their left shoulder, bringing it back , rotating it and then switches to their right shoulder, bringing it back twice, rotating it twice. They then move to warming up or stretching out their elbow joint by bringing their hand to their shoulder and then extend it fully and turn their inner arm outwards. Someone working out the left arm.
Putting both hands in front the chest and slowly moving both arms parallel to the ground. A person step back a bit, sitting on a chair.
A person kneels down on both knees. A person walks slowly forward holding handrail with right hand.
The man reaches to the ground for something places it on the table then reaches for another thing. A person is acting like a monkey. A person takes a quick step backwards and to their left. A person pokes their right hand along the ground, like they might be planting seeds.
The person makes a right turn. A person grabbing something in front of them, swinging it around to the side then throwing it overhead.
A person is leaning and checking their surroundings. A person slowly moved in right direction. A person moves their left arm up and the right arm toward their chest as a gesture.

Comparison

Methods compared: TEACH, PriorMDM, FlowMDM, InfiniMotion. Each prompt below is shown for all four methods.

A man reaching both hands to the sky while walking, a man then jump upward for only once.
A man doing a frog jump forward, a man squat on the ground.
The man walk forward, the man stops and raised up hands .

Methodology

 

 


This diagram illustrates the main architecture of our proposed method, InfiniMotion. The core of InfiniMotion is (d), a Mamba-in-Mamba structure that recurrently updates the memory of both the Transformer and the Mamba to support long motion generation. Components (a) and (c) illustrate the similarity-based masking applied to the temporal and spatial dimensions, respectively. Component (b) shows the fundamental building block of InfiniMotion, the Hybrid Masking Block, which consists of the Mask Temporal Mamba and the Mask Spatial Transformer and processes temporal and spatial information separately.
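To make the idea of similarity-based masking more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this page does not specify how similarity is computed or how "key" frames are chosen, so the sketch assumes that a key frame is one with low mean cosine similarity to the other frames. The names `cosine`, `key_indices`, and `mask`, the `ratio` parameter, and the constant mask value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def key_indices(items, ratio):
    """Return sorted indices of the items least similar to the rest.

    ASSUMPTION (not from the paper page): an item is "key" when its mean
    cosine similarity to all other items is low. `ratio` is the fraction
    of items to select; at least one item is always selected.
    """
    n = len(items)
    k = max(1, int(ratio * n))
    scores = []
    for i in range(n):
        others = [cosine(items[i], items[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 1.0)
    order = sorted(range(n), key=lambda i: scores[i])  # least similar first
    return sorted(order[:k])


def mask(items, indices, mask_value=0.0):
    """Replace the selected items with a constant vector (hypothetical mask token)."""
    chosen = set(indices)
    return [[mask_value] * len(x) if i in chosen else list(x)
            for i, x in enumerate(items)]


# Four toy "frames" with two features each; frame 2 points in a different direction.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 0.0]]
selected = key_indices(frames, ratio=0.25)
masked_frames = mask(frames, selected)
```

Under the same assumption, the spatial case could reuse `key_indices` on per-joint feature vectors instead of per-frame vectors, so that joints rather than frames are masked.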

Performance

 

 


InfiniMotion presents a Mamba-in-Mamba architecture that recurrently updates both the Transformer's and Mamba's memory to enable efficient and consistent long-motion generation.
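The benefit of recurrently updated memory for long sequences can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not InfiniMotion itself: it uses a scalar linear recurrence `h_t = a*h_{t-1} + b*x_t` rather than Mamba's selective scan or any Transformer memory, and the function names and the constants `a`, `b`, `c` are hypothetical. It shows only that carrying the state across segments reproduces the result of processing the whole sequence at once, while each step touches only one short segment.

```python
def scan_segment(xs, h, a=0.9, b=0.1, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t over one segment.

    Returns the segment outputs and the final state, which serves as the
    memory handed to the next segment.
    """
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys, h


def scan_in_segments(xs, segment_len, h0=0.0):
    """Process a long sequence segment by segment, carrying the state forward."""
    ys, h = [], h0
    for start in range(0, len(xs), segment_len):
        seg_ys, h = scan_segment(xs[start:start + segment_len], h)
        ys.extend(seg_ys)
    return ys, h


# A toy 23-step input, processed whole and in segments of 4 steps.
xs = [float(i % 5) for i in range(23)]
full_ys, full_h = scan_segment(xs, 0.0)
seg_ys, seg_h = scan_in_segments(xs, segment_len=4)
same = all(abs(p - q) < 1e-12 for p, q in zip(full_ys, seg_ys))
```

If the state were reset to zero at each segment boundary instead of carried over, the outputs after the first segment would generally differ from the full-sequence result, which is the kind of discontinuity that segment-wise generation without shared memory can introduce.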


BibTeX

@article{zhang2024infinimotion,
  title={Infinimotion: Mamba boosts memory in transformer for arbitrary long motion generation},
  author={Zhang, Zeyu and Liu, Akide and Chen, Qi and Chen, Feng and Reid, Ian and Hartley, Richard and Zhuang, Bohan and Tang, Hao},
  journal={arXiv preprint arXiv:2407.10061},
  year={2024}
}