Text-to-motion generation holds great potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) current methods struggle to handle long motion sequences as a single input because of prohibitively high computational cost; (2) breaking long motion sequences into shorter segments can result in inconsistent transitions and requires interpolation or inpainting, which lacks modeling of the entire sequence. To address these challenges, we introduce InfiniMotion, a pioneering Mamba-in-Mamba architecture that recurrently updates both the Transformer's memory and the Mamba's memory to enhance long motion generation. To improve the effectiveness and robustness of human motion generation, we introduce a similarity-based masking strategy that selectively masks key frames and key joints along the temporal and spatial dimensions, respectively. In addition, we propose a hybrid architecture comprising a Mask Temporal Mamba and a Mask Spatial Transformer, designed to process temporal and spatial information independently. Notably, our method achieves more than a 15% improvement in FID on the BABEL dataset over previous state-of-the-art approaches, demonstrating substantial progress in long motion generation.
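To make the similarity-based masking more concrete, the short sketch below shows one plausible reading of its temporal half: frames that are least similar to their predecessors change the most, so they are treated as key frames and masked. The tensor shapes, mask ratio, and function name are illustrative assumptions rather than the paper's implementation; the spatial side would score joints analogously instead of frames.

import torch
import torch.nn.functional as F

def similarity_based_temporal_mask(motion: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """Mark key frames by low cosine similarity to the previous frame.

    motion: (T, D) per-frame pose features (joints flattened into D).
    Returns a boolean mask of shape (T,), True where a key frame is masked.
    """
    T = motion.shape[0]
    # Similarity between each frame and the one before it; frame 0 has no predecessor.
    sim = F.cosine_similarity(motion[1:], motion[:-1], dim=-1)
    num_masked = max(1, int(mask_ratio * T))
    # The least similar frames change the most, so treat them as key frames.
    key_idx = torch.topk(-sim, k=min(num_masked, T - 1)).indices + 1
    mask = torch.zeros(T, dtype=torch.bool)
    mask[key_idx] = True
    return mask

In the full method, a mask of this kind decides which frames (and, spatially, which joints) the masking blocks treat as masked targets, with the ratio left as a tunable hyperparameter.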
This diagram illustrates the main architecture of our proposed method, InfiniMotion. The core of InfiniMotion is (d) a Mamba-in-Mamba structure that recurrently updates the memory of both the Transformer and the Mamba to support long motion generation. Components (a) and (c) illustrate the similarity-based masking applied to the temporal and spatial dimensions, respectively. Component (b) shows the fundamental building block of InfiniMotion, the Hybrid Masking Block, which consists of the Mask Temporal Mamba and the Mask Spatial Transformer to process temporal and spatial information separately.
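As a rough sketch of how recurrent memory updates let a long sequence be processed segment by segment, the toy module below prepends a fixed set of memory tokens to each segment, runs them through a generic block, and carries the updated memory tokens forward to the next segment. The block choice (a standard Transformer encoder layer standing in for the Hybrid Masking Block), dimensions, and class name are hypothetical simplifications, not the actual InfiniMotion implementation.

import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    # Illustrative only: a fixed-size memory is carried across segments so that
    # each segment is processed with context from everything generated before it.
    def __init__(self, dim: int = 256, mem_len: int = 16):
        super().__init__()
        self.mem_len = mem_len
        self.memory_init = nn.Parameter(torch.zeros(mem_len, dim))
        # Stand-in for the Hybrid Masking Block (Mask Temporal Mamba + Mask Spatial Transformer).
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, segments):
        # segments: list of (B, T_seg, dim) tensors from one long motion sequence.
        memory = self.memory_init.unsqueeze(0).expand(segments[0].shape[0], -1, -1)
        outputs = []
        for seg in segments:
            # Prepend memory tokens so the block can attend across segment boundaries.
            x = self.block(torch.cat([memory, seg], dim=1))
            # Split off the updated memory and keep the segment output.
            memory, out = x[:, : self.mem_len], x[:, self.mem_len:]
            outputs.append(out)
        return outputs

Because the memory has a fixed size, the per-segment cost in this sketch stays constant no matter how long the overall sequence grows, which is the kind of property that makes arbitrarily long generation tractable.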
InfiniMotion presents a Mamba-in-Mamba architecture that recurrently updates both the Transformer's and the Mamba's memory to enable efficient and consistent long motion generation.
@article{zhang2024infinimotion,
title={InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation},
author={Zhang, Zeyu and Liu, Akide and Chen, Qi and Chen, Feng and Reid, Ian and Hartley, Richard and Zhuang, Bohan and Tang, Hao},
journal={arXiv preprint arXiv:2407.10061},
year={2024}
}