FlashMo

Abstract

Diffusion models have recently advanced 3D human motion generation by producing smoother and more realistic sequences from natural language. However, existing approaches face two major challenges: high computational cost during training and inference, and limited scalability due to their reliance on the U-Net inductive bias. To address these challenges, we propose FlashMo, a frequency-aware sparse motion diffusion model that prunes low-frequency tokens to improve efficiency without custom kernel design. We further introduce MotionSiT, a scalable diffusion transformer based on a joint-temporal factorized interpolant with Lie group geodesics over $SO(3)$ manifolds, enabling principled generation of joint rotations. Extensive experiments on the large-scale MotionHub V2 dataset and on standard benchmarks, including HumanML3D and KIT-ML, demonstrate that our method significantly outperforms previous approaches in motion quality, efficiency, and scalability. Compared to the state-of-the-art 1-step distillation baseline, FlashMo reduces inference time by 12.9% and FID by 34.1%.

Observation 1: Motion Frequency

Frequency of human motion. The diagram illustrates that dynamic motions exhibit higher frequencies, while static motions correspond to lower frequencies. This observation provides key insights for the frequency-aware sparsification design of our FlashMo.

Observation 2: Head-Level Pattern

Frequency magnitude vs. attention heatmap of attention heads. The frequency magnitude is computed with the Fast Fourier Transform (FFT) and averaged over 100 latent motion features. Brighter colors indicate higher magnitudes. Pixels closer to the center represent lower frequencies. Both maps are visualized from the last layer of MotionSiT.
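The kind of frequency map described in this caption can be approximated with a few lines of NumPy. This is a minimal sketch, not the code used to produce the figure: the function name `mean_frequency_magnitude`, the feature shape `(N, T, D)`, the log scaling, and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np

def mean_frequency_magnitude(features):
    """Average centered FFT magnitude over a batch of 2D feature maps.

    features: array of shape (N, T, D), e.g. N latent motion features
    with T tokens and D channels (shapes are an assumption).
    Returns a (T, D) map where the zero frequency sits at the center.
    """
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    # Shift the zero-frequency bin to the center, so that pixels
    # closer to the center correspond to lower frequencies.
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))
    # Log scaling is a common visualization choice (assumption).
    magnitude = np.log1p(np.abs(spectrum))
    return magnitude.mean(axis=0)

# Hypothetical stand-in for 100 latent motion features: noise plus a
# constant offset, so most of the energy sits at the zero frequency.
rng = np.random.default_rng(0)
latents = rng.standard_normal((100, 16, 16)) + 5.0
freq_map = mean_frequency_magnitude(latents)
peak = np.unravel_index(freq_map.argmax(), freq_map.shape)
```

With real features from the last layer of MotionSiT in place of `latents`, `freq_map` could be rendered as a heatmap and compared against the attention map of each head.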

Architecture

FlashMo architecture. FlashMo leverages frequency-aware sparsification and a MotionSiT backbone to efficiently and scalably generate motion from noised latent inputs.
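To make the idea of pruning low-frequency tokens concrete, here is one plausible scoring rule, sketched under stated assumptions. It is not the FlashMo implementation: the function `prune_low_frequency_tokens`, the high-pass filter along the token axis, and the parameters `keep_ratio` and `cutoff` are all hypothetical choices made for illustration.

```python
import numpy as np

def prune_low_frequency_tokens(tokens, keep_ratio=0.5, cutoff=0.25):
    """Keep the tokens that carry the most high-frequency energy.

    tokens: array of shape (T, D), one row per temporal token.
    keep_ratio and cutoff are hypothetical hyperparameters.
    Returns the kept tokens and their original indices.
    """
    num_tokens = tokens.shape[0]
    spectrum = np.fft.rfft(tokens, axis=0)
    freqs = np.fft.rfftfreq(num_tokens)      # cycles per token, in [0, 0.5]
    spectrum[freqs < cutoff] = 0.0           # remove low-frequency content
    high_pass = np.fft.irfft(spectrum, n=num_tokens, axis=0)
    # Score each token by the norm of its high-pass residual.
    score = np.linalg.norm(high_pass, axis=1)
    num_keep = max(1, int(round(keep_ratio * num_tokens)))
    keep = np.sort(np.argsort(score)[-num_keep:])  # preserve temporal order
    return tokens[keep], keep

# Toy sequence: a static first half followed by a rapidly oscillating half.
t = np.arange(32)
signal = np.where(t >= 16, (-1.0) ** t, 0.0)
toy_tokens = np.tile(signal[:, None], (1, 4))
kept_tokens, kept_idx = prune_low_frequency_tokens(toy_tokens, keep_ratio=0.5)
```

In this toy case the static half of the sequence is dropped and the oscillating half is kept, which mirrors the observation that static motions correspond to lower frequencies. Because the selection is plain indexing, a dense attention layer can run on the shorter sequence without a custom kernel.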

Geometric Interpolants

To improve scalability, we design geometric factorized interpolants. Unlike the standard SiT, which interpolates all dimensions jointly in Euclidean space, we apply temporal–spatial factorized interpolation with Lie group geodesics on the manifold of joint rotations. This approach enables training the diffusion model directly on the $so(3)$ representation, leading to more stable training and better scalability on large datasets such as MotionHub V2.
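For a single joint, a geodesic on the rotation manifold between two rotations $R_0$ and $R_1$ can be written as $R_t = R_0 \exp\big(t \log(R_0^\top R_1)\big)$, which replaces the Euclidean straight line $(1-t)\,x_0 + t\,x_1$. The sketch below implements this textbook formula with NumPy. It illustrates the geometric ingredient only and makes no claim about how FlashMo factorizes the interpolant over joints and time; the helper names `hat`, `exp_so3`, `log_so3`, and `geodesic_interpolate` are chosen here for illustration.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def exp_so3(w):
    """Exponential map so(3) -> SO(3) using Rodrigues' formula."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near the identity
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def log_so3(R):
    """Logarithm map SO(3) -> so(3) as an axis-angle vector (angle < pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    vee = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * vee
    return (theta / (2.0 * np.sin(theta))) * vee

def geodesic_interpolate(R0, R1, t):
    """Point at fraction t along the geodesic from R0 to R1 on SO(3)."""
    return R0 @ exp_so3(t * log_so3(R0.T @ R1))

# Example: halfway between the identity and a 90-degree turn about z.
R_start = np.eye(3)
R_end = exp_so3(np.array([0.0, 0.0, np.pi / 2]))
R_mid = geodesic_interpolate(R_start, R_end, 0.5)
```

Every intermediate `R_t` is itself a valid rotation matrix, whereas a linear blend of two rotation matrices generally is not. That property is what makes geodesic interpolation a natural fit for joint rotations.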

Scalability

Scaling trend. The figure demonstrates the scaling trends of different denoiser designs (U-Net, DiT, SiT, and MotionSiT) with varying proportions of pretraining data. The results show that our MotionSiT exhibits superior scalability and outperforms other methods.

Efficiency

Efficiency comparison. The figure demonstrates that FlashMo achieves the lowest inference time, training time, model size, and FLOPs while maintaining superior performance compared to other methods.

BibTeX

@inproceedings{zhang2025flashmo,
  title={FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation},
  author={Zhang, Zeyu and Wang, Yiran and Li, Danning and Gong, Dong and Reid, Ian and Hartley, Richard},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}

FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation

NeurIPS 2025

News: