Motion generation conditioned on inputs such as text and music has been extensively studied in computer vision. While specialized models exist for text-to-motion (T2M) and music-to-dance (M2D) generation, even unified models that support multiple conditioning modalities typically process only one type of input at a time, and none of them generate avatars or background music. To address these limitations, this paper makes the following contributions. (1) We propose Motion Anything, a method that handles multiple conditioning modalities simultaneously to generate 4D avatars with background music and text queries. (2) We design a Temporal Adaptive Transformer that adaptively aligns the different conditioning modalities over time to control motion generation in a time-sensitive manner, together with a Spatial Aligning Transformer that maps action text to specific body-part movements and matches music genres to corresponding dance styles. (3) We develop an attention-based spatial and temporal mask modeling strategy for more effective autoregressive generation. (4) We introduce a Selective Rigging Mechanism that improves automatic rigging of 3D meshes with skeletons. (5) We construct a new dataset, Text-Music-Dance (TMD), consisting of 2,153 paired samples of text, music, and dance, twice the size of AIST++. (6) We conduct extensive experiments on standard benchmarks across motion generation tasks; our method improves FID by 15% on HumanML3D and achieves consistent gains on AIST++.
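To make the attention-based mask modeling idea concrete, the sketch below shows one way temporal masks could be derived from condition-to-motion attention rather than chosen at random. This is a minimal PyTorch illustration under our own assumptions, not the paper's implementation; the function name, tensor shapes, and the `mask_ratio` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F

def attention_guided_temporal_mask(motion_tokens, cond_tokens, mask_ratio=0.5):
    """Pick which motion frames to mask, guided by condition-to-motion attention.

    Args:
        motion_tokens: (B, T, D) motion token embeddings.
        cond_tokens:   (B, L, D) condition embeddings (e.g. text or music features).
        mask_ratio:    fraction of frames to mask (hypothetical hyperparameter).
    Returns:
        Boolean mask of shape (B, T); True marks frames to be masked and reconstructed.
    """
    d = motion_tokens.shape[-1]
    # Cross-attention scores: how strongly each condition token attends to each frame.
    scores = torch.einsum("bld,btd->blt", cond_tokens, motion_tokens) / d ** 0.5
    attn = F.softmax(scores, dim=-1)        # (B, L, T)
    frame_saliency = attn.mean(dim=1)       # (B, T): average attention received per frame

    # Mask the most condition-relevant frames, so the generator must reconstruct
    # exactly the parts of the sequence that the text/music conditions point at.
    num_masked = max(1, int(mask_ratio * motion_tokens.shape[1]))
    top_idx = frame_saliency.topk(num_masked, dim=-1).indices
    mask = torch.zeros_like(frame_saliency).scatter_(1, top_idx, 1.0).bool()
    return mask


# Toy usage: 2 sequences of 64 motion tokens conditioned on 16 text tokens.
motion = torch.randn(2, 64, 256)
text = torch.randn(2, 16, 256)
mask = attention_guided_temporal_mask(motion, text, mask_ratio=0.4)  # (2, 64) boolean
```

An analogous spatial variant would score body-part tokens instead of frames; the design intent in both cases is that masking condition-relevant regions gives the autoregressive generator a stronger learning signal than uniform random masking.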