His research interests lie in geometric generative modeling and its applications to multimodal foundation models, world models, embodied AI, and AI for health.
Selected publications are highlighted. (*Equal contribution. ✝Project lead. ✉Corresponding author.)
StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes
Zhengri Wu*,
Yiran Wang*,
Yu Wen*,
Zeyu Zhang*✝,
Biao Wu,
Hao Tang✉
ICRA 2026
StereoAdapter is a self-supervised adaptation framework that enables robust stereo depth estimation in underwater scenes.
VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
Nonghai Zhang*,
Zeyu Zhang*✝,
Jiazi Wang*,
Yang Zhao,
Hao Tang✉
ICLR 2026
VaseVQA-3D introduces a 3D visual question-answering dataset for ancient Greek pottery with 664 annotated vase models, accompanied by VaseVLM, a domain-adaptive vision-language model trained for cultural heritage analysis.
Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding
Runqi Ouyang,
Haoyun Li,
Zhenyuan Zhang,
Xiaofeng Wang,
Zeyu Zhang,
Zheng Zhu,
Guan Huang,
Sirui Han,
Xinggang Wang
ICLR 2026
Motion-R1 combines decomposed Chain-of-Thought reasoning with reinforcement learning to better capture temporal causality in text-to-motion generation, achieving state-of-the-art quality and alignment across benchmarks.
HiVid: LLM-Guided Video Saliency for Content-Aware VOD and Live Streaming
Jiahui Chen,
Bo Peng,
Lianchen Jia,
Zeyu Zhang,
Tianchi Huang,
Lifeng Sun
ICLR 2026
HiVid uses LLMs as human proxies to generate chunk-level importance weights for VOD and live streaming, improving QoE via perception, global ranking, and low-latency prediction.
FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation
Zeyu Zhang*,
Yiran Wang*,
Danning Li*,
Dong Gong,
Ian Reid,
Richard Hartley
NeurIPS 2025
FlashMo introduces a geometric factorized interpolant and frequency-sparse attention, enabling scalable and efficient 3D motion diffusion. Experiments show superior quality, efficiency, and scalability over state-of-the-art baselines.
FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
Akide Liu*,
Zeyu Zhang*,
Zhexin Li,
Xuehai Bai,
Yuanjie Xing,
Yizeng Han,
Jiasheng Tang,
Jichao Wu,
Mingyang Yang,
Weihua Chen,
Jiahao He,
Yuanyu He,
Fan Wang,
Gholamreza Haffari,
Bohan Zhuang
NeurIPS 2025 (Spotlight)
FPSAttention is a training-aware FP8 quantization and sparsity co-design for video diffusion models. By aligning 3D tile granularity, denoising-step adaptation, and hardware-efficient kernels, it achieves up to 7.09× kernel and 4.96× end-to-end speedups without quality loss.
ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
Weijie Wang,
Yuedong Chen,
Zeyu Zhang,
Duochao Shi,
Akide Liu,
Bohan Zhuang
NeurIPS 2025
ZPressor is an architecture-agnostic module that compresses multi-view inputs for scalable feed-forward 3DGS.
TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning
Hongyang He,
Xinyuan Song,
Yangfan He,
Zeyu Zhang,
Yanshu Li,
Haochen You,
Lifan Sun,
Wenqiao Zhang
NeurIPS 2025
TRiCo introduces a triadic game-theoretic co-training framework with two students, a meta-learned teacher, and an adversarial generator, leveraging mutual information pseudo-labeling to achieve state-of-the-art semi-supervised learning performance.
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Luyao Tang,
Chaoqi Chen,
Yuxuan Yuan,
Zeyu Zhang,
Yue Huang,
Kun Zhang
CVPR 2025
Foundation models struggle with distribution shifts and weak supervision. We propose OCRT, a framework that extracts high-level concepts and relations to enhance the generalizability of SAM and CLIP across diverse tasks.
Efficient Learning With Sine-Activated Low-rank Matrices
Yiping Ji,
Hemanth Saratchandran,
Cameron Gordon,
Zeyu Zhang,
Simon Lucey
ICLR 2025
We propose a novel theoretical framework that integrates a sinusoidal function into low-rank decomposition, improving parameter efficiency and model accuracy across diverse neural network applications such as Vision Transformers, Large Language Models, Neural Radiance Fields, and 3D shape modeling.
Motion Mamba: Efficient and Long Sequence Motion Generation
Zeyu Zhang*,
Akide Liu*,
Ian Reid,
Richard Hartley,
Bohan Zhuang,
Hao Tang
ECCV 2024
Human motion generation is a key goal in generative computer vision. We propose Motion Mamba, a model built on state space models (SSMs) with Hierarchical Temporal Mamba (HTM) and Bidirectional Spatial Mamba (BSM) blocks, achieving up to a 50% FID improvement and a 4× speedup on the HumanML3D and KIT-ML datasets and demonstrating efficient, high-quality long-sequence motion modeling.
Research Projects
BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Zeyu Zhang,
Shuning Chang,
Yuanyu He,
Yizheng Han,
Jiasheng Tang✉,
Fan Wang,
Bohan Zhuang✉
BlockVid is a semi-AR block diffusion framework equipped with semantic sparse KV caching, block forcing, and noise scheduling. Furthermore, LV-Bench is a fine-grained benchmark for minute-long videos with dedicated metrics to evaluate long-range coherence.
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Angen Ye*,
Zeyu Zhang*,
Boyuan Wang,
Xiaofeng Wang,
Dapeng Zhang,
Zheng Zhu✉
VLA-R1 is a reasoning-enhanced vision–language–action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains.
Nav-R1: Reasoning and Navigation in Embodied Scenes
Qingxiang Liu*,
Ting Huang*,
Zeyu Zhang*✝,
Hao Tang✉
Nav-R1 is an embodied foundation model that integrates dialogue, reasoning, planning, and navigation capabilities to enable intelligent interaction and task execution in 3D environments.
3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Ting Huang*,
Zeyu Zhang*✝,
Hao Tang✉
3D-R1 is an open-source generalist model that enhances the reasoning of 3D VLMs for unified scene understanding.
Motion Anything: Any to Motion Generation
Zeyu Zhang*,
Yiran Wang*,
Wei Mao,
Danning Li,
Akira Zhao,
Biao Wu,
Zirui Song,
Bohan Zhuang,
Ian Reid,
Richard Hartley
Motion Anything advances multimodal motion generation with an Any-to-Motion framework, introducing Attention-based Mask Modeling for fine-grained control. It surpasses prior methods and contributes TMD, a large text-music-dance dataset, achieving state-of-the-art results.
Research Assistant La Trobe University Apr 2024 - Present
3D generation and AI for health, working with Dr. Yang Zhao (La Trobe University).
Research Assistant Monash University Feb 2024 - May 2024
3D/4D generative learning, specifically focusing on text-guided human motion and avatar generation, working with Prof. Reza Haffari (Monash University) and Prof. Bohan Zhuang (ZJU, Monash University).
Bachelor of Science (Advanced) (Honours) The Australian National University (ANU) Jul 2021 - Jun 2025
Major: Computer Science, Minor: Mathematics, First Class Honours (H1), GPA: 6.656/7
Visiting Student Imperial College London Jul 2022
Quantitative Sciences Research Institute (QSRI)