Zeyu Zhang

Zeyu Zhang is a researcher in generative AI.

His research interests lie in geometric generative modeling and its applications to multimodal foundation models, world models, embodied AI, and AI for health.

He received his bachelor’s degree from the Australian National University, advised by Prof. Richard Hartley and Prof. Ian Reid.

News

(03/11/2026) 🎉 Glad to receive the Berkeley Fellowship!
(02/21/2026) 🎉 Our paper GeoWorld has been accepted to CVPR 2026!
(01/27/2026) 🎉 Our paper VaseVQA-3D has been accepted to ICLR 2026!
(12/04/2025) 🎉 Our paper BlockVid has been shared in Daily Papers by AK!
(11/27/2025) 🎉 Glad to receive the Australasian Undergraduate Research Medal!
(09/18/2025) 🎉 Our papers FlashMo and FPSAttention have been accepted to NeurIPS 2025!
(08/05/2025) 🎉 Our paper 3D-R1 has been shared in Daily Papers by AK!
(07/02/2024) 🎉 Our paper Motion Mamba has been accepted to ECCV 2024!
(03/13/2024) 🎉 Our paper Motion Mamba has been shared in Daily Papers by AK!

Publications

Selected publications are highlighted. (^*Equal contribution. ^✝Project lead. ^✉Corresponding author.)

GeoWorld: Geometric World Models
Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley
*CVPR 2026*
GeoWorld introduces hyperbolic energy-based world models with geometric reinforcement learning, enabling stable long-horizon visual planning and hierarchical reasoning and outperforming V-JEPA-2 on the CrossTask and COIN benchmarks.

Decoupling Defense Strategies for Robust Image Watermarking
Jiahui Chen, Zehang Deng, Zeyu Zhang, Chaoyang Li, Lianchen Jia, Lifeng Sun
*CVPR 2026*
AdvMark decouples defenses via two-stage fine-tuning, preserving clean accuracy while achieving strong robustness against adversarial, distortion, and regeneration attacks.

StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes
Zhengri Wu^*, Yiran Wang^*, Yu Wen^*, Zeyu Zhang^✝, Biao Wu, Hao Tang^✉
*ICRA 2026*
StereoAdapter is a self-supervised adaptive model that enables robust underwater depth estimation.

VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
Nonghai Zhang^*, Zeyu Zhang^✝, Jiazi Wang^*, Yang Zhao, Hao Tang^✉
*ICLR 2026*
VaseVQA-3D is a 3D visual question-answering dataset for ancient Greek pottery featuring 664 annotated vase models, and VaseVLM is a domain-adaptive vision-language model trained for cultural heritage analysis.

Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding
Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zeyu Zhang, Zheng Zhu, Guan Huang, Sirui Han, Xinggang Wang
*ICLR 2026*
Motion-R1 combines decomposed Chain-of-Thought reasoning with reinforcement learning to better capture temporal causality in text-to-motion generation, achieving state-of-the-art quality and alignment across benchmarks.

HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming
Jiahui Chen, Bo Peng, Lianchen Jia, Zeyu Zhang, Tianchi Huang, Lifeng Sun
*ICLR 2026*
HiVid uses LLMs as human proxies to generate chunk-level importance weights for VOD and live streaming, improving QoE via perception, global ranking, and low-latency prediction.

FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation
Zeyu Zhang^*, Yiran Wang^*, Danning Li^*, Dong Gong, Ian Reid, Richard Hartley
*NeurIPS 2025*
FlashMo introduces a geometric factorized interpolant and frequency-sparse attention, enabling scalable, efficient 3D motion diffusion. Experiments show superior quality, efficiency, and scalability over state-of-the-art baselines.

FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
Akide Liu^*, Zeyu Zhang^*, Zhexin Li, Xuehai Bai, Yuanjie Xing, Yizeng Han, Jiasheng Tang, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang
*NeurIPS 2025* *Spotlight*
FPSAttention is a training-aware FP8 quantization and sparsity co-design for video diffusion models. It achieves up to 7.09× kernel speedup and 4.96× end-to-end speedup without quality loss by aligning 3D tile granularity, denoising-step adaptation, and hardware-efficient kernels.

ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
Weijie Wang, Yuedong Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang
*NeurIPS 2025*
ZPressor is an architecture-agnostic module that compresses multi-view inputs for scalable feed-forward 3DGS.

TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning
Hongyang He, Xinyuan Song, Yangfan He, Zeyu Zhang, Yanshu Li, Haochen You, Lifan Sun, Wenqiao Zhang
*NeurIPS 2025*
TRiCo introduces a triadic game-theoretic co-training framework with two students, a meta-learned teacher, and an adversarial generator, leveraging mutual information pseudo-labeling to achieve state-of-the-art semi-supervised learning performance.

OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Luyao Tang, Chaoqi Chen, Yuxuan Yuan, Zeyu Zhang, Yue Huang, Kun Zhang
*CVPR 2025*
Foundation models struggle with distribution shifts and weak supervision. We propose OCRT, a framework that extracts high-level concepts and relations, improving the generalizability of SAM and CLIP across diverse tasks.

Efficient Learning With Sine-Activated Low-rank Matrices
Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, Simon Lucey
*ICLR 2025*
We propose a novel theoretical framework that integrates a sinusoidal function into low-rank decomposition, enhancing parameter efficiency and model accuracy across diverse neural network applications such as Vision Transformers, Large Language Models, Neural Radiance Fields, and 3D shape modeling.

Motion Mamba: Efficient and Long Sequence Motion Generation
Zeyu Zhang^*, Akide Liu^*, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang
*ECCV 2024*
Human motion generation is a key goal in generative computer vision. We propose Motion Mamba, a model using state space models (SSMs) with Hierarchical Temporal Mamba (HTM) and Bidirectional Spatial Mamba (BSM) blocks. It achieves up to 50% FID improvement and 4× speedup on the HumanML3D and KIT-ML datasets, demonstrating efficient, high-quality long-sequence motion modeling.

Research Projects

BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang^✉, Fan Wang, Bohan Zhuang^✉
BlockVid is a semi-AR block diffusion framework equipped with semantic sparse KV caching, block forcing, and noise scheduling. The work also introduces LV-Bench, a fine-grained benchmark for minute-long videos with dedicated metrics for evaluating long-range coherence.

VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Angen Ye^*, Zeyu Zhang^*, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, Zheng Zhu^✉
VLA-R1 is a reasoning-enhanced vision-language-action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains.

Nav-R1: Reasoning and Navigation in Embodied Scenes
Qingxiang Liu^*, Ting Huang^*, Zeyu Zhang^*✝, Hao Tang^✉
Nav-R1 is an embodied foundation model that integrates dialogue, reasoning, planning, and navigation capabilities to enable intelligent interaction and task execution in 3D environments.

3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Ting Huang^*, Zeyu Zhang^✝, Hao Tang^✉
3D-R1 is an open-source generalist model that enhances the reasoning of 3D VLMs for unified scene understanding.

Motion Anything: Any to Motion Generation
Zeyu Zhang^*, Yiran Wang^*, Wei Mao, Danning Li, Akira Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley
Motion Anything advances multimodal motion generation with an Any-to-Motion framework, introducing Attention-based Mask Modeling for fine-grained control. It also introduces TMD, a large text-music-dance dataset, and achieves state-of-the-art results.

Research Experience

Research Assistant
Peking University
Jul 2024 - Present
Spatial intelligence and embodied AI, working with Asst. Prof. Hao Tang (PKU).

Research Assistant
La Trobe University
Apr 2024 - Present
3D generation and AI for Health, working with Dr. Yang Zhao (La Trobe University).

Education

Bachelor of Science (Advanced) (Honours)
The Australian National University (ANU)
Jul 2021 - Jun 2025
Major: Computer Science, Minor: Mathematics, First Class Honours (H1), GPA: 6.656/7

Honors and Awards

Berkeley Fellowship, UC Berkeley, Mar 2026.
Australasian Undergraduate Research Medal, Australasian Council for Undergraduate Research (ACUR), Nov 2025.
Chancellor's Letter of Commendation, The Australian National University, Jul 2025.
NRF Vacation Scholarship, NeuroSurgical Research Foundation, Oct 2023.
Flinders Summer Research Scholarship, Flinders University CMPH, Nov 2022.
UNSW Science Vacation Research Scholarship, UNSW Sydney, Oct 2022.

Academic Services

Conference Reviewer: CVPR 2025/2026, ICLR 2025/2026, AAAI 2026, MM 2025, IJCAI 2025, MICCAI 2025, CHI 2025, 3DV 2026, VR 2025.

Talks

(03/03/2026) Latest Advances in Embodied Reasoning @ CVLife. [Recording/Slides]
(11/12/2025) FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion @ Alibaba DAMO Academy. [Recording/Slides]
(10/17/2025) Video World Models: Learning the Physical World from Videos @ Zhejiang University. [Recording/Slides]
(10/15/2025) How RL Enhances Spatial Understanding? @ NVIDIA Spatial Intelligence Lab. [Recording/Slides]
(10/09/2025) How RL Enhances Spatial Understanding? @ 3DCVer. [Recording/Slides]
(09/19/2025) Grounding Foundation Models to the Real World @ Peking University. [Recording/Slides]
(09/18/2025) Spatial Intelligence: From Virtual to Real Worlds @ Yahaha. [Recording/Slides]
(07/22/2024) Motion Mamba: Efficient and Long Sequence Motion Generation @ miHoYo. [Slides]
