Publications
Highlighted
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
International Conference on Learning Representations (ICLR)
·
2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
International Conference on Computer Vision (ICCV)
·
2025
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
International Conference on Computer Vision (ICCV)
·
2025
End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations
International Conference on Machine Learning (ICML)
·
2024
All
2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
International Conference on Computer Vision (ICCV)
·
2025
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
Neural Information Processing Systems, Datasets and Benchmarks Track (NeurIPS D&B)
·
2025
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Neural Information Processing Systems (NeurIPS)
·
2025
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
International Conference on Computer Vision (ICCV)
·
2025
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
International Conference on Learning Representations (ICLR)
·
2025
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
International Conference on Learning Representations (ICLR)
·
2025
2024
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Neural Information Processing Systems, Datasets and Benchmarks Track (NeurIPS D&B)
·
2024
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Neural Information Processing Systems, Datasets and Benchmarks Track (NeurIPS D&B)
·
2024
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
International Conference on Learning Representations (ICLR)
·
2024
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
·
2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
European Conference on Computer Vision (ECCV)
·
2024
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
International Conference on Learning Representations (ICLR)
·
2024
An Embodied Generalist Agent in 3D World
International Conference on Machine Learning (ICML)
·
2024
End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations
International Conference on Machine Learning (ICML)
·
2024
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
European Conference on Computer Vision (ECCV)
·
2024
Neural-Symbolic Recursive Machine for Systematic Generalization
International Conference on Learning Representations (ICLR)
·
2024
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
·
2024
2023
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
International Conference on Computer Vision (ICCV)
·
2023
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics
International Conference on Learning Representations (ICLR)
·
2023
Exploring Data Geometry for Continual Learning
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
·
2023
SQA3D: Situated Question Answering in 3D Scenes
International Conference on Learning Representations (ICLR)
·
2023
Learning non-Markovian Decision-Making from State-only Sequences
Neural Information Processing Systems (NeurIPS)
·
2023
Meta-causal Learning for Single Domain Generalization
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
·
2023
Learning to Optimize on Riemannian Manifolds
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
·
2023
Curvature-Adaptive Meta-Learning for Fast Adaptation to Manifold Data
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
·
2023
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
Neural Information Processing Systems (NeurIPS)
·
2023