GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

¹Shanghai Jiao Tong University, ²Shanghai AI Lab, ³Beihang University

* Indicates corresponding author

Teaser image: GenHOI synthesizes interactions between a human and an unseen object, conditioned on text.

Abstract

While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly because large-scale 4D HOI datasets are scarce. We introduce GenHOI, a two-stage framework with two objectives: 1) generalization to unseen objects and 2) synthesis of high-fidelity 4D HOI sequences. In the first stage, an Object-AnchorNet reconstructs sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets and thereby reducing the dependence on large-scale 4D HOI datasets. In the second stage, a Contact-Aware Diffusion Model (ContactDM) interpolates the sparse 3D HOI keyframes into dense, temporally coherent 4D HOI sequences. To improve the quality of the generated sequences, we propose a Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a Contact-Aware HOI Attention to integrate the contact signals into the diffusion model. Experiments show that GenHOI achieves state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization to unseen objects while enabling high-fidelity 4D HOI generation.
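
To make the "sparse keyframes to dense sequence" idea concrete, the following is a minimal sketch of the data flow only. It is not the authors' method: plain linear interpolation stands in for ContactDM, and the function name `interpolate_keyframes`, the pose representation (a flat list of floats per frame), and the frame indices are all assumptions made for illustration.

```python
from bisect import bisect_right


def interpolate_keyframes(keyframe_times, keyframe_poses, num_frames):
    """Naive stand-in for the second stage: fill in frames between keyframes.

    keyframe_times: sorted frame indices, assumed to start at 0 and end at
                    num_frames - 1 (hypothetical convention for this sketch).
    keyframe_poses: one flat list of floats per keyframe (hypothetical
                    pose format, e.g. concatenated human and object states).
    Returns a list of num_frames poses obtained by linear interpolation.
    """
    if len(keyframe_times) != len(keyframe_poses) or len(keyframe_times) < 2:
        raise ValueError("need at least two keyframes with matching poses")

    dense = []
    for t in range(num_frames):
        # Locate the keyframe segment [t0, t1] that contains frame t.
        i = bisect_right(keyframe_times, t) - 1
        i = max(0, min(i, len(keyframe_times) - 2))
        t0, t1 = keyframe_times[i], keyframe_times[i + 1]
        w = (t - t0) / (t1 - t0)  # 0 at the left keyframe, 1 at the right
        p0, p1 = keyframe_poses[i], keyframe_poses[i + 1]
        dense.append([(1 - w) * a + w * b for a, b in zip(p0, p1)])
    return dense


# Three sparse keyframes expanded to a 9-frame sequence.
dense = interpolate_keyframes(
    [0, 4, 8],
    [[0.0, 0.0], [4.0, 2.0], [8.0, 0.0]],
    num_frames=9,
)
```

Linear interpolation of this kind ignores contact and physical plausibility entirely, which is the gap the abstract attributes to the learned, contact-aware second stage.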

Video Presentation on Unseen Objects

Facing the back of the whitechair, lift the whitechair, move the whitechair, and then place the whitechair on the floor.

Lift the whitechair, move the whitechair, and put down the whitechair.

Kick the whitechair, and set it back down.

Lift the smalltable, move the smalltable and put down the smalltable.

Lift the smalltable above your head, walk, and put the smalltable down.

Push the smalltable.

Lift the suitcase, rotate the suitcase, and set it back down.

Pull the tripod, and set it back down.

Hold and turn the tripod around to a different orientation.

Video Presentation on Unseen Objects from 3D-FUTURE

Push the largebox, and set it back down.

Pull the largetable, and set it back down.

Pull the largetable, and set it back down.

Comparison with Other Methods on Unseen Objects

OMOMO

CHOIS

HOI-Diff

BibTeX

@misc{li2025genhoigeneralizingtextdriven4d,
        title={GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects}, 
        author={Shujia Li and Haiyu Zhang and Xinyuan Chen and Yaohui Wang and Yutong Ban},
        year={2025},
        eprint={2506.15483},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2506.15483}, 
}