InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects
Abstract
In this paper, we propose a novel task of text-controlled human-object interaction generation in 3D scenes with movable objects. Existing human-scene interaction datasets suffer from insufficient interaction categories and typically consider only interactions with static objects (i.e., object positions do not change), and collecting such datasets with movable objects is difficult and costly. To address this problem, we construct the InteractMove dataset for Movable Human-Object Interaction in 3D Scenes by aligning existing human-object interaction data with scene contexts. The dataset features three key characteristics: 1) scenes containing multiple movable objects with text-controlled interaction specifications, including same-category distractors that require spatial and 3D scene-context understanding; 2) diverse object types and sizes with varied interaction patterns (one-hand, two-hand, etc.); and 3) physically plausible object manipulation trajectories. With the introduction of movable objects, the task becomes more challenging: the model must accurately identify the object to be interacted with, learn to interact with objects of different sizes and categories, and avoid collisions between movable objects and the scene. To tackle these challenges, we propose a novel pipeline. We first use a 3D visual grounding model to identify the interaction object. We then propose hand-object joint affordance learning to predict contact regions for different hand joints and object parts, enabling accurate grasping and manipulation of diverse objects. Finally, we optimize interactions with local-scene modeling and collision-avoidance constraints, ensuring physically plausible motions and avoiding collisions between objects and the scene. Comprehensive experiments demonstrate our method's superiority in generating physically plausible, text-compliant interactions compared to existing approaches.
Method Overview

Our construction process emphasizes the following key principles: (1) Movable Target Objects: Diverse objects are placed in semantically appropriate areas of the scene, together with multiple distractors of the same category, so that locating the target requires spatial understanding. (2) Physically Coherent Motion Alignment: Human motion sequences are adjusted to achieve realistic interactions with objects at different positions. (3) Scene-Aware Filtering for Physical Plausibility: The aligned motion-scene pairs are filtered to remove cases violating physical constraints, such as foot-ground detachment, boundary overflow, or human-object collisions.
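As a rough illustration of the scene-aware filtering step, the sketch below checks the three failure cases named above on a candidate motion-scene pair. The thresholds, joint layout, and point-based collision proxy are simplifying assumptions for illustration, not the exact procedure used to build the dataset.

```python
# A minimal sketch of scene-aware filtering (assumption: thresholds and the
# point-based collision proxy are illustrative, not the dataset's exact rules).
import numpy as np

def passes_physical_checks(joints, object_pts, scene_bounds,
                           floor_z=0.0, foot_tol=0.05, overlap_tol=0.02):
    """joints: (T, J, 3) human joint positions over T frames.
    object_pts: (T, N, 3) sampled points of the manipulated object.
    scene_bounds: (2, 3) axis-aligned min/max corners of the scene."""
    # 1) Foot-ground detachment: the lowest joint should stay near the floor
    #    in every frame (hypothetical tolerance foot_tol).
    lowest = joints[..., 2].min(axis=1)                  # (T,)
    if np.any(lowest > floor_z + foot_tol):
        return False

    # 2) Boundary overflow: all joints must stay inside the scene bounding box.
    lo, hi = scene_bounds
    if np.any(joints < lo) or np.any(joints > hi):
        return False

    # 3) Human-object collision: reject frames where non-hand joints come
    #    closer to the object than an overlap tolerance (crude stand-in for a
    #    mesh-level penetration test; assumes the last two joints are hands).
    body = joints[:, :-2]                                # (T, J-2, 3)
    dists = np.linalg.norm(body[:, :, None] - object_pts[:, None], axis=-1)
    if np.any(dists.min(axis=(1, 2)) < overlap_tol):
        return False
    return True
```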

We begin with 3D object grounding, using a pretrained grounding module conditioned on the text to identify the target object's point cloud for the next stage. Next, we perform hand-object affordance learning using an affordance diffusion module. The learned affordance represents the likelihood of interactions occurring between hand joints and object surfaces over time and guides interaction motion generation, enabling more accurate interactions aligned with object size and interaction semantics. Finally, we incorporate collision-aware motion generation, which voxelizes the region around the interactive object to evaluate spatial accessibility and applies a collision-aware loss that enforces physically plausible motion and prevents interpenetration. Conditioned on the text, the local scene, and the learned affordance, our model generates physically plausible motion sequences that align with both interaction semantics and environmental constraints.
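To make the collision-aware component concrete, here is a minimal PyTorch sketch of voxelizing the local region around the interactive object and penalizing generated points that fall into occupied voxels. The voxel resolution, cube size, and the way the occupancy grid is sampled are illustrative assumptions, not the paper's exact design.

```python
# A minimal PyTorch sketch of local-scene voxelization plus a collision loss
# (assumption: resolution, cube extent, and loss form are illustrative).
import torch
import torch.nn.functional as F

def build_local_occupancy(scene_pts, center, half_extent=1.0, res=32):
    """Voxelize scene points within a cube of side 2*half_extent around `center`.

    scene_pts: (M, 3) scene points with the manipulated object removed.
    Returns an occupancy grid of shape (res, res, res) indexed as (z, y, x),
    matching grid_sample's (x, y, z) coordinate convention below.
    """
    local = (scene_pts - center) / half_extent               # normalize to [-1, 1]
    inside = (local.abs() <= 1.0).all(dim=-1)
    idx = ((local[inside] + 1.0) * 0.5 * (res - 1)).long().clamp(0, res - 1)
    occ = torch.zeros(res, res, res)
    occ[idx[:, 2], idx[:, 1], idx[:, 0]] = 1.0               # (z, y, x) indexing
    return occ

def collision_loss(pred_pts, occ, center, half_extent=1.0):
    """Penalize generated points (T, N, 3) that land in occupied voxels."""
    grid = occ[None, None]                                   # (1, 1, D, H, W)
    coords = (pred_pts - center) / half_extent               # (T, N, 3), (x, y, z)
    # Points outside the local cube sample zero occupancy (padding_mode='zeros').
    sampled = F.grid_sample(grid, coords[None, :, :, None, :],
                            align_corners=True, padding_mode='zeros')
    return sampled.mean()                                    # grows as motion penetrates the scene
```

In training, such a penalty would be added to the generation objective so that predicted body and object trajectories are pushed out of occupied regions of the local scene.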
More Results
The person inspects the cube on the table near the sofa.
The person eats the apple on the desk.
The person drinks from the bowl on the table near the bin.
BibTeX
@inproceedings{cai-etal-2025-interactmove,
  title={InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects},
  author={Xinhao Cai and Minghang Zheng and Xin Jin and Yang Liu},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  year={2025}
}