InterPet4D: A Multimodal 4D Human–Pet Interaction Dataset for Pet Motion Generation

¹Institute of Science Tokyo   ²Carnegie Mellon University
*Equal contribution

InterPet4D is the first large-scale multimodal 4D dataset of natural human–dog interactions, with synchronized multi-view and egocentric video, audio, and 3D motion of both humans and dogs.

Abstract

Human–pet interaction estimation and generation remain underexplored due to the absence of high-quality, large-scale datasets. We present InterPet4D, the first multimodal dataset capturing natural interactions between humans and dogs. Using a synchronized multi-view capture system, we record humans and dogs performing obedience tasks and provide multi-view and egocentric videos, audio tracks, and annotations for both humans and dogs, including segmentations, 2D/3D keypoints, and meshes. InterPet4D consists of 6.8 million frames collected from 13 dogs of 11 breeds interacting with 23 human participants.

Results

Data Capture

Motion capture setup

Qualitative Examples

Example 1 — Jump, Turn, Back

Example 2 — Sit, Hand, Petting