Human–pet interaction estimation and generation remain underexplored due to the absence of high-quality, large-scale datasets. We present InterPet4D, the first multimodal dataset capturing natural interactions between humans and dogs. Using a synchronized multi-view capture system, we record human–dog obedience tasks and provide data and annotations for both humans and dogs, including multi-view and egocentric videos, segmentations, 2D/3D keypoints, meshes, and audio tracks. InterPet4D consists of 6.8 million frames collected from 13 dogs of 11 breeds interacting with 23 human participants.
Example 1 — Jump, Turn, Back
Example 2 — Sit, Hand, Petting