Pre-print: arXiv:2603.18988. Code: github.com/HRI-EU/merge.
GROUND - Honda Research Institute USA
GROUND
Introduction
GROUND (Group Reasoning for Object-centric Understanding of Narrative Dynamics) is a benchmark for fine-grained situational grounding of multi-person and human–robot collaborative interactions. Every event is annotated as a structured actor–action–object tuple with persistent participant identities and timestamps, supporting role-aware reasoning over how actors, objects, and relations evolve through a collaboration — a setting that prior datasets, which focus on single-actor or egocentric activities, do not adequately cover.
The dataset is split into two complementary subsets. Both share a tabletop setup with multiple actors but differ in recording location, background, and task, which adds the diversity needed to test generalization beyond a single environment:
- GROUND-Train – a subset for training and evaluating fine-grained action detection and segmentation.
- GROUND-Eval – an independently recorded evaluation subset annotated with structured actor–action–object relations for event-level reasoning.
GROUND-Train
GROUND-Train comprises 198 unique scenarios, each simultaneously recorded from four distinct camera viewpoints (C1–C4), yielding 792 synchronized video sequences (1920 × 1080 @ 30 fps). The videos capture diverse group configurations — single-person, dyadic, and triadic interactions (1–3 participants) — in which individuals prepare drinks following different recipes, requiring both individual actions and coordinated group activities.
Each frame is comprehensively annotated with:
- Per-person fine-grained action labels, person-wise segmented in time.
- Human bounding boxes and 2D body-pose estimations, linked across the four camera views via cross-view tracking.
- Object bounding boxes with semantic categories.
- Collaborative action labels: Handover, Collaborative Pour, Collaborative Twist, and Collaborative Drop.
Vocabulary Scope
- 95 unique action classes (e.g., hold shaker, place_down glass, handover glass).
- 19 distinct nouns (e.g., shaker, cutting_board, glass, muddler).
- 13 verbs: idle, grasp, handover, cut, place_down, drop, twist, hold, pour, squash, shake, push, stir.
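The example labels above suggest that an action class combines one of the 13 verbs with a noun. The following is a minimal sketch of splitting a label into those two parts. It assumes labels are written as `verb noun` with a single space, as in the examples, and that a label such as `idle` may have no noun; `split_action_label` is a hypothetical helper, not part of the released code.

```python
from typing import Optional, Tuple

# The 13 verbs listed above.
VERBS = {
    "idle", "grasp", "handover", "cut", "place_down", "drop", "twist",
    "hold", "pour", "squash", "shake", "push", "stir",
}


def split_action_label(label: str) -> Tuple[str, Optional[str]]:
    """Split an action label into (verb, noun); noun is None if absent.

    Assumes the "verb noun" layout seen in the examples, e.g. "place_down glass".
    """
    verb, _, noun = label.strip().partition(" ")
    if verb not in VERBS:
        raise ValueError(f"unknown verb in label: {label!r}")
    return verb, (noun.strip() or None)


print(split_action_label("place_down glass"))
print(split_action_label("idle"))
```

Validating the verb against the closed vocabulary catches malformed labels early, which is useful when merging annotations from several files.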
To our knowledge, GROUND-Train is the first multi-view dataset with diverse annotations — including action segmentation, pose, and bounding boxes — for studying multi-person collaborative instructional activities, offering a valuable benchmark for robotics and computer vision.
GROUND-Eval
GROUND-Eval provides an independent evaluation subset, recorded and annotated in a different environment than GROUND-Train, with two persons and a robot in tabletop scenarios captured from the robot's perspective (1024 × 768 RGB frames). It contains 16 recordings spanning three collaboration scenarios:
- Sorting Fruits (sf) — a banana, two apples, and an orange are sorted into a bowl or onto a plate (8 recordings).
- Pouring (po) — a bottle is used to pour liquid into one of several cups (4 recordings).
- Handover (ha) — participants hand various items to each other around the table (4 recordings).
Each scenario is repeated two times in different constellations, where a constellation specifies how many persons (and whether the robot) take part:
| Constellation | Description |
|---|---|
| 1P | 1 person performs actions alone |
| 2P | 2 people perform actions independently or interact |
| 1P+R | 1 person and the robot perform actions independently or interact |
| 2P+R | 2 people and the robot perform actions independently or interact |
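The constellation codes in the table can be represented as a small lookup, sketched below. The `Constellation` class and `actor_count` helper are hypothetical names introduced for illustration; only the four codes and their meanings come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constellation:
    persons: int  # number of human participants
    robot: bool   # whether the robot takes part


# Codes and meanings as given in the table above.
CONSTELLATIONS = {
    "1P": Constellation(persons=1, robot=False),
    "2P": Constellation(persons=2, robot=False),
    "1P+R": Constellation(persons=1, robot=True),
    "2P+R": Constellation(persons=2, robot=True),
}


def actor_count(code: str) -> int:
    """Total number of actors (humans plus robot) for a constellation code."""
    c = CONSTELLATIONS[code]
    return c.persons + int(c.robot)


print(actor_count("2P+R"))
```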
Each Eval frame is annotated with:
- The full scene image.
- A cropped image of each person instance acting in the scene, labeled by ID.
- Sample images of object instances appearing in the scene, labeled by ID.
- A full event description: person ID, action label, whether the robot interacts with the acting person, object ID, and — if applicable — the spatial relation between two objects participating in the action.
Atomic actions observed in GROUND-Eval include idle, grasp, hold, place_down, pour, and handover.
Data Structure
GROUND-Train (multi-view, third-person, 4 cameras):
Train/
├── videos/
│   ├── C1/ P<id>_T*_V*_C1.mp4   (198 videos)
│   ├── C2/ P<id>_T*_V*_C2.mp4   (198 videos)
│   ├── C3/ P<id>_T*_V*_C3.mp4   (198 videos)
│   └── C4/ P<id>_T*_V*_C4.mp4   (198 videos)
├── box_annotations/             # 792 JSON files
│   └── P<id>_T*_V*_C*.json
└── action_annotations/          # 198 .xlsm files
    └── P<id>_T*_V*.xlsm
Each video is named P<subjectID>_T<trial>_V<version>_C<camera>.mp4. The same T_V index is shared across the four cameras, so the four C1–C4 files form temporally synchronized views of one session.
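Because the four views of a session differ only in the camera suffix, they can be grouped by parsing the filename. The sketch below assumes that no field contains an underscore and that the camera index is a single digit from 1 to 4; the exact format of the subject, trial, and version fields is not specified here, so they are kept as strings. `group_views` and the sample filenames are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Follows the scheme P<subjectID>_T<trial>_V<version>_C<camera>.mp4.
# Assumption: fields contain no underscores; camera is 1-4.
NAME_RE = re.compile(
    r"^P(?P<subject>[^_]+)_T(?P<trial>[^_]+)_V(?P<version>[^_]+)"
    r"_C(?P<camera>[1-4])\.mp4$"
)


def group_views(paths):
    """Map (subject, trial, version) -> {camera number: path}."""
    sessions = defaultdict(dict)
    for path in map(Path, paths):
        m = NAME_RE.match(path.name)
        if m is None:
            continue  # skip files that do not follow the naming scheme
        key = (m["subject"], m["trial"], m["version"])
        sessions[key][int(m["camera"])] = path
    return dict(sessions)


# Hypothetical filenames, for illustration only.
example = [f"Train/videos/C{c}/P01_T1_V1_C{c}.mp4" for c in range(1, 5)]
for key, views in group_views(example).items():
    print(key, sorted(views))
```

A session with fewer than four entries in its inner dictionary indicates a missing view, which is worth checking before running any cross-view processing.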
GROUND-Eval (from the robot's perspective, tabletop):
Eval/
└── scene_<id>_<type>/        # 16 scenes (sf / po / ha × constellations)
    ├── images/               # RGB frames (1024 × 768)
    ├── object_images/        # Object reference crops, labeled by ID
    ├── person_images/        # Participant reference crops, labeled by ID
    └── ground_truth.json     # Event timeline, indexed by person ID then timestamp
ground_truth.json is indexed first by person ID (one top-level key per participant in the scene, two for 2P scenes) and then by timestamp, with a list of events per timestamp:
{
  "<person_id>": {
    "<timestamp>": [
      { "object": "object_2", "action": "grasp",
        "robot_interaction": false, "on": "" }
    ]
  }
}
Within each event, object and on reference the per-scene object identities provided in object_images/, action is one of the atomic actions listed above, and robot_interaction flags whether the robot is interacting with the acting person.
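The nested person, timestamp, and event layout described above can be flattened into one record per event. This is a minimal sketch: the `Event` tuple and `iter_events` are hypothetical names, the person ID `person_1` and timestamp `12.5` in the sample are invented placeholders, and timestamps are kept as strings because their format is not specified here.

```python
import json
from typing import Iterator, NamedTuple


class Event(NamedTuple):
    person_id: str
    timestamp: str           # kept as a string; the format is not specified
    action: str
    object_id: str
    on: str                  # empty string when no spatial relation applies
    robot_interaction: bool


def iter_events(ground_truth: dict) -> Iterator[Event]:
    """Yield one Event per entry, walking person -> timestamp -> event list."""
    for person_id, timeline in ground_truth.items():
        for timestamp, events in timeline.items():
            for ev in events:
                yield Event(
                    person_id, timestamp, ev["action"], ev["object"],
                    ev.get("on", ""), ev["robot_interaction"],
                )


# Invented sample in the documented layout; a real file would be read with
# json.load() from the ground_truth.json of a scene directory.
SAMPLE = json.loads("""
{
  "person_1": {
    "12.5": [
      {"object": "object_2", "action": "grasp",
       "robot_interaction": false, "on": ""}
    ]
  }
}
""")

for event in iter_events(SAMPLE):
    print(event)
```

Flattening to tuples makes it straightforward to filter by action or to sort events from several participants onto a single timeline.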
Citation
This dataset corresponds to "MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human–Robot Interaction," published at the IEEE International Conference on Robotics and Automation (ICRA) 2026.
@inproceedings{deigmoeller2026merge,
  title     = {MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning
               and Grounding in Human--Robot Interaction},
  author    = {Deigmoeller, Joerg and Agarwal, Nakul and Hasler, Stephan and
               Tanneberg, Daniel and Belardinelli, Anna and Ghoddoosian, Reza and
               Wang, Chao and Ocker, Felix and Zhang, Fan and Dariush, Behzad and
               Gienger, Michael},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}
Access & License
The dataset is available for non-commercial research purposes. Requesters must be affiliated with a university and use institutional email credentials. Submit requests via the official download form.