Anomalous Toy Assembly (ATA) dataset


Introduction

The ATA dataset is the first public dataset for studying sequential anomalies in instructional videos. 32 volunteers assembled three toys (an airplane, a table, and a record player) in a lab environment. Each participant completed each assembly three times, resulting in nine sequences per person across the three tasks. Four ZED Mini cameras recorded the participants from four viewpoints (front, side, overhead, and global).


The ATA dataset includes 1152 untrimmed RGB videos, totaling 24.8 hours, with a resolution of 1920×1080 and a frame rate of 30 fps. The dataset contains 15 atomic actions, such as "fasten screw" and "take plate", and 11 error classes.
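For a quick sanity check after downloading, the short OpenCV sketch below reads one clip's resolution, frame rate, and length; the file path and extension are assumptions about a local copy, not part of the release.

```python
import cv2  # pip install opencv-python

# Hypothetical local path to one ATA recording; adjust to your copy.
video_path = "ATA/videos/P1_1_T1_C1.mp4"

cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
    raise IOError(f"Could not open {video_path}")

width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected: 1920
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected: 1080
fps = cap.get(cv2.CAP_PROP_FPS)                   # expected: ~30
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

print(f"{width}x{height} @ {fps:.1f} fps, "
      f"{n_frames / fps / 60:.1f} minutes")
cap.release()
```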

 

Dataset statistics

Participants: 32
Assembly tasks: 3 (airplane, table, record player)
Camera views: 4 (front, top, side, global), synchronized
Videos: 1152 untrimmed RGB videos, 24.8 hours in total
Resolution / frame rate: 1920×1080 at 30 fps
Atomic action classes: 15
Error classes: 11

Data Splits

We split the participants into training, validation, and test sets of 27, 1, and 4 participants, respectively. The validation and test sets include videos with sequential anomalies, defined as unexpected permutations of the training transcripts, such as redundant actions, skipped actions, and major changes in the order of training action subsequences. The dataset includes 96 training transcripts and two separate sets of 9 validation and 36 test transcripts that are disjoint from the training transcripts. While the sets differ in transcripts, they share the same tasks and actions.
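Because the split is defined per participant, the file names (see Dataset format below) are enough to bucket the videos. The sketch below illustrates this; the participant IDs assigned to validation and test are placeholders, since the official assignment ships with the dataset.

```python
import re
from collections import defaultdict
from pathlib import Path

# Placeholder IDs for illustration only; use the official split that ships
# with the dataset.
VAL_IDS = {28}                # 1 validation participant (hypothetical)
TEST_IDS = {29, 30, 31, 32}   # 4 test participants (hypothetical)

def split_of(participant_id: int) -> str:
    """Map a participant ID to its split; all remaining IDs are training."""
    if participant_id in VAL_IDS:
        return "val"
    if participant_id in TEST_IDS:
        return "test"
    return "train"

# File names follow P@_$_T#_C& (see "Dataset format" below).
name_pattern = re.compile(r"P(\d+)_(\d+)_T(\d+)_C(\d+)")

by_split = defaultdict(list)
for path in Path("ATA/videos").glob("*.mp4"):   # layout/extension assumed
    match = name_pattern.match(path.stem)
    if match:
        by_split[split_of(int(match.group(1)))].append(path)

for split, files in sorted(by_split.items()):
    print(split, len(files), "videos")
```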

Dataset format

Each video/annotation file is named P@_$_T#_C&. In this format, @ is the participant ID, $ is the recording session number for each assembly, # corresponds to the task number (T1: plane, T2: table, T3: record player), and & indicates the camera view. Specifically, C1, C2, C3, and C4 correspond to the front, top, side, and global viewpoints, respectively. All views are synchronized at 30 fps.
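A minimal parser for this naming scheme might look like the sketch below; the helper names are ours for illustration and not part of an official toolkit.

```python
import re
from typing import NamedTuple

TASKS = {1: "plane", 2: "table", 3: "record player"}
VIEWS = {1: "front", 2: "top", 3: "side", 4: "global"}

class AtaClip(NamedTuple):
    participant: int
    session: int
    task: str
    view: str

# Matches names such as "P12_2_T3_C1", with or without a file extension.
_NAME = re.compile(r"P(\d+)_(\d+)_T([123])_C([1-4])")

def parse_name(name: str) -> AtaClip:
    """Decode a P@_$_T#_C& file name into its four fields."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"Not an ATA file name: {name}")
    p, s, t, c = (int(g) for g in m.groups())
    return AtaClip(participant=p, session=s, task=TASKS[t], view=VIEWS[c])

print(parse_name("P12_2_T3_C1.mp4"))
# AtaClip(participant=12, session=2, task='record player', view='front')
```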

JSON files for the spatial annotations contain an annotation for each keypoint (Left Shoulder, Right Shoulder, Belly Button, Head, Right Elbow, Left Elbow, Left Palm, Right Palm) for each frame at 30 fps. The files also contain bounding-box annotations (for all frames) of the object the participant is interacting with, along with a label identifying its class.
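The exact JSON schema is defined by the released files; the sketch below only illustrates how per-frame keypoints and object boxes could be read, with the field names ("frames", "keypoints", "bbox", "label") assumed for illustration.

```python
import json

KEYPOINTS = [
    "Left Shoulder", "Right Shoulder", "Belly Button", "Head",
    "Right Elbow", "Left Elbow", "Left Palm", "Right Palm",
]

# Hypothetical annotation file; the field names below are assumptions, not
# the official schema. Check one released JSON file for the real structure.
with open("ATA/annotations/P1_1_T1_C1.json") as f:
    annotation = json.load(f)

# Assume one entry per frame, aligned with the 30 fps video.
first_frame = annotation["frames"][0]

for name in KEYPOINTS:
    x, y = first_frame["keypoints"][name]       # 2D keypoint location
    print(f"{name}: ({x}, {y})")

x1, y1, x2, y2 = first_frame["bbox"]            # box of the interacted object
print("object label:", first_frame["label"])    # class of that object
```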

Citation

This work has been accepted at ICCV 2023. Please cite this work if you use this dataset.

Download the dataset

The dataset is available for non-commercial use only. You must be affiliated with a university and use your university email address when requesting access. Use this link to make the download request.