Learning to Attend to Salient Targets in Driving Videos Using Fully Convolutional RNN
IEEE International Conference on Intelligent Transportation Systems (ITSC) 2018
Driving involves processing rich audio, visual, and haptic signals to make safe and calculated decisions on the road. Human vision plays a crucial role in this task, and analysis of gaze behavior could provide insight into the action a driver takes upon seeing an object or region. A typical representation of gaze behavior is a saliency map. This paper aims to predict that saliency map from a sequence of image frames. We develop strategies to address important topics in video saliency, including active gaze (i.e., gaze that is useful for driving), pixel- and object-level information, and suppression of non-negative pixels in the saliency maps. These strategies enable the development of a novel pixel- and object-level saliency ground-truth dataset built from real-world driving data around traffic intersections. We further propose a fully convolutional RNN architecture that handles time-sequence image data to estimate the saliency map. Our methodology shows promising results.