About the presenter: a senior scientist at Brain Corporation, San Diego, CA
ABSTRACT: Today, autonomous robots remain confined to controlled laboratory settings, factory environments, or artificial games. In part, this is because of the challenges they face in making sense of the real world. Vision is a particularly important sense for autonomy alongside humans, but numerous dynamic physical regularities continually conspire to change the way things appear: specular reflections, shadows, shifts in apparent hue, and motion blur, to name a few that we recognize and have words for. These regularities are far too numerous for experts to program in by hand, and they depend on temporal and spatial context in ways that purely feedforward “deep learning” networks cannot capture. As a result, robots fail at simple real-world tasks. To tackle this problem, we drew on ideas from cognitive science and neuroscience. Inspired by theories of cortical processing, we hypothesized that a system capable of predicting the world and its dynamics would develop an understanding of it. We designed a scalable model with extensive feedback connectivity, which appears hierarchical but is in fact massively recurrent. We trained the model in an unsupervised fashion on continuous video streams, with the sole task of predicting the next video frame. To assess the perceptual capabilities of the trained system, we froze learning in the network and applied supervised learning to a separate decoding, or “readout”, network attached to the model. The combined predictive vision model and readout network tracked objects in new videos with high accuracy, even under challenging lighting conditions that robots might encounter. These results suggest a new class of models that can learn on their own to predict, and therefore “understand”, the world around them.
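The abstract describes a two-stage protocol: unsupervised next-frame prediction, then freezing the model and training a supervised readout on top of it. As a minimal, self-contained sketch of that protocol only (not the presenter's actual recurrent architecture), the toy example below substitutes a linear ridge-regression predictor and a linear readout on synthetic one-dimensional "video"; every function name, the data generator, and the model class are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the two-stage protocol in the abstract:
#   stage 1: unsupervised next-frame prediction on a video stream,
#   stage 2: freeze the predictor and train a supervised "readout".
# The real model is a massively recurrent network; a linear ridge
# regressor stands in here so the sketch stays self-contained.

def make_stream(n_frames=200, width=16):
    """A 1-D 'video': a Gaussian blob drifting across a strip of pixels."""
    positions = np.linspace(0.0, width - 1.0, n_frames)
    pixels = np.arange(width)
    frames = np.exp(-0.5 * (pixels[None, :] - positions[:, None]) ** 2)
    return frames, positions

frames, positions = make_stream()

# Stage 1 (unsupervised): fit a next-frame predictor by ridge regression
# on (x_t, x_{t+1}) pairs -- the only "label" is the next frame itself.
X, Y = frames[:-1], frames[1:]
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)

# Stage 2 (supervised): W is now frozen. Its predictions serve as
# features for a separate linear readout that decodes blob position.
features = X @ W                              # frozen predictive features
w_read = np.linalg.solve(
    features.T @ features + 1e-3 * np.eye(features.shape[1]),
    features.T @ positions[1:],
)

decoded = features @ w_read
err = np.abs(decoded - positions[1:]).mean()  # mean tracking error (pixels)
```

On this toy stream the frozen predictor's output is informative enough for the linear readout to track the blob to within roughly a pixel; the point is the protocol (predict, freeze, decode), not the model class.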