In recent years, the popularity of new sensing modalities like ladar and depth cameras has provided a rich new source of data for computer perception to explore. Understanding such data in context is perhaps best cast as a problem of structured prediction. The traditional approach to structured prediction problems is to craft a graphical model structure, learn parameters for the model, and perform inference using an efficient and usually approximate inference approach, e.g., graph-cut methods, belief propagation, or variational methods. Unfortunately, while remarkably powerful methods for inference have been developed, and substantial theoretical insight has been achieved especially for simple potentials, the combination of learning and approximate inference for graphical models is still poorly understood and limited in practice. Within computer vision, for instance, there is a common belief that more sophisticated representations and energy functions are necessary to achieve high performance, but these are difficult to handle with theoretically sound inference and learning procedures.

An alternate view is to consider approximate inference as a procedure: we can view an iterative procedure like belief propagation on a random field as a network of computational modules that take observations and other local computations on the graph (messages), and produce intermediate output messages and final output classifications over the nodes of the random field. This approach has shown significant promise in the quality of the resulting predictions on computer vision tasks, the speed of inference and training, and theoretical understanding. The resulting network of predictive modules is often tremendously deep (up to ~10^6 computational modules), taking perceptual features to semantic predictions. We demonstrate that multi-modal data presents both new challenges and new advantages that are well addressed by inference machines.
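The message-passing view described above can be sketched as a sequence of unrolled, learned predictor modules. The following is a minimal illustrative sketch, not the authors' implementation: the predictor form (a single softmax-linear module per round), the neighbor-averaging of messages, and all names are assumptions made for exposition.

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def inference_machine(features, adjacency, weights_per_round):
    """Unrolled inference as a network of predictor modules.

    features: (n_nodes, d) local observations per node
    adjacency: list of neighbor-index lists, one per node
    weights_per_round: one (d + k, k) matrix per unrolled round
    """
    n = features.shape[0]
    k = weights_per_round[0].shape[1]
    beliefs = np.full((n, k), 1.0 / k)  # uniform initial "messages"
    for W in weights_per_round:         # one learned module per round
        # each node's incoming message: mean of its neighbors' predictions
        msgs = np.stack([
            beliefs[nb].mean(axis=0) if nb else np.full(k, 1.0 / k)
            for nb in adjacency
        ])
        # module: local features + messages -> refined class beliefs
        beliefs = softmax(np.hstack([features, msgs]) @ W)
    return beliefs  # final per-node classifications

# toy chain graph: 3 nodes, 2 classes, 2-dimensional observations
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
adj = [[1], [0, 2], [1]]
Ws = [rng.normal(size=(4, 2)) for _ in range(3)]  # 3 unrolled rounds
out = inference_machine(X, adj, Ws)
```

In this view, each round's weights are trained as an ordinary supervised predictor, rather than derived from a single learned energy function.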
In particular, we present a structure that is appropriate for inference in such data. Further, we demonstrate that multi-modality enables very efficient use of unlabeled data to learn representations through co-regularization, which encourages the predictions from each modality to agree wherever they overlap. We relate the resulting approaches to previous techniques, including CCA and graphical model approaches. Finally, we demonstrate performance on difficult problems in multi-modal scene understanding. This is joint work with Daniel Munoz and Martial Hebert.
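The co-regularization idea can be written as a small illustrative objective: each modality's predictor pays a supervised loss on its labeled data, plus a penalty for disagreeing with the other modality's predictions on unlabeled points that both sensors observe. The function names, the squared-error losses, and the specific disagreement penalty below are assumptions for exposition, not the formulation from the talk.

```python
import numpy as np

def co_regularized_loss(f_a, f_b, Xl_a, Xl_b, y, Xu_a, Xu_b, lam=1.0):
    """Toy co-regularization objective over two modality predictors.

    f_a, f_b: predictors for modality A (e.g. image) and B (e.g. ladar)
    Xl_a, Xl_b: labeled inputs per modality, with shared labels y
    Xu_a, Xu_b: unlabeled inputs where the modalities overlap
    lam: weight on the agreement penalty
    """
    # supervised fit on labeled data for each modality
    supervised = (np.mean((f_a(Xl_a) - y) ** 2) +
                  np.mean((f_b(Xl_b) - y) ** 2))
    # agreement term: predictions should match where sensors overlap
    disagreement = np.mean((f_a(Xu_a) - f_b(Xu_b)) ** 2)
    return supervised + lam * disagreement

# toy linear predictors over 1-D inputs
f_img = lambda X: X @ np.array([0.5])
f_ladar = lambda X: X @ np.array([0.4])
Xl = np.array([[1.0], [2.0]])
y = np.array([0.5, 1.0])
Xu = np.array([[3.0], [4.0]])
loss = co_regularized_loss(f_img, f_ladar, Xl, Xl, y, Xu, Xu, lam=0.1)
```

Minimizing such an objective uses the unlabeled overlap as a constraint between modalities, which is the mechanism that lets unlabeled data shape the learned representations.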