Each image has associated questions that interrogate the relations between objects in the scene. For example, a question about the image above might ask: “There is a tiny rubber thing that is the same colour as the large cylinder; what shape is it?”
State-of-the-art results on CLEVR using standard visual question answering architectures are 68.5%, compared to 92.5% for humans. But using our RN-augmented network, we were able to show super-human performance of 95.5%.
To check the versatility of the RN, we also tested it on a very different language task. Specifically, we used the bAbI suite – a series of text-based question answering tasks. bAbI consists of a number of stories, each a variable number of sentences culminating in a question. For example, “Sandra picked up the football” and “Sandra went to the office” may lead to the question “Where is the football?” (answer: “office”).
The RN-augmented network scored more than 95% on 18 of the 20 bAbI tasks, similar to existing state-of-the-art models. Notably, it scored better on certain tasks – such as induction – which caused problems for these more established models.
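At the core of these results is the RN module itself, which considers every pair of objects and aggregates over the pairs: RN(O) = f_φ(Σ_{i,j} g_θ(o_i, o_j)). As a rough illustration, here is a minimal NumPy sketch, with tiny fixed linear layers standing in for the learned functions g_θ and f_φ (the layer sizes, weights and toy inputs are our own assumptions for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W_g = rng.normal(size=(8, 4))  # toy g: concatenated 2-d object pair (4-d) -> 8-d
W_f = rng.normal(size=(3, 8))  # toy f: 8-d aggregate -> 3-d output

def g(o_i, o_j):
    # Pairwise relation function: a stand-in linear layer plus nonlinearity.
    return np.tanh(W_g @ np.concatenate([o_i, o_j]))

def f(x):
    # Readout applied to the aggregated pairwise relations.
    return W_f @ x

def relation_network(objects):
    # Apply g to every ordered pair of objects, sum, then apply f:
    # RN(O) = f( sum_{i,j} g(o_i, o_j) )
    pair_sum = sum(g(o_i, o_j) for o_i in objects for o_j in objects)
    return f(pair_sum)

objects = [rng.normal(size=2) for _ in range(5)]
out = relation_network(objects)
```

Because the pairwise terms are summed, the output is invariant to the order in which the objects are presented – `relation_network(objects[::-1])` returns the same vector.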
Full results of all these tests and more are available in the paper.
Visual Interaction Networks
Another key part of relational reasoning involves predicting the future in a physical scene. From just a glance, humans can infer not only what objects are where, but also what will happen to them over the upcoming seconds, minutes and – in some cases – even longer. For example, if you kick a football against a wall, your brain predicts what will happen when the ball hits the wall and how the ball and the wall will move afterwards (the ball will ricochet at a speed proportional to the kick and – in most cases – the wall will remain where it is).
These predictions are guided by a sophisticated cognitive system for reasoning about objects and their physical interactions.
In this related work we developed the “Visual Interaction Network” (VIN) – a model that mimics this ability. The VIN is able to infer the states of multiple physical objects from just a few frames of video, and then use this to predict object positions many steps into the future. This differs from generative models, which might visually “imagine” the next few frames of a video. Instead, the VIN predicts how the underlying relative states of the objects evolve.
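The dynamics part of this idea – predict per-object state changes from summed pairwise interactions, then roll the prediction forward – can be sketched in a highly simplified form. In the sketch below a hand-coded, softened inverse-square attraction stands in for the learned pairwise interaction; the force law, time step and toy scene are our illustrative assumptions, not the model from the paper:

```python
import numpy as np

def pairwise_force(pos_i, pos_j):
    # Stand-in for a learned pairwise interaction: softened inverse-square
    # attraction pulling object i towards object j.
    delta = pos_j - pos_i
    dist = np.linalg.norm(delta) + 1e-3
    return delta / dist**3

def step(positions, velocities, dt=0.01):
    # Sum pairwise interactions per object, then take one Euler step.
    forces = np.zeros_like(positions)
    for i in range(len(positions)):
        for j in range(len(positions)):
            if i != j:
                forces[i] += pairwise_force(positions[i], positions[j])
    velocities = velocities + dt * forces
    positions = positions + dt * velocities
    return positions, velocities

def rollout(positions, velocities, steps=50):
    # Predict object positions many steps into the future from one state.
    trajectory = [positions]
    for _ in range(steps):
        positions, velocities = step(positions, velocities)
        trajectory.append(positions)
    return np.stack(trajectory)

positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
velocities = np.zeros_like(positions)
traj = rollout(positions, velocities)  # shape: (steps + 1, objects, 2)
```

In the VIN the equivalent of `pairwise_force` is learned from a few frames of video rather than hand-coded, which is what lets it predict the evolution of the underlying object states directly.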