Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering

Somak Aditya, Yezhou Yang,  Chitta Baral


Many vision and language tasks require commonsense reasoning beyond data-driven image and natural language processing. Here we adopt Visual Question Answering (VQA) as an example task, where a system is expected to answer a natural-language question about an image. Current state-of-the-art systems attempt to solve the task with deep neural architectures and achieve promising performance. However, the resulting systems are generally opaque, and they struggle to understand questions that require extra knowledge. In this paper, we present an explicit reasoning layer on top of a set of penultimate neural network based systems. The reasoning layer enables reasoning and answering questions where additional knowledge is required, and at the same time provides an interpretable interface to the end users. Specifically, the reasoning layer adopts a Probabilistic Soft Logic (PSL) based engine to reason over a basket of inputs: visual relations, the semantic parse of the question, and background ontological knowledge from word2vec and ConceptNet. Experimental analysis of the answers and the key evidential predicates generated on the VQA dataset validates our approach.

Full Paper:

Please download our paper from here.

Full Architecture:

The architecture used in this paper. In this example, the reasoning engine figures out that barn is the more likely answer, based on two pieces of evidence: i) the question asks for a building and barn is a building (ontological), ii) barn is more likely than church because it relates more closely (distributional) to other concepts in the image: horses and fence, detected from Dense Captions. Such ontological and distributional knowledge is obtained from ConceptNet and word2vec. Both are encoded as similarity metrics for seamless integration with PSL.
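As a rough illustration of the distributional evidence in this example, the sketch below scores each candidate answer by its average cosine similarity to the concepts detected in the image. The three-dimensional vectors are made-up stand-ins for word2vec embeddings, and the function names `cosine` and `relatedness` are hypothetical; this is not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def relatedness(candidate, scene_concepts, vectors):
    """Average similarity of a candidate to the scene concepts, clipped to [0, 1]."""
    sims = [max(0.0, cosine(vectors[candidate], vectors[c])) for c in scene_concepts]
    return sum(sims) / len(sims) if sims else 0.0

# Toy vectors, invented for illustration only (real word2vec vectors
# have hundreds of dimensions and different values).
toy_vectors = {
    "barn":   [0.9, 0.8, 0.1],
    "church": [0.2, 0.1, 0.9],
    "horses": [0.8, 0.9, 0.0],
    "fence":  [0.7, 0.6, 0.2],
}

scene = ["horses", "fence"]  # concepts assumed to come from dense captions
scores = {c: relatedness(c, scene, toy_vectors) for c in ("barn", "church")}
best = max(scores, key=scores.get)
print(best, scores)
```

Clipping negative similarities to zero keeps every score in [0, 1], which is the range a soft-logic engine such as PSL expects for truth values.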

A Complete Example:

We provide a complete step-by-step example starting from an image and a question to inferring the answer in our first blog post.

Motivating Results:

We design a Probabilistic Soft Logic based reasoning engine that reasons with various knowledge sources and with the structured information from an image and a question. The reasoning rules are designed in a generic manner, and the rule-base is inspired by fuzzy graph matching: i) the question and the scene information are converted into semantic graphs, ii) the question graph has a node labeled “?x”, and the rules are designed to find the candidate answer node that best matches “?x”.

The above rule-base suffices to answer a broad category of questions (we denote them as “specific”). Along with improving the accuracy on some of these questions, we are able to provide evidence from the scene and the knowledge graph to “explain” the answer produced by our reasoning engine.
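To make the matching of “?x” more concrete, here is a minimal sketch of how one soft rule could be evaluated under the Łukasiewicz relaxation that PSL uses for logical connectives. The rule shape (type match AND scene support implies answer), the truth values, and all names are assumptions made for illustration; the actual rule-base is described in the paper, and full PSL inference solves a convex optimization over all ground rules rather than scoring candidates one by one.

```python
def luk_and(a, b):
    """Lukasiewicz conjunction of two soft truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body, head):
    """How far the rule `body -> head` is from being satisfied."""
    return max(0.0, body - head)

def score_candidates(type_match, scene_support):
    """Truth of the hypothetical rule body for each candidate for ?x."""
    return {c: luk_and(type_match[c], scene_support[c]) for c in type_match}

# Invented soft truth values, for illustration only.
type_match = {"barn": 0.9, "church": 0.9, "tree": 0.1}     # candidate is a building
scene_support = {"barn": 0.8, "church": 0.3, "tree": 0.9}  # candidate fits the scene

scores = score_candidates(type_match, scene_support)
best = max(scores, key=scores.get)

# Penalty incurred if the engine assigned answer(barn) a truth value of only 0.5.
penalty = distance_to_satisfaction(scores["barn"], 0.5)
print(best, scores, penalty)
```

The body value acts as a lower bound on the truth value of the answer atom: assigning anything below it incurs a penalty proportional to the rule weight, which is how strongly supported candidates are pushed toward higher scores.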

Download Links:

Visual Genome Caption and manual relation annotation: TSV File.

  • In this file, we provide noun pairs and their corresponding relations from different captions (1st column: caption, 2nd column: noun pair, 3rd column: relation, 4th column: probable relations)
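A minimal sketch of reading this file, assuming it is tab-separated with the four columns listed above and no header row. The sample line, the `RelationRow` type, and the `read_relations` helper are hypothetical; the exact formatting of the noun-pair and probable-relations columns is not specified here, so they are kept as raw strings.

```python
import csv
import io
from typing import NamedTuple

class RelationRow(NamedTuple):
    caption: str
    noun_pair: str
    relation: str
    probable_relations: str

def read_relations(handle):
    """Parse tab-separated rows, skipping lines with fewer than four columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) < 4:
            continue  # malformed or blank line
        rows.append(RelationRow(*fields[:4]))
    return rows

# Made-up sample line standing in for the real file contents.
sample = io.StringIO("a horse stands near a fence\thorse, fence\tnear\tnear; beside\n")
rows = read_relations(sample)
print(rows[0].noun_pair, "->", rows[0].relation)
```

To read the downloaded file instead of the sample, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the TSV was saved.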



    title={Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering},
    author={Aditya, Somak and Yang, Yezhou and Baral, Chitta},