Algorithms only know what they are explicitly told or what they can learn from data. As a result, machine learning systems for image captioning typically describe only what they can ‘see’ in a picture, which limits their usefulness: they merely restate the obvious. This research team experiments with a way to draw meaning from the text of an accompanying article, so that captions can offer richer context.
- Ali Furkan Biten, Computer Vision Center, UAB, Spain
- Lluís Gómez, Computer Vision Center, UAB, Spain
- Marçal Rusiñol, Computer Vision Center, UAB, Spain
- Dimosthenis Karatzas, Computer Vision Center, UAB, Spain
‘Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. For this we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image. Our model is able to selectively draw information from the article guided by visual cues, and to dynamically extend the output dictionary to out-of-vocabulary named entities that appear in the context source. Furthermore we introduce “GoodNews”, the largest news image captioning dataset in the literature and demonstrate state-of-the-art results.’
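The idea of dynamically extending the output dictionary to out-of-vocabulary named entities can be illustrated with a minimal sketch: the captioner emits placeholder tokens for entity types, which are then filled in with named entities drawn from the associated article, ranked here by a toy attention score. The function name, data structures, and example entities below are illustrative assumptions, not the authors' actual implementation.

```python
def fill_placeholders(template_tokens, article_entities):
    """Replace entity-type placeholders (e.g. 'PERSON_') with the
    highest-scoring article entity of the same type, without reuse.

    article_entities: list of dicts with 'text', 'type', and a
    hypothetical attention 'score' (all names are illustrative).
    """
    used = set()
    caption = []
    for tok in template_tokens:
        if tok.endswith("_"):  # placeholder token such as 'PERSON_'
            etype = tok[:-1]
            # Candidates of the matching type, best-attended first.
            candidates = sorted(
                (e for e in article_entities
                 if e["type"] == etype and e["text"] not in used),
                key=lambda e: e["score"], reverse=True)
            if candidates:
                caption.append(candidates[0]["text"])
                used.add(candidates[0]["text"])
                continue
        caption.append(tok)
    return " ".join(caption)

# Toy example: entities extracted from a hypothetical news article.
entities = [
    {"text": "Angela Merkel", "type": "PERSON", "score": 0.9},
    {"text": "Berlin", "type": "GPE", "score": 0.7},
]
template = ["PERSON_", "speaks", "in", "GPE_"]
print(fill_placeholders(template, entities))
# → Angela Merkel speaks in Berlin
```

In the full model, the attention scores would come from visual cues rather than being fixed, so the same template could yield different captions for different images of the same article.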
SEE FULL PAPER Online repository [free]