Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

One Sentence Abstract

A new method uses readily available web images and their tags, alongside fully annotated datasets, to train a deep neural network for cross-modal retrieval between visual data and natural language, achieving significant performance gains over state-of-the-art approaches.

Simplified Abstract

Researchers are working on improving how computers connect images and words, a tricky task in multimedia. Previous methods were limited because they learned from a small number of images with exact word descriptions, which can make the results biased and unreliable. To get around this, the researchers turned to the internet, where images are plentiful but their descriptions are messy. They developed a two-step process that teaches computers to better understand the connections between images and words.

The new method outperforms earlier ones in experiments on two standard tests. This advancement matters because it helps computers match pictures with the words that describe them, so that searching for images with text, or for text with images, becomes more accurate and reliable.

Study Fields

Main fields:

  • Multimedia retrieval
  • Deep learning
  • Webly supervised learning

Subfields:

  • Cross-modal retrieval between visual data and natural language description
  • Image-text retrieval
  • Deep representations aligned across modalities
  • Small-scale datasets
  • Annotating millions of images with sentences
  • Bias in model training
  • Web images with noisy annotations
  • Joint representation learning
  • Visual-semantic joint embedding
  • Two-stage approach for image-text retrieval
  • Supervised pair-wise ranking loss (see the sketch after this list)
  • Performance gain in image-text retrieval
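
To make the "supervised pair-wise ranking loss" item concrete, here is a minimal sketch of the bidirectional hinge-based triplet loss commonly used to align image and text embeddings in a joint space. This assumes a PyTorch-style setup; the function and variable names are illustrative, not the authors' code.

```python
import torch

def pairwise_ranking_loss(im, txt, margin=0.2):
    """im, txt: L2-normalized (batch, dim) embeddings of matching image-text pairs."""
    scores = im @ txt.t()              # pairwise cosine similarities
    pos = scores.diag().view(-1, 1)    # similarity of each matching pair
    # hinge cost when a non-matching caption (or image) scores within
    # `margin` of the matching one, in both retrieval directions
    cost_txt = (margin + scores - pos).clamp(min=0)
    cost_im = (margin + scores - pos.t()).clamp(min=0)
    # zero the diagonal so matching pairs are not penalized against themselves
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_im.masked_fill(mask, 0).sum()
```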

Study Objectives

  • To address the challenge of cross-modal retrieval between visual data and natural language description in multimedia
  • To investigate the possibility of using web images with noisy annotations for learning robust image-text joint representation
  • To propose a two-stage approach that augments a typical supervised pair-wise ranking loss formulation with weakly annotated web images, learning a more robust visual-semantic embedding (see the sketch after this list)
  • To evaluate the performance of the proposed method on two standard benchmark datasets and compare it with state-of-the-art approaches
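
The two-stage objective above can be outlined in code. This is a hypothetical sketch only: the abstract does not specify the training schedule or losses, so the web-then-clean ordering, the loaders, and the `encode_image`/`encode_text` methods are all assumptions, and `pairwise_ranking_loss` is the sketch shown earlier.

```python
def train_two_stage(model, web_loader, clean_loader, opt):
    """Hypothetical outline: a weakly supervised pass over web image-tag
    pairs, followed by supervised fine-tuning on annotated image-sentence
    pairs (the paper's exact schedule may differ)."""
    # Stage 1: abundant web images with noisy tags, treated as short texts
    for images, tags in web_loader:
        loss = pairwise_ranking_loss(model.encode_image(images),
                                     model.encode_text(tags))
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Stage 2: clean image-sentence pairs from a fully annotated benchmark
    for images, sentences in clean_loader:
        loss = pairwise_ranking_loss(model.encode_image(images),
                                     model.encode_text(sentences))
        opt.zero_grad()
        loss.backward()
        opt.step()
```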

Conclusions

  • The article addresses the challenge of cross-modal retrieval between visual data and natural language descriptions in multimedia.
  • It highlights that recent image-text retrieval methods are constrained by the limited coverage of small annotated datasets and the high cost of annotating larger ones.
  • The authors propose a solution inspired by webly supervised learning, leveraging readily available web images and their tags together with fully annotated datasets to learn a robust image-text joint representation.
  • The proposed approach uses a two-stage method that augments a typical supervised pair-wise ranking loss-based formulation with weakly annotated web images for better visual-semantic embedding.
  • Experiments on two standard benchmark datasets show significant performance gains in image-text retrieval compared to state-of-the-art approaches.
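
Retrieval quality on such benchmarks is typically reported as Recall@K, the fraction of queries whose correct match appears among the top K results. Below is a minimal NumPy sketch for the image-to-text direction, assuming one matching caption per image (names are illustrative).

```python
import numpy as np

def recall_at_k(im, txt, k=10):
    """im, txt: (n, dim) L2-normalized embeddings where txt[i] matches im[i]."""
    sims = im @ txt.T                    # image-to-text similarity matrix
    order = np.argsort(-sims, axis=1)    # caption indices ranked best-first
    # a hit means the true caption appears among the top-k retrieved captions
    hits = (order[:, :k] == np.arange(len(im))[:, None]).any(axis=1)
    return float(hits.mean())
```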
