Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

One Sentence Abstract

"Developing a bidirectional retrieval model that embeds image object fragments and sentence dependency tree relations into a common space, improving image-sentence retrieval tasks and providing interpretable predictions through explicit fragment alignment."

Simplified Abstract

Researchers have developed a new way to find connections between images and sentences. Instead of looking only at the whole image or the whole sentence, the method focuses on smaller parts of each: objects in images and relationships between words in sentences. It uses a technique called multi-modal embedding to map information from both images and sentences into a common space, which makes it easier to find links between them.

The approach also introduces a new objective, called fragment alignment, which learns how these parts of images and sentences correspond to one another. In experiments, the method retrieved matching sentences for images (and images for sentences) better than previous methods. Because the learned alignment between fragments is explicit, the method also shows which parts of an image and a sentence were matched, making its predictions easier to interpret.
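To make this concrete, here is a minimal sketch in Python/NumPy of what "embedding fragments into a common space" looks like: image fragments (object detections) and sentence fragments (dependency relations) are projected with linear maps into one space, and every fragment pair is scored with an inner product. The dimensions, random weights, toy inputs, and the averaging used for the global score are illustrative assumptions, not values or formulas from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (chosen for illustration, not taken from the paper).
IMG_DIM, SENT_DIM, COMMON_DIM = 4096, 600, 1000

# Linear maps that project each modality into the common embedding space.
W_img = rng.normal(scale=0.01, size=(COMMON_DIM, IMG_DIM))
W_sent = rng.normal(scale=0.01, size=(COMMON_DIM, SENT_DIM))

def embed_image_fragments(object_features):
    """Project features of detected objects (image fragments)."""
    return object_features @ W_img.T          # (num_objects, COMMON_DIM)

def embed_sentence_fragments(relation_features):
    """Project features of typed dependency relations (sentence fragments)."""
    return relation_features @ W_sent.T       # (num_relations, COMMON_DIM)

# Toy input: 3 detected objects and 2 dependency relations.
objects = rng.normal(size=(3, IMG_DIM))
relations = rng.normal(size=(2, SENT_DIM))

v = embed_image_fragments(objects)
s = embed_sentence_fragments(relations)

# Fragment alignment scores: inner product between every object/relation pair.
fragment_scores = v @ s.T                     # shape (3, 2)

# A crude global image-sentence score: average over all fragment pairs.
global_score = fragment_scores.mean()
print(fragment_scores.shape, float(global_score))
```

In the actual model the projections are learned from data and the global image-sentence score is computed from the fragment scores in a more careful way; the sketch only shows the shape of the computation.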

Study Fields

Main fields:

  • Multi-modal embedding
  • Image and sentence retrieval
  • Visual and natural language data

Subfields:

  • Common embedding space
  • Objects and typed dependency tree relations
  • Ranking objective
  • Fragment alignment objective
  • Image-sentence retrieval tasks
  • Inter-modal fragment alignment
  • Interpretable predictions

Study Objectives

  • Develop a model for bidirectional retrieval of images and sentences by embedding visual and natural language data in a multi-modal space
  • Embed fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space, as opposed to directly mapping images or sentences into a common embedding space
  • Incorporate a fragment alignment objective to learn direct associations between fragments across modalities (a schematic sketch of this objective, together with the ranking objective, follows this list)
  • Evaluate the performance of the model on image-sentence retrieval tasks and show that reasoning at both the global and fine-grained levels significantly improves performance
  • Provide interpretable predictions by making the inferred inter-modal fragment alignment explicit
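The two training signals referred to above and in the Conclusions, a ranking objective over whole image-sentence pairs and a fragment alignment objective over fragment pairs, can be sketched roughly as below. This is a generic hinge-loss formulation written for illustration; the exact losses, normalizations, and labelling scheme used in the paper differ.

```python
import numpy as np

def ranking_loss(global_scores, margin=1.0):
    """Hinge-style ranking objective on a matrix of global image-sentence scores.

    global_scores[i, j] is the compatibility of image i with sentence j, and
    matching pairs sit on the diagonal. This is a generic margin formulation,
    not the exact loss from the paper.
    """
    n = global_scores.shape[0]
    pos = np.diag(global_scores)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Rank the true sentence above other sentences for image i ...
            loss += max(0.0, margin - pos[i] + global_scores[i, j])
            # ... and the true image above other images for sentence j.
            loss += max(0.0, margin - pos[j] + global_scores[i, j])
    return loss / (n * (n - 1))

def fragment_alignment_loss(fragment_scores, same_pair, margin=1.0):
    """Hinge loss that pushes fragment scores up for fragments drawn from a
    matching image-sentence pair (same_pair = +1) and down otherwise (-1).
    """
    return np.maximum(0.0, margin - same_pair * fragment_scores).mean()

# Toy example with 3 images and their 3 matching sentences.
rng = np.random.default_rng(1)
scores = rng.normal(size=(3, 3))
print(ranking_loss(scores))

frag = rng.normal(size=(4, 5))
labels = np.where(rng.random((4, 5)) > 0.5, 1.0, -1.0)
print(fragment_alignment_loss(frag, labels))
```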

Conclusions

  • The introduced model utilizes a multi-modal embedding approach to perform bidirectional retrieval of images and sentences, focusing on fragments of images (objects) and fragments of sentences (typed dependency tree relations).
  • Unlike previous models, this approach allows for the addition of a fragment alignment objective that learns to directly associate fragments across modalities, in addition to a ranking objective.
  • Experimental evaluation demonstrates that reasoning both at the global level of images and sentences and at the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks (see the evaluation sketch after this list).
  • The model provides interpretable predictions since the inferred inter-modal fragment alignment is explicit.
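As a reference point for how image-sentence retrieval is typically measured, the sketch below computes Recall@K from a matrix of image-sentence compatibility scores. It assumes one ground-truth sentence per image for simplicity; standard benchmarks pair each image with several captions and usually also report median rank.

```python
import numpy as np

def recall_at_k(score_matrix, k):
    """Recall@K for image-to-sentence retrieval.

    score_matrix[i, j] is the model's score for image i and sentence j; the
    correct sentence for image i is assumed to be sentence i in this toy setup.
    """
    n = score_matrix.shape[0]
    hits = 0
    for i in range(n):
        # Rank sentences for image i from highest to lowest score.
        ranked = np.argsort(-score_matrix[i])
        if i in ranked[:k]:
            hits += 1
    return hits / n

# Toy example: random scores for 5 images and 5 sentences.
rng = np.random.default_rng(2)
scores = rng.normal(size=(5, 5))
print(recall_at_k(scores, k=1), recall_at_k(scores, k=5))
```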
