Visual Relationship Detection with Language Priors


One Sentence Abstract

This study proposes a model that trains visual models for objects and predicates individually, combines them using language priors from semantic word embeddings to predict multiple relationships per image even when each relationship type has only a few training examples, and localizes the objects in each predicted relationship, which in turn improves content-based image retrieval.

Simplified Abstract

Researchers are working on a new way to understand how objects interact with each other in pictures. Think of it as trying to figure out what's happening between two things in a photo, like "man riding a bicycle" or "man pushing a bicycle." There are lots of different interactions, but most of them don't happen very often. In the past, researchers focused on only a few interactions because it was hard to get enough examples of all the different interactions.

This new approach is different because it trains models to identify objects (like "man" and "bicycle") and actions (like "riding" and "pushing") separately. Then, it combines those models to predict many interactions in a single photo. To make the predictions better, the researchers use language clues from how words are related in sentences. This helps the model learn which interactions are more likely to happen.
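The combination described above can be sketched in a few lines. The vocabulary, detector confidences, and two-dimensional "embeddings" below are toy stand-ins invented for illustration, not the paper's actual categories, learned weights, or word2vec vectors; the point is only the structure: visual scores from independently trained object and predicate models, modulated by a language prior that favors semantically plausible relationships.

```python
import numpy as np

# Hypothetical vocabulary and detector outputs (illustrative values only).
objects = ["man", "bicycle", "dog"]
predicates = ["riding", "pushing", "next to"]
object_scores = {"man": 0.9, "bicycle": 0.8, "dog": 0.1}
predicate_scores = {"riding": 0.6, "pushing": 0.3, "next to": 0.5}

# Toy 2-D word embeddings standing in for real semantic vectors.
embeddings = {
    "man": np.array([1.0, 0.2]),
    "bicycle": np.array([0.3, 1.0]),
    "dog": np.array([0.9, 0.1]),
    "riding": np.array([0.6, 0.8]),
    "pushing": np.array([0.5, 0.7]),
    "next to": np.array([0.1, 0.1]),
}

def language_prior(subj, pred, obj):
    """Crude stand-in for a learned language prior: a predicate whose
    embedding is close to the subject/object pair gets a higher prior."""
    pair = embeddings[subj] + embeddings[obj]
    pair = pair / np.linalg.norm(pair)
    pred_vec = embeddings[pred] / np.linalg.norm(embeddings[pred])
    return float(pair @ pred_vec)  # cosine similarity

def relationship_score(subj, pred, obj):
    # Independent visual evidence, fine-tuned by the language prior.
    visual = object_scores[subj] * predicate_scores[pred] * object_scores[obj]
    return visual * max(language_prior(subj, pred, obj), 0.0)

# Rank every candidate <subject, predicate, object> triple for the image.
ranked = sorted(
    ((s, p, o) for s in objects for p in predicates for o in objects if s != o),
    key=lambda t: relationship_score(*t),
    reverse=True,
)
print(ranked[0])
```

Because the object and predicate models are trained separately, the same "riding" model is shared across every relationship that uses it, which is what lets the approach cover rare combinations without needing examples of each one.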

The result? This method can learn about thousands of different interactions from just a few examples. It also helps identify where in the photo each interaction is happening. Lastly, understanding the relationships between objects in photos can improve how we search for and find images that we're looking for.

Study Fields

Main fields:

  • Visual Relationship Detection
  • Semantic Word Embeddings

Subfields:

  • Predicting relationships between objects
  • Object and predicate frequency
  • Model training and combination
  • Language priors and likelihood finetuning
  • Scalable relationship prediction
  • Object localization through bounding boxes
  • Content-based image retrieval improvement

Study Objectives

  • Develop a model for visual relationship detection that can predict multiple relationships per image
  • Train visual models for objects and predicates individually to overcome the limitation of insufficient training examples for all possible relationships
  • Utilize language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship
  • Demonstrate the model's ability to scale and predict thousands of types of relationships from a few examples
  • Localize the objects in the predicted relationships as bounding boxes in the image
  • Investigate the potential improvement of content-based image retrieval through understanding relationships
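On the localization objective: the model reports a bounding box for each object in a predicted relationship. One common way to then report a single region for the relationship as a whole (a convention from visual-phrase work, shown here as an assumption rather than the paper's exact output) is the union of the two boxes:

```python
def union_box(box_a, box_b):
    """Union of two (x1, y1, x2, y2) boxes: the smallest box containing
    both the subject's and the object's bounding boxes."""
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )

# e.g. a "man" box and a "bicycle" box for "man riding bicycle"
print(union_box((10, 10, 50, 80), (40, 30, 120, 90)))
```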

Conclusions

  • The authors propose a model that trains visual models for objects and predicates individually, then combines them to predict multiple relationships per image.
  • They leverage language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship, allowing the model to scale to predict thousands of types of relationships from a few examples.
  • The model also localizes the objects in the predicted relationships as bounding boxes in the image.
  • Understanding relationships can improve content-based image retrieval.
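To make the retrieval claim concrete, here is a minimal sketch of how predicted relationships could serve as an image index. The images and triples are invented for illustration; real systems (e.g. scene-graph retrieval) use richer matching, but the idea is the same: images sharing more of the query's relationships rank higher.

```python
from collections import defaultdict

# Hypothetical predicted relationships per image (illustrative only).
image_relationships = {
    "img1": {("man", "riding", "bicycle"), ("man", "wearing", "helmet")},
    "img2": {("man", "pushing", "bicycle")},
    "img3": {("dog", "next to", "bicycle")},
}

# Inverted index: relationship triple -> images containing it.
index = defaultdict(set)
for img, rels in image_relationships.items():
    for rel in rels:
        index[rel].add(img)

def retrieve(query_triples):
    """Rank images by how many of the query's relationships they contain."""
    scores = defaultdict(int)
    for triple in query_triples:
        for img in index.get(triple, ()):
            scores[img] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(retrieve([("man", "riding", "bicycle")]))
```

A query for "man riding bicycle" returns only images where that interaction was detected, rather than every image containing a man and a bicycle, which is the advantage relationships give over plain object detection.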
