Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
This paper introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Instead of mapping whole images or sentences into a common embedding space, the model embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space, and pairs a global ranking objective with a fragment alignment objective that learns direct associations between fragments across modalities...
One Sentence Abstract
"Developing a bidirectional retrieval model that embeds image object fragments and sentence dependency tree relations into a common space, improving image-sentence retrieval tasks and providing interpretable predictions through explicit fragment alignment."
Simplified Abstract
Researchers have developed a new way to find connections between images and sentences. Instead of only looking at the whole image or the whole sentence, this method focuses on smaller parts of each (objects in images and grammatical relationships between words in sentences). It uses a technique called multi-modal embedding to place information from both images and sentences into a shared space, which makes it easier to match them.
The approach also uses a new training signal, called a fragment alignment objective, which learns how these parts of images and sentences correspond to one another. In tests, the method found connections between images and sentences better than previous methods. Because the matching between parts is explicit, the method can also show why a particular image and sentence were linked.
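The core idea can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' code): two linear maps project image fragments (e.g. CNN features of detected objects) and sentence fragments (e.g. vectors encoding typed dependency relations) into one common space, where inner products score how well fragments align. All names and dimensions here are illustrative assumptions.

import torch
import torch.nn as nn

class FragmentEmbedder(nn.Module):
    """Projects image and sentence fragments into a shared embedding space."""
    def __init__(self, img_dim=4096, sent_dim=600, embed_dim=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)    # maps object CNN features
        self.sent_proj = nn.Linear(sent_dim, embed_dim)  # maps dependency-relation vectors

    def forward(self, img_frags, sent_frags):
        # img_frags: (num_objects, img_dim); sent_frags: (num_relations, sent_dim)
        v = torch.relu(self.img_proj(img_frags))    # embedded image fragments
        s = torch.relu(self.sent_proj(sent_frags))  # embedded sentence fragments
        return v @ s.T  # (num_objects, num_relations) fragment alignment scores

A whole image-sentence pair can then be scored by pooling this matrix (for example, averaging the thresholded fragment scores), which is what makes retrieval in both directions possible.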
Study Fields
Main fields:
- Multi-modal embedding
- Image and sentence retrieval
- Visual and natural language data
Subfields:
- Common embedding space
- Objects and typed dependency tree relations
- Ranking objective
- Fragment alignment objective
- Image-sentence retrieval tasks
- Inter-modal fragment alignment
- Interpretable predictions
Study Objectives
- Develop a model for bidirectional retrieval of images and sentences by embedding visual and natural language data in a multi-modal space
- Embed fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space, as opposed to directly mapping images or sentences into a common embedding space
- Incorporate a fragment alignment objective to learn direct associations between fragments across modalities (a minimal sketch of both training signals follows this list)
- Evaluate the performance of the model on image-sentence retrieval tasks and show that reasoning at both the global and fine-grained levels significantly improves performance
- Provide interpretable predictions by making the inferred inter-modal fragment alignment explicit
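To make the two training signals in the objectives above concrete, here is a hedged sketch assuming the fragment score matrix from the earlier example and a max-margin formulation. The paper's exact objective differs in details (for instance, it infers latent fragment correspondences rather than assuming fixed labels), so treat this as an illustration of the structure, not the definitive loss.

import torch

def fragment_alignment_loss(scores, labels, margin=1.0):
    # scores: (n_img_frags, n_sent_frags) inner products from the common space
    # labels: +1 where the fragments come from a corresponding image-sentence
    # pair, -1 otherwise (the paper treats the true correspondence as latent)
    return torch.clamp(margin - labels * scores, min=0).mean()

def global_ranking_loss(pair_scores, margin=1.0):
    # pair_scores: (N, N) image-sentence scores, pooled from fragment scores;
    # entry (k, l) scores image k against sentence l, so the diagonal holds
    # the true pairs, which should outrank every mismatched pair by a margin
    diag = pair_scores.diag().unsqueeze(1)                      # (N, 1) true-pair scores
    cost_s = torch.clamp(margin + pair_scores - diag, min=0)    # rank over sentences
    cost_i = torch.clamp(margin + pair_scores - diag.T, min=0)  # rank over images
    cost = cost_s + cost_i
    cost = cost - torch.diag(cost.diag())                       # drop the true-pair terms
    return cost.mean()

Training on a weighted sum of the two losses captures the paper's claim that reasoning at both the global (image-sentence) and fine-grained (fragment) level helps retrieval.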
Conclusions
- The introduced model utilizes a multi-modal embedding approach to perform bidirectional retrieval of images and sentences, focusing on fragments of images (objects) and fragments of sentences (typed dependency tree relations).
- Unlike previous models, this approach allows for the addition of a fragment alignment objective that learns to directly associate fragments across modalities, in addition to a ranking objective.
- Experimental evaluation demonstrates that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks.
- The model provides interpretable predictions since the inferred inter-modal fragment alignment is explicit (illustrated in the short snippet below).
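Because the fragment scores are explicit, interpretability comes almost for free. A hypothetical snippet, reusing the score matrix from the sketch above:

import torch

# scores: hypothetical (num_objects, num_relations) fragment score matrix,
# as produced by the FragmentEmbedder sketch earlier in this summary
scores = torch.randn(5, 3)
# for each sentence relation, the index of its best-matching detected object;
# these indices are the explicit inter-modal alignment the model exposes
best_object_per_relation = scores.argmax(dim=0)
print(best_object_per_relation)  # e.g. tensor([2, 0, 4])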