Adversarial NLI: A New Benchmark for Natural Language Understanding

One Sentence Abstract

This study introduces a new large-scale NLI benchmark dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure; models trained on it achieve state-of-the-art performance on popular benchmarks while the dataset itself exposes the weaknesses of current models, and the collection method can evolve continuously as a moving target for NLU.

Simplified Abstract

Researchers have created a new, large-scale dataset to help improve the understanding of how computers process human language. They collected this data using a special method that involved people and computers working together in an iterative process. This new dataset helped improve the performance of existing language processing models, while also revealing their limitations.

The method used in this study is like a game played over several rounds: humans write examples intended to fool the computer models, the examples that succeed are verified by other people, and the models are then retrained on the collected data before the next round begins. This adversarial collaboration steadily makes the computers better at understanding human language.
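To make the round-based procedure concrete, here is a minimal sketch in Python. It is illustrative only: the names (annotator.write_adversarial, model.predict, verifier.agrees, train) are assumptions standing in for the study's actual annotation tooling, which is not specified at this level of detail.

    def collect_round(model, contexts, annotators, verifiers):
        """Collect one round of adversarial examples against the current model."""
        new_examples = []
        for context in contexts:
            for annotator in annotators:
                # The annotator writes a hypothesis intended to fool the model.
                hypothesis, target_label = annotator.write_adversarial(context)
                predicted_label = model.predict(context, hypothesis)
                if predicted_label != target_label:
                    # The model was fooled; other humans verify the intended label.
                    if all(v.agrees(context, hypothesis, target_label) for v in verifiers):
                        new_examples.append((context, hypothesis, target_label))
        return new_examples

    def never_ending_loop(model, num_rounds, contexts, annotators, verifiers, train):
        """Alternate collection and retraining, producing a 'moving target'."""
        dataset = []
        for _ in range(num_rounds):
            dataset += collect_round(model, contexts, annotators, verifiers)
            # Retrain on everything collected so far, so the next round
            # targets a stronger model.
            model = train(dataset)
        return model, dataset

Because only examples that fool the current model survive verification, each round's data concentrates on whatever the strongest available model still gets wrong.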

The main finding of this research is that the new dataset, collected through this method, can help make language processing models more accurate and reliable. This is important because many applications depend on machines correctly interpreting human language, and a benchmark that keeps challenging models gives a truer measure of their abilities.

This study is significant because it introduces a new approach to creating language processing datasets. Instead of a fixed benchmark, this method creates a "moving target" that constantly challenges the models and prevents the benchmark itself from saturating or becoming outdated. This approach yields a more durable and reliable measure of progress, making it a valuable contribution to the field of natural language understanding.

Study Fields

Main fields:

  • Natural Language Understanding (NLU)
  • Natural Language Processing (NLP)

Subfields:

  • Dataset collection
  • Adversarial human-and-model-in-the-loop procedure
  • State-of-the-art performance
  • Shortcomings of current models
  • Non-expert annotators
  • Never-ending learning scenario

Study Objectives

  • Develop a new large-scale Natural Language Inference (NLI) benchmark dataset
  • Collect data via an iterative, adversarial human-and-model-in-the-loop procedure
  • Demonstrate improved performance of trained models on popular NLI benchmarks using the new dataset
  • Highlight challenges posed by the new dataset
  • Analyze shortcomings of current state-of-the-art models
  • Show that non-expert annotators can identify weaknesses in models
  • Propose a never-ending learning scenario for the data collection method, making it a dynamic target for Natural Language Understanding (NLU) rather than a static benchmark

Conclusions

  • A new large-scale Natural Language Inference (NLI) benchmark dataset is introduced, collected using an iterative, adversarial human-and-model-in-the-loop procedure.
  • Training models on this new dataset results in state-of-the-art performance on various popular NLI benchmarks, while the new dataset itself poses a considerably harder challenge to those models.
  • The new dataset highlights the limitations of current state-of-the-art models and demonstrates that non-expert annotators can effectively identify their weaknesses.
  • The data collection method provides a dynamic approach that can be applied in a never-ending learning scenario, continually challenging Natural Language Understanding (NLU) systems instead of becoming obsolete quickly.
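For readers who want to inspect the resulting benchmark, the short sketch below shows one way to load it, assuming the dataset is distributed on the Hugging Face hub under the name "anli" via the datasets library; the split and field names shown follow that release and should be checked against the actual distribution.

    # A minimal usage sketch, assuming the dataset is available on the
    # Hugging Face hub as "anli" (pip install datasets).
    from datasets import load_dataset

    anli = load_dataset("anli")  # rounds R1-R3: train_r1 ... test_r3

    example = anli["train_r1"][0]
    print(example["premise"])     # the context passage
    print(example["hypothesis"])  # the human-written adversarial statement
    print(example["label"])       # 0 = entailment, 1 = neutral, 2 = contradiction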
