Gaussian Error Linear Units (GELUs)

One Sentence Abstract

The Gaussian Error Linear Unit (GELU) activation function, which weights inputs by their value using the standard Gaussian cumulative distribution function, outperforms the ReLU and ELU activations across computer vision, natural language processing, and speech tasks.
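
As a concrete reading of that sentence, the short Python sketch below (an illustration of ours, not code from the paper) evaluates the exact definition GELU(x) = x·Φ(x), using the standard-library error function to compute the Gaussian CDF Φ.

```python
import math

def gaussian_cdf(x: float) -> float:
    """Standard Gaussian CDF Phi(x) via the error function: 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Gaussian Error Linear Unit: the input weighted by the probability mass Phi(x)."""
    return x * gaussian_cdf(x)

# Large positive inputs pass through almost unchanged; large negative inputs are suppressed.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu(x):+.4f}")
```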

Simplified Abstract

This research introduces a new method called the Gaussian Error Linear Unit (GELU) for neural networks. Think of a neural network as a tool that helps computers understand information, and an activation function as a step inside that tool that decides how strongly each piece of information is passed along to the next stage.

Traditional activation functions include ReLU and ELU. ReLU simply "switches on" for positive inputs and outputs zero for negative ones, while ELU lets negative inputs through as small, smoothly shrinking values. GELU, on the other hand, "weights" each input according to its value, a bit like scoring items in a list by how important they are. It does this with the standard Gaussian cumulative distribution function, a mathematical function that measures how much of a bell-shaped curve lies below a given value.
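
To make this contrast concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that evaluates ReLU, ELU with α = 1, and GELU on the same handful of inputs.

```python
import math

def relu(x: float) -> float:
    # Hard gate on the sign: negative inputs are dropped entirely.
    return max(0.0, x)

def elu(x: float, alpha: float = 1.0) -> float:
    # Smooth negative branch that saturates at -alpha.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x: float) -> float:
    # Weight the input by Phi(x), the standard Gaussian CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(f"{'x':>6} {'ReLU':>8} {'ELU':>8} {'GELU':>8}")
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:>6.1f} {relu(x):>8.3f} {elu(x):>8.3f} {gelu(x):>8.3f}")
```

Notice that ReLU discards a small negative input entirely, while GELU passes part of it through in proportion to how much Gaussian probability mass lies below it.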

To test the effectiveness of GELU, the researchers compared its performance to ReLU and ELU in various tasks, such as analyzing images (computer vision), understanding written text (natural language processing), and interpreting speech. The results showed that GELU outperformed the other methods, making it a more accurate and reliable tool for these tasks.

In summary, this study introduces the GELU activation, which improves the performance of neural networks in various applications and gives researchers a simple, more effective building block to work with.

Study Fields

Main fields:

  • Neural Networks
  • Activation Functions

Subfields:

  • Gaussian Error Linear Unit (GELU)
  • Standard Gaussian Cumulative Distribution Function (Φ(x))
  • ReLU (Rectified Linear Unit)
  • Empirical Evaluation
  • Computer Vision Tasks
  • Natural Language Processing Tasks
  • Speech Tasks

Study Objectives

  • Develop a high-performing neural network activation function called Gaussian Error Linear Unit (GELU)
  • Compare the performance of GELU, ReLU, and ELU activations in computer vision, natural language processing, and speech tasks
  • Demonstrate the improvement of GELU nonlinearity in empirical evaluations over ReLU and ELU activations

Conclusions

  • The Gaussian Error Linear Unit (GELU) is a high-performing neural network activation function that improves upon the ReLU and ELU activations.
  • GELU weights inputs by their value, whereas ReLU gates inputs by their sign. The GELU function, defined as xΦ(x), uses the standard Gaussian cumulative distribution function, Φ(x), to achieve this (a minimal sketch of GELU in a small network follows this list).
  • The study demonstrates that GELU outperforms ReLU and ELU across various computer vision, natural language processing, and speech tasks.
  • The empirical evaluation suggests that GELU is a promising activation function for neural networks, offering potential improvements in a range of applications.
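
As a minimal sketch of that last point (our own toy example: the layer sizes, weights, and random inputs are arbitrary assumptions, not the paper's experimental setup), the NumPy snippet below swaps GELU in for ReLU as the hidden-layer nonlinearity of a tiny fully connected network, using the tanh-based approximation of xΦ(x) given in the original paper.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """GELU via the tanh approximation from the original paper:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
# A toy two-layer network: 8 inputs -> 16 hidden units -> 3 outputs (sizes are arbitrary).
W1, b1 = rng.normal(size=(8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)) * 0.1, np.zeros(3)

def forward(x: np.ndarray, activation) -> np.ndarray:
    hidden = activation(x @ W1 + b1)   # the only change between variants is this nonlinearity
    return hidden @ W2 + b2

x = rng.normal(size=(4, 8))            # a batch of 4 example inputs
print("ReLU outputs:", forward(x, relu)[0])
print("GELU outputs:", forward(x, gelu)[0])
```

The only difference between the two variants is the activation applied to the hidden layer, which is why GELU can be evaluated as a direct replacement for ReLU or ELU in existing architectures.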
