Mixed Precision Training

One Sentence Abstract

This research presents a methodology for training deep neural networks with half-precision floating-point numbers, which nearly halves memory requirements and increases computational speed while maintaining accuracy and requiring no changes to hyper-parameters, by implementing three techniques to prevent loss of critical information.

Simplified Abstract

Researchers have developed a new method to train large, complex neural networks without needing as much memory or computing power. This is important because making a network bigger usually improves its accuracy, but it also requires more resources. The new technique, called mixed precision training, stores numbers in a compact "half-precision" format that uses less space while still maintaining high accuracy.

To ensure that the network stays accurate, the researchers suggest three strategies. First, they keep a full-precision copy of the network's weights. Second, they scale the loss to preserve gradient values with small magnitudes. Lastly, they use arithmetic that combines full- and half-precision calculations, multiplying in half precision but accumulating results in full precision.
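
To make the first two strategies concrete, here is a minimal hand-written sketch of one mixed-precision training step in PyTorch. The toy elementwise model, the fixed scale of 1024, and the learning rate are illustrative assumptions rather than the paper's setup, and the sketch assumes a PyTorch build with FP16 CPU kernels (or tensors moved to a GPU); in practice this pattern is wrapped by utilities such as torch.cuda.amp.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy task: fit weights w so that (x * w).sum(dim=1) matches y.
x = torch.randn(32, 16, dtype=torch.float16)   # FP16 activations
y = torch.randn(32, dtype=torch.float16)       # FP16 targets
master_w = torch.randn(16)                     # technique 1: FP32 "master" weights

loss_scale = 1024.0                            # technique 2: fixed loss scale (2**10)
lr = 0.01

for step in range(100):
    w16 = master_w.half().requires_grad_()     # round master weights to FP16
    pred = (x * w16).sum(dim=1)                # forward pass in FP16
    diff = pred - y
    loss = (diff * diff).mean()
    (loss * loss_scale).backward()             # scale up so small gradients survive FP16

    with torch.no_grad():
        grad = w16.grad.float() / loss_scale   # unscale the gradient in FP32
        master_w -= lr * grad                  # update the FP32 master copy
```

Keeping the update on the FP32 copy matters because a small gradient times a small learning rate can fall below FP16's resolution at the weight's magnitude, so an FP16-only update would silently become a no-op.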

The researchers show that this new method works well with various tasks and large models that have more than 100 million parameters. By using half-precision, networks can be trained more quickly and with less space needed, making it easier and faster for scientists to build smarter artificial intelligence systems.

Study Fields

Main Field: Efficient Training of Deep Neural Networks

Subfields:

  1. Utilizing Half-Precision Floating-Point Numbers
  2. Memory Requirements Reduction
  3. Speeding up Arithmetic
  4. Preventing Loss of Critical Information
  5. Single-Precision Copy of Weights
  6. Gradient Accumulation in Single Precision
  7. Half-Precision Rounding
  8. Loss-Scaling
  9. Single-Precision Outputs and Half-Precision Conversion
  10. Model Architectures and Large Datasets

Study Objectives

  1. To explore the possibility of training deep neural networks using half-precision floating point numbers without losing model accuracy or modifying hyper-parameters.
  2. To reduce memory requirements by nearly half and speed up arithmetic on recent GPUs.
  3. To propose and test three techniques for preventing the loss of critical information when using half-precision format:
    • Maintaining a single-precision copy of weights that accumulates gradients after each optimizer step.
    • Loss-scaling to preserve gradient values with small magnitudes.
    • Half-precision arithmetic that accumulates into single-precision outputs, converted to half-precision before storage (see the sketch after this list).
  4. To demonstrate the effectiveness of the proposed methodology across a wide range of tasks, large-scale model architectures (exceeding 100 million parameters), and large datasets.
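
The third technique can be illustrated with a toy accumulation (a minimal sketch; the counts and values are arbitrary). Near 2048 the gap between adjacent FP16 values is 2.0, so adding 1.0 into an FP16 accumulator eventually stops making progress, while accumulating in FP32 and converting only the final result to FP16 preserves the answer:

```python
import torch

one = torch.ones((), dtype=torch.float16)

# Naive FP16 accumulation: once the running sum reaches 2048, adding 1.0
# no longer changes it, because the FP16 spacing at 2048 is 2.0.
acc16 = torch.zeros((), dtype=torch.float16)
for _ in range(3000):
    acc16 = acc16 + one
print(acc16)         # tensor(2048., dtype=torch.float16)

# Accumulate in FP32 and convert to FP16 only when storing the result.
acc32 = torch.zeros((), dtype=torch.float32)
for _ in range(3000):
    acc32 = acc32 + one.float()
print(acc32.half())  # tensor(3000., dtype=torch.float16)
```

This is the same reason the paper's products are accumulated into single-precision outputs before being rounded back to half precision for storage.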

Conclusions

  1. The authors present a methodology for training deep neural networks using half-precision floating point numbers, which significantly reduces memory requirements and increases arithmetic speed.
  2. They store weights, activations, and gradients in IEEE half-precision format, while recognizing that its narrower numerical range can lead to loss of critical information.
  3. To prevent such loss, they propose three techniques: a) maintaining a single-precision copy of weights, b) loss-scaling, and c) using half-precision arithmetic that accumulates into single-precision outputs (the underflow that motivates loss-scaling is sketched after this list).
  4. The methodology demonstrates effectiveness across various tasks, large-scale model architectures (over 100 million parameters), and big datasets.
  5. By nearly halving memory requirements and speeding up arithmetic, this methodology could improve the efficiency of training deep neural networks.
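
The underflow that loss-scaling guards against is easy to reproduce: FP16 cannot represent magnitudes below roughly 2^-24, so smaller gradient values flush to zero unless they are shifted into range first. A minimal sketch (the scale of 2^10 is an arbitrary illustrative choice):

```python
import torch

g = torch.tensor(2.0 ** -27)     # a small but meaningful FP32 gradient value
print(g.half())                  # tensor(0., dtype=torch.float16): underflow, information lost

scale = 2.0 ** 10
scaled = (g * scale).half()      # 2**-17 is exactly representable in FP16
print(scaled.float() / scale)    # tensor(7.4506e-09): unscale in FP32, value recovered
```

Frameworks now automate choosing and adjusting this factor (dynamic loss scaling), but the principle is the same.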
