Res2Net: A New Multi-scale Backbone Architecture


One Sentence Abstract

The Res2Net building block is proposed for CNNs, enhancing multi-scale feature representation by constructing hierarchical residual-like connections within a single residual block, resulting in consistent performance gains over baseline models on various vision tasks and datasets.
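
To make the idea concrete, the snippet below is a minimal PyTorch sketch of the hierarchical residual-like connections inside one such block: the features produced by a 1x1 convolution are split into groups, each group after the first passes through a 3x3 convolution that also receives the previous group's output, and the results are concatenated and fused. The class name Res2NetBlock, the scale of 4, and the channel widths are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the Res2Net split-and-connect pattern, assuming PyTorch.
# `scale`, channel widths, and the surrounding bottleneck layout are illustrative.
import torch
import torch.nn as nn


class Res2NetBlock(nn.Module):
    """Bottleneck-style block with hierarchical residual-like connections."""

    def __init__(self, in_channels: int, out_channels: int, scale: int = 4):
        super().__init__()
        assert out_channels % scale == 0, "out_channels must divide evenly by scale"
        self.scale = scale
        width = out_channels // scale  # channels per subset after the split

        # 1x1 conv produces the feature map that is split into `scale` subsets.
        self.conv_in = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn_in = nn.BatchNorm2d(out_channels)

        # One 3x3 conv per subset except the first, which passes through unchanged.
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1)
        )
        self.bns = nn.ModuleList(nn.BatchNorm2d(width) for _ in range(scale - 1))

        # 1x1 conv fuses the concatenated subsets back together.
        self.conv_out = nn.Conv2d(out_channels, out_channels, kernel_size=1, bias=False)
        self.bn_out = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Plain residual shortcut; a strided projection would be needed in a real stage.
        self.shortcut = (
            nn.Identity()
            if in_channels == out_channels
            else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn_in(self.conv_in(x)))
        subsets = torch.chunk(out, self.scale, dim=1)  # split into `scale` groups

        ys = [subsets[0]]  # first subset: identity, no extra conv
        prev = None
        for i, (conv, bn) in enumerate(zip(self.convs, self.bns), start=1):
            # Each subset is added to the previous subset's output before its own
            # 3x3 conv, so the effective receptive field grows with every step.
            inp = subsets[i] if prev is None else subsets[i] + prev
            prev = self.relu(bn(conv(inp)))
            ys.append(prev)

        out = self.bn_out(self.conv_out(torch.cat(ys, dim=1)))
        return self.relu(out + self.shortcut(x))
```

Because each 3x3 convolution operates on only a fraction of the channels and feeds into the next, outputs with several different receptive-field sizes coexist inside a single block, at a compute cost comparable to a standard bottleneck.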

Simplified Abstract

Researchers are trying to improve how computers identify and understand different objects in images. They've found that using a technique called 'Res2Net' makes their computer models better at recognizing objects at various scales.

Traditional methods have a hard time handling multiple scales, but 'Res2Net' is a clever way to connect different parts of the computer model so that it can see the whole picture better. The technique splits the features inside each building block into small groups and chains them together, so a single block takes in fine details and larger structures at the same time, much as our eyes do.

This 'Res2Net' method can be added to some of the best existing models, such as ResNet, ResNeXt, and DLA, and it consistently improves their performance on different tasks. The researchers tested 'Res2Net' on a variety of images, including common objects and complex scenes, and found that it consistently outperforms the unmodified baseline models.

In summary, this study introduces a new technique called 'Res2Net' that significantly improves a computer's ability to recognize and understand objects in images. By incorporating this technique into existing models, researchers can expect better and more accurate results in a variety of tasks, such as object detection and salient object detection.

Study Fields

Main fields:

  • Convolutional Neural Networks (CNNs)
  • Multi-scale representation
  • Computer Vision Tasks

Subfields:

  • Backbone CNNs
  • Layer-wise representation of multi-scale features
  • Residual-like connections
  • Hierarchical connections within residual blocks
  • Receptive fields
  • Model Evaluation
  • Datasets (CIFAR-100, ImageNet)
  • Object Detection
  • Class Activation Mapping
  • Salient Object Detection

Study Objectives

  • Investigate the importance of representing features at multiple scales in various vision tasks.
  • Examine the multi-scale representation ability of recent advances in backbone convolutional neural networks (CNNs).
  • Identify limitations of existing methods in representing multi-scale features in a layer-wise manner.
  • Propose a novel building block for CNNs, named Res2Net, that constructs hierarchical residual-like connections within one single residual block.
  • Demonstrate how Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
  • Show that the Res2Net block can be plugged into state-of-the-art backbone CNN models, such as ResNet, ResNeXt, and DLA (a sketch of such a swap follows this list).
  • Evaluate the performance of Res2Net on these models and demonstrate consistent performance gains over baseline models on widely-used datasets like CIFAR-100 and ImageNet.
  • Conduct ablation studies and experimental results on representative computer vision tasks, such as object detection, class activation mapping, and salient object detection, to verify the superiority of Res2Net over state-of-the-art baseline methods.
  • Share the source code and trained models on a publicly available website (https://mmcheng.net/res2net/).
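
As a rough illustration of that plug-in property, the sketch below stacks the Res2NetBlock from the earlier snippet the way ResNet stacks its bottleneck blocks. The make_stage helper and the stage layout are hypothetical, for illustration only; the authors' released code and models are at the project page above.

```python
# Hypothetical sketch: assembling Res2NetBlock (defined in the earlier snippet)
# into a ResNet-style stage. Stride/downsampling handling is omitted for brevity.
import torch.nn as nn


def make_stage(in_channels: int, out_channels: int, num_blocks: int) -> nn.Sequential:
    """Stack Res2Net blocks the way ResNet stacks bottleneck blocks."""
    blocks = [Res2NetBlock(in_channels, out_channels)]
    blocks += [Res2NetBlock(out_channels, out_channels) for _ in range(num_blocks - 1)]
    return nn.Sequential(*blocks)


# Example: a stage shaped like ResNet-50's conv3_x, but built from Res2Net blocks.
stage = make_stage(in_channels=256, out_channels=512, num_blocks=4)
```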

Conclusions

  • The paper proposes a novel building block for CNNs called Res2Net, which improves the multi-scale feature representation at a granular level by constructing hierarchical residual-like connections within a single residual block.
  • The Res2Net block increases the range of receptive fields for each network layer and can be integrated into state-of-the-art backbone CNN models such as ResNet, ResNeXt, and DLA.
  • The authors demonstrate consistent performance gains over baseline models on widely-used datasets like CIFAR-100 and ImageNet.
  • Ablation studies and experimental results on computer vision tasks like object detection, class activation mapping, and salient object detection further confirm the superiority of Res2Net over state-of-the-art baseline methods.
  • The source code and trained models are available at https://mmcheng.net/res2net/.

