A review of object recognition models based on deep learning

Article type: Review article

Authors

  • Mohsen Norouzi 1
  • Hamid Hassanpour 1
  • Ali Ghanbari 2

1 Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran.

2 Department of Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran.

Abstract

Object detection is the task of classifying and localizing objects in an image or video, and it has attracted growing attention in recent years owing to its wide range of applications. This article reviews recent advances in deep-learning-based object recognition. It provides an overview of the benchmark datasets and evaluation metrics commonly used in detection, presents the main architectures applied to the object recognition problem, and reviews modern lightweight classification models. Finally, the performance of these architectures is compared across multiple benchmarks.
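
As a concrete illustration of the evaluation metrics mentioned above, the short sketch below computes intersection over union (IoU), the box-overlap score on which benchmark measures such as mean average precision (mAP) are built. This is a minimal sketch rather than code from the article; the (x1, y1, x2, y2) box layout and the function name are illustrative assumptions.

# Minimal IoU sketch (Python). Detection benchmarks typically count a
# predicted box as correct when its IoU with a ground-truth box exceeds
# a threshold such as 0.5. Boxes are assumed to be (x1, y1, x2, y2)
# with x1 < x2 and y1 < y2.
def iou(box_a, box_b):
    """Return the intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width/height clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction shifted slightly off the ground truth.
print(iou((10, 10, 50, 50), (15, 12, 55, 52)))  # ~0.71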

Keywords

  • Object detection and identification
  • Convolutional neural networks (CNN)
  • Lightweight networks
  • Deep learning
  • Image processing
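
To make the surveyed models concrete, the sketch below runs a COCO-pretrained Faster R-CNN, one of the two-stage architectures this review covers, through torchvision's detection API. It assumes a recent torch/torchvision install; the random input tensor and the 0.5 confidence threshold are placeholders, not settings taken from the article.

# Minimal sketch: inference with a pretrained two-stage detector.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained weights
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    output = model([image])[0]  # dict with 'boxes', 'labels', 'scores'

# Keep detections above an arbitrary confidence threshold.
keep = output["scores"] > 0.5
print(output["boxes"][keep], output["labels"][keep])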