[1] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proc. IEEE, vol. 111, no. 3, pp. 257-276, 2023, doi: 10.1109/JPROC.2023.3238524.
[2] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2001, vol. 1, pp. I-511-I-518.
[3] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” J. Big Data, vol. 6, no. 1, pp. 1-48, 2019, doi: 10.1186/s40537-019-0197-0.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84-90, 2017, doi: 10.1145/3065386.
[5] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, “A survey of modern deep learning based object detection models,” Digit. Signal Process., vol. 126, p. 103514, 2022, doi: 10.1016/j.dsp.2022.103514.
[6] L. Liu et al., “Deep learning for generic object detection: A survey,” Int. J. Comput. Vis., vol. 128, pp. 261-318, 2020, doi: 10.1007/s11263-019-01247-4.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, pp. 303-338, 2010, doi: 10.1007/s11263-009-0275-4.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) results,” 2007. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[9] M. Everingham and J. Winn, “The PASCAL visual object classes challenge 2012 (VOC2012) development kit,” Pattern Anal. Stat. Model. Comput. Learn., Tech. Rep., 2012.
[10] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, pp. 211-252, 2015, doi: 10.1007/s11263-015-0816-y.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
[12] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in 13th European Conf. Comput. Vis. (ECCV), Zurich, Switzerland, 2014, pp. 740-755.
[13] A. Kuznetsova et al., “The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale,” Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956-1981, 2020, doi: 10.1007/s11263-020-01316-z.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Adv. Neural Inf. Process. Syst., vol. 25, pp. 1097-1105, 2012.
[15] A. Radford et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. Mach. Learn., 2021, pp. 8748-8763.
[16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770-778.
[18] I. H. Sarker, “Machine learning: Algorithms, real-world applications and research directions,” SN Comput. Sci., vol. 2, no. 3, p. 160, 2021, doi: 10.1007/s42979-021-00592-x.
[19] L. Alzubaidi et al., “Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions,” J. Big Data, vol. 8, pp. 1-74, 2021, doi: 10.1186/s40537-021-00444-8.
[20] F. Chen, J. Wei, B. Xue, and M. Zhang, “Feature fusion and kernel selective in Inception-v4 network,” Appl. Soft Comput., vol. 119, p. 108582, 2022, doi: 10.1016/j.asoc.2022.108582.
[21] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM Comput. Surv., vol. 54, no. 10s, pp. 1-41, 2022, doi: 10.1145/3505244.
[22] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “CSPNet: A new backbone that can enhance learning capability of CNN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 390-391.
[23] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523-3542, 2021, doi: 10.1109/TPAMI.2021.3059968.
[24] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Scaled-YOLOv4: Scaling cross stage partial network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13029-13038.
[25] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Int. Conf. Mach. Learn., 2019, pp. 6105-6114.
[26] M. Tan et al., “MnasNet: Platform-aware neural architecture search for mobile,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2820-2828.
[27] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR’05), 2005, vol. 1, pp. 886-893.
[28] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, pp. 91-110, 2004, doi: 10.1023/B:VISI.0000029664.99615.94.
[29] T. Nguyen, E.-A. Park, J. Han, D.-C. Park, and S.-Y. Min, “Object detection using scale invariant feature transform,” in Genetic and Evolutionary Computing: Proc. 7th Int. Conf. Genetic Evol. Comput. (ICGEC), Prague, Czech Republic, 2014, pp. 65-72.
[30] A. Mohan, C. Papageorgiou, and T. Poggio, “Example-based object detection in images by components,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 4, pp. 349-361, 2001, doi: 10.1109/34.917571.
[31] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. 2004 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2004, vol. 2, pp. II-506-II-513.
[32] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in 2008 IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1-8.
[33] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627-1645, 2010, doi: 10.1109/TPAMI.2009.167.
[34] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in 2010 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2241-2248.
[35] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580-587.
[36] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” Int. J. Comput. Vis., vol. 104, pp. 154-171, 2013, doi: 10.1007/s11263-013-0620-5.
[37] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541-551, 1989, doi: 10.1162/neco.1989.1.4.541.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904-1916, 2015, doi: 10.1109/TPAMI.2015.2389824.
[39] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440-1448.
[40] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675-678.
[41] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Adv. Neural Inf. Process. Syst., vol. 28, 2015.
[42] W. Wang et al., “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 568-578.
[43] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8759-8768.
[44] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “NAS-FPN: Learning scalable feature pyramid architecture for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7036-7045.
[45] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10781-10790.
[46] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” Adv. Neural Inf. Process. Syst., vol. 29, 2016.
[47] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961-2969.
[48] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1492-1500.
[49] K. Chen et al., “Hybrid task cascade for instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4974-4983.
[50] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6154-6162.
[51] S. Qiao, L.-C. Chen, and A. Yuille, “DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10213-10224.
[52] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834-848, 2017.
[53] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets: Time-Frequency Methods and Phase Space, Proc. Int. Conf., Marseille, France, 1987, pp. 286-297.
[54] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132-7141.
[55] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779-788.
[56] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7263-7271.
[57] L. Zhao and S. Li, “Object detection algorithm based on improved YOLOv3,” Electronics, vol. 9, no. 3, p. 537, 2020, doi: 10.3390/electronics9030537.
[58] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[59] O. E. Olorunshola, M. E. Irhebhude, and A. E. Evwiekpaefe, “A comparative study of YOLOv5 and YOLOv7 object detection algorithms,” J. Comput. Soc. Informatics, vol. 2, no. 1, pp. 1-12, 2023, doi: 10.33736/jcsi.5070.2023.
[60] H. Chen et al., “SSD object detection algorithm with multi-scale convolution feature fusion,” J. Front. Comput. Sci. Technol., vol. 13, no. 6, p. 1049, 2019.
[61] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2147-2154.
[62] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1026-1034.
[63] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980-2988.
[64] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in 14th European Conf. Comput. Vis. (ECCV), Amsterdam, The Netherlands, 2016, pp. 483-499.
[65] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional neural networks: Analysis, applications, and prospects,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 6999-7019, 2021, doi: 10.1109/TNNLS.2021.3084827.
[66] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
[67] D. Misra, “Mish: A self regularized non-monotonic activation function,” arXiv preprint arXiv:1908.08681, 2019.
[68] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Comput. Surv., vol. 55, no. 9, pp. 1-35, 2023, doi: 10.1145/3560815.
[69] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, vol. 1, pp. 4171-4186, doi: 10.48550/arXiv.1810.04805.
[70] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018. [Online]. Available: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf
[71] C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 1, pp. 5485-5551, 2020, doi: 10.5555/3455716.3455856.
[72] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10012-10022.
[73] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017, doi: 10.48550/arXiv.1704.04861.
[74] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510-4520.
[75] A. Howard et al., “Searching for MobileNetV3,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1314-1324.
[76] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6848-6856.
[77] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in Proc. European Conf. Comput. Vis. (ECCV), 2018, pp. 116-131.
[78] R. J. Wang, X. Li, and C. X. Ling, “Pelee: A real-time object detection system on mobile devices,” Adv. Neural Inf. Process. Syst., vol. 31, 2018.
[79] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, “DSOD: Learning deeply supervised object detectors from scratch,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1919-1927.
[80] C. Liu et al., “Progressive neural architecture search,” in Proc. European Conf. Comput. Vis. (ECCV), 2018, pp. 19-34.
[81] T.-J. Yang et al., “NetAdapt: Platform-aware neural network adaptation for mobile applications,” in Proc. European Conf. Comput. Vis. (ECCV), 2018, pp. 285-300.