Predicting the Performance of Regular CUDA Programs by Machine Learning Methods

Article type: Research article

Authors

  • Ali Riahi 1
  • Abdorreza Savadi 2
  • Mahmoud Naghibzadeh 2

1 Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran

2 Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
Abstract

Predicting GPU performance is a challenge for CUDA programmers and is widely useful in tuning their programs, since the overheads of a CUDA program may make running it on a GPU uneconomical. Performance prediction helps the programmer tune programs more deliberately and locate the optimal tuning point accurately instead of by trial and error. A model that can predict performance can assist programmers in manual tuning and source-to-source compiler developers in auto-tuning CUDA programs. In this paper, we present a performance model for predicting the execution time of a CUDA kernel. First, using profiling together with static analysis of the CUDA and PTX code, two datasets are created for the execution times of computation-bound and memory-bound kernels. Then, artificial neural networks, support vector machines, and extreme learning machines are used to predict the execution time of a CUDA kernel. The experimental results show that the extreme learning machine can predict the execution time of a computation-bound kernel with a maximum error of 3.42% and that of a memory-bound kernel with a maximum error of 9.84%. Cross-validation was used to validate the results, and feature selection methods were used to determine the impact of each input feature on kernel execution time.
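The extreme learning machine named in the abstract is a single-hidden-layer network whose hidden weights are fixed at random; only the output weights are trained, via a least-squares solve. The following is a minimal NumPy sketch of that idea, with synthetic stand-in features and a single hold-out split — the paper's real feature set comes from profiling and CUDA/PTX static analysis, and its validation uses cross-validation, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=64):
    """Fit an ELM: random hidden layer (W, b), least-squares output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # least-squares solution
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# Synthetic stand-in: three hypothetical "kernel features" with a smooth
# dependence of run time on them (NOT the paper's features or data).
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2] + 0.5

# Single hold-out split; the paper validates with cross-validation instead.
Xtr, Xte, ytr, yte = X[:150], X[150:], y[:150], y[150:]
model = elm_fit(Xtr, ytr)
pred = elm_predict(model, Xte)
max_rel_err = np.max(np.abs(pred - yte) / yte) * 100
print(f"max relative error: {max_rel_err:.2f}%")
```

Because only the output layer is solved for, training reduces to one linear least-squares problem, which is why ELMs train much faster than backpropagated networks of the same size.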

Keywords

  • GPU
  • CUDA
  • Performance Model
  • Execution Time Prediction
  • Machine Learning
  • Feature Selection