A survey of checkpoint overhead reduction methods

Aghbolaghi Tabrizi, Mostafa; Shahbazian, Arian; Shahbakhti, Kimiya; Zamani, Kooshyar; Kanaani, Navid; Asghari, Seyyed Amir; Binesh Marvasti, Mohammadreza

doi:10.22052/scj.2024.254835.1240

Article type: Review article

Authors

¹ Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran, Iran

² Department of Data Science and Technology, Faculty of Intelligent Systems Engineering, University of Tehran, Tehran, Iran

Abstract

Today, fault tolerance is essential in a wide range of systems. Creating checkpoints and safe points to roll back to when a failure occurs increases system reliability. The main challenge in using checkpointing methods is their overhead. This overhead results from executing the checkpoint-creation process and reduces overall system performance. Many methods have addressed this problem so far. They attempt to reduce the checkpointing overhead so that the system reaches maximum efficiency. This paper studies and reviews various methods for reducing checkpoint overhead. The methods are classified into several groups. The categories are defined based on the type of checkpoint implementation and the different levels of systems. They comprise: coordinated checkpoints, system-level checkpointing, application-level checkpointing, and checkpointing in distributed computing systems. Finally, the categories are presented together in a comprehensive table and a conclusion is drawn for each category.

Keywords

Subjects

Computer Science

Article title [English]

A survey of checkpoint overhead reduction methods

Authors [English]

Mostafa Aghbolaghi Tabrizi ¹
Arian Shahbazian ¹
Kimiya Shahbakhti ¹
Kooshyar Zamani ¹
Navid Kanaani ¹
Seyyed Amir Asghari ²
Mohammadreza Binesh Marvasti ¹

¹ Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran, Iran

² Department of Data Science and Technology, Faculty of Intelligent Systems, University of Tehran, Tehran, Iran

Abstract [English]

Fault tolerance is now essential in many kinds of systems. Checkpointing methods, which create safe points to roll back to after a failure, increase system reliability. The main challenge in using these methods is their overhead, which arises from executing the checkpointing process and degrades system performance. Numerous methods have therefore been introduced to address this problem by reducing the overhead and thereby improving system performance. This paper studies and reviews various methods for reducing checkpoint overhead. The methods are organized into distinct categories, defined by the type of checkpoint implementation and the system level at which it operates: coordinated checkpointing, system-level checkpointing, application-level checkpointing, and checkpointing in distributed computing systems. Finally, the paper summarizes these categories in a comprehensive table and draws a conclusion for each.

Keywords [English]

Checkpoint
Checkpoint overhead reduction
Coordinated checkpoint
System-level checkpoint
Application-level checkpoint
Checkpoint in distributed computing systems
