Improving the Performance of Machine Learning Algorithms for Heart Disease Diagnosis by Optimizing Data and Features

Authors

Abstract

The heart is one of the most vital organs of the body, and heart disease is the leading cause of death both worldwide and in Iran. Early, timely diagnosis is therefore essential for preventing the disease and reducing the deaths it causes. Many studies have addressed the prediction, diagnosis, and treatment of heart disease, but most of them have focused on prediction. The purpose of this study is to develop models for heart disease diagnosis using machine learning, neural network, and deep learning algorithms. The models are developed on the Cleveland heart disease dataset from the University of California Irvine (UCI) repository. After full data preprocessing, including outlier detection, normalization, discretization, feature selection, and feature extraction, the dataset is transformed into two versions, one normalized and one discretized, to suit the nature of the different algorithms. In constructing the machine learning and neural network models, two approaches are used for model tuning: randomized search with cross-validation, and grid search with Talos Scan. Among the evaluated machine learning models (decision tree algorithms, random forest, support vector machine (SVM), and XGBoost), SVM achieves the highest accuracy at 92.9%; among the neural network models, the multilayer perceptron (MLP) achieves the highest accuracy at 94.6%.
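The idea of keeping two versions of the same data, one normalized and one discretized, can be illustrated with a minimal sketch. The abstract does not say which normalization or discretization methods were used, so min-max scaling and equal-width binning are assumptions here. The function names and the sample blood pressure values are hypothetical and chosen only for illustration.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical resting blood pressure readings (illustrative values only,
# not taken from the Cleveland dataset).
resting_bp = [110.0, 120.0, 130.0, 140.0, 150.0]

normalized = min_max_normalize(resting_bp)            # continuous view in [0, 1]
discretized = equal_width_bins(resting_bp, n_bins=4)  # categorical view

print(normalized)
print(discretized)
```

In a sketch like this, the normalized view would typically feed scale-sensitive models such as SVM or MLP, while the discretized view would suit models that work on categorical splits. That pairing is a common convention rather than something the abstract states.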

Keywords

