Discovering the Clinical Knowledge about Breast Cancer Diagnosis Using Rule-Based Machine Learning Algorithms

Nopour, R.; Kazemi-Arpanahi, H.; Shanbehzadeh, M.

Health Education and Health Promotion
Article 13, Volume 10, Issue 1, 2022, Pages 89-97
Document Type: Descriptive & Survey
Authors
R. Nopour¹; H. Kazemi-Arpanahi²; M. Shanbehzadeh^* ³
¹Department of Health Information Management, Student Research Committee, School of Health Management and Information Sciences Branch, Iran University of Medical Sciences, Tehran, Iran
²Department of Health Information Technology, Abadan University of Medical Sciences, Abadan, Iran
³Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
Abstract
Aims: Breast cancer (BC) is one of the most prevalent cancers and the main cause of cancer-related deaths in women globally. This study aimed to construct several rule-based machine learning algorithms and compare their performance in predicting breast cancer. Instrument & Methods: The data were collected from the Breast Cancer Registry database of Ayatollah Taleghani Hospital, Abadan, Iran, from December 2017 to January 2021, and comprised 949 non-breast cancer and 554 breast cancer cases. Missing quantitative fields were replaced with mean values, and missing qualitative fields were imputed with the K-nearest neighbor algorithm. Next, the Chi-square test and binary logistic regression were used for feature selection. Finally, the best rule-based machine learning algorithm was identified by comparing different evaluation criteria. RapidMiner Studio 7.1.1 and Weka 3.9 software were used. Findings: Feature selection identified nine variables as the most important for data mining. Comparing the rule-based machine learning algorithms showed that the J-48 algorithm, with an accuracy of 0.991, an F-measure of 0.987, and an AUC of 0.9997, performed better than the others. Conclusion: J-48 provides a reasonable level of accuracy for BC risk prediction. We believe it would be beneficial for designing intelligent decision support systems for the early detection of high-risk patients, which can inform proper interventions by clinicians.
Keywords
Machine learning; Artificial intelligence; Data mining; Breast cancer; Decision tree
Full Text
Introduction
Breast cancer (BC) is the most frequent and fatal malignancy among women, accounting for an estimated 11.7% of all cancer cases and about 20% of all cancer-related deaths. Globally, it is the second leading cause of cancer death among people (men and women) after lung malignancies, in both developing and developed countries ^[1]. According to the global cancer report, BC was the most commonly diagnosed cancer in 2020, with 2.3 million new cases ^[2]. Early detection and screening can significantly decrease patient costs and improve the overall likelihood of successful treatment and survival ^[3]. Evidence suggests that BC is a global challenge because of its heterogeneous, multifactorial, and aggressive nature and its destructive health effects ^{[4, 5]}. Malignant BC is often aggressive: it forms in the glands and mammary ducts in the early stages ^[6] and then metastasizes to the surrounding tissues, adjacent lymph nodes, and, in the advanced stages, specifically to the bones, liver, brain, or lungs ^[7]. Regrettably, many malignancies are detected only in the advanced stages of the disease, when the tumor has already metastasized to the tissues around the breast, the axillary lymph nodes, and even other organs ^{[8, 9]}. Accordingly, a growing body of literature recognizes the benefits of systematic and up-to-date screening policies ^[10]. The best-known screening methods include mammography, thermography, and tissue sampling techniques, which are implemented more rigorously in many developed countries ^[11]. However, these screening methods are time-consuming, expensive, and highly complicated. On the other hand, there has recently been renewed interest in techniques such as breast self-examination (BSE) and clinical breast examination (CBE).
Despite the low cost and availability of these techniques, studies have reported inconsistent results on their effectiveness ^{[12, 13]}. Several clinical and non-clinical factors influence the incidence of BC ^[14]. The different stages and severity of BC, together with ambiguities and unpredictable situations regarding its outcomes, necessitate adopting innovative technologies for screening ^[15]. Recently, researchers have shown increased interest in newly developed, non-invasive digital technologies such as artificial intelligence (AI) systems, which can be effective in the rapid, accurate, and timely diagnosis of malignancies ^[16]. Specifically, rapid diagnosis of cancers in the early stages is considered the most significant factor for definitive treatment of the disease, prevention of unpleasant complications, and increasing patients’ chances of survival ^[17]. Machine learning (ML), a subset of AI, has many applications in many industries, including healthcare ^[18]. ML plays a crucial role in managing malignancies, for example in predicting prognosis, diagnosis, and treatment outcomes from the big data available in the medical field ^[19]. In the last few decades, several ML-based methods have been developed for the effective and timely prognosis and screening of BC ^{[20, 21]}. These methods support decisions by extracting hidden patterns and applied knowledge from the raw dataset ^[22]. Clinical decision support systems (CDSSs) based on rule-based logic ^[23] and decision tree (DT) algorithms ^[24] are considered useful, practical, and flexible tools for modeling medical diagnoses and supporting complex decisions ^[23].
Rule-based machine learning (RBML) is increasingly adopted for BC because the disease has different stages and degrees of severity, its behavior and outcome involve ambiguities and unpredictable situations, and various clinical and non-clinical factors are involved in its emergence and progression ^{[23, 25]}. Several studies have evaluated the application of ML algorithms to BC risk classification and prediction based on clinical variables. Momenyan et al. developed an optimum ML-based intelligent model for classifying BC risk ^[26]. Other researchers compared three different ML algorithms for BC risk classification ^{[27, 28]}. Solanki and colleagues investigated the prediction of benign or malignant BC using selected ML techniques ^[29]. Finally, Salod et al. compared the performance of eight ML algorithms in BC screening and detection ^[2]. In recent years, many RBML techniques have been applied to predicting BC and classifying disease outcomes. Therefore, this study aimed to develop an appropriate and scientific screening model based on selected RBML algorithms for earlier detection of the disease, in order to improve diagnostic efficiency and decrease the risk of mortality caused by BC.
Instrument and Methods
This retrospective single-center study aimed to develop a BC risk prediction model by training seven popular RBML algorithms and selecting the best-performing one. Models were trained and evaluated on the data of suspected BC cases from December 2017 to January 2021. BC cases were extracted from the BC Registry database of Ayatollah Taleghani Hospital, Abadan, Iran. The Registry database contains 2854 patient records with 30 features. The independent (input) features fall into six main classes: patient characteristics, nutritional factors, medical history, history of BC and related interventions, clinical manifestations, and epidemiological factors.
The dependent variable (output) is the diagnosis of BC, coded as 0 for non-BC cases and 1 for BC cases. The primary variables of the registry database associated with BC prognosis are listed below:
- Demographic: Age, job, education, nationality, the ratio of waist to breast, and Body Mass Index;
- Nutritional factors: Salt, vegetable, dairy, fruit (average in days over the past 5 years), fast food, and oil consumption;
- History of diseases: Diabetes, common cold, hyperlipidemia, hyperglyceridaemia, hypercholesterolemia, hypertension, and fatness;
- History of breast cancer and interventions: A personal history of breast cancer, history of breast sampling, history of chest radiotherapy, and family history of breast cancer;
- Clinical manifestations: Presence of a mass in the upper quadrant of the breast or in an unspecified region of the breast;
- Epidemiological factors: Walking, heavy job, physical activity, optimal physical activity, and alcohol consumption;
- Outcome: BC and non-BC.
After applying the exclusion criteria, 1668 case records were ultimately chosen for the study (Diagram 1).
Diagram 1) Flow chart describing patient selection
The Abadan University of Medical Science ethics board approved the study design. Before the ML algorithms were implemented, the raw dataset was preprocessed, a common requirement for many ML predictions. For this purpose, we removed samples with more than 70% missing data from the analysis. For the remaining missing fields, we used the mean of the available values for quantitative variables and the K-nearest neighbor (KNN) algorithm with Euclidean distance for qualitative variables. The implementation was carried out in the RapidMiner Studio 7.1.1 environment.
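The imputation step described above can be sketched in a few lines. This is a minimal illustration only, not the RapidMiner workflow used in the study: the function names, the column names (`age`, `bmi`, `alcohol`), the toy records, and the choice of k=3 are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def impute_mean(rows, col):
    """Fill missing (None) values of a quantitative column with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[col] is None:
            r[col] = mean

def impute_knn_mode(rows, target, numeric_cols, k=3):
    """Fill missing values of a qualitative column with the most common value
    among the k nearest complete rows (Euclidean distance on numeric_cols)."""
    donors = [r for r in rows if r[target] is not None]
    for r in rows:
        if r[target] is None:
            nearest = sorted(
                donors,
                key=lambda d: sqrt(sum((d[c] - r[c]) ** 2 for c in numeric_cols)),
            )[:k]
            r[target] = Counter(d[target] for d in nearest).most_common(1)[0][0]

# Hypothetical toy records (not taken from the registry).
rows = [
    {"age": 40, "bmi": 24.0, "alcohol": "No"},
    {"age": 42, "bmi": None, "alcohol": "No"},
    {"age": 65, "bmi": 30.0, "alcohol": "Yes"},
    {"age": 41, "bmi": 26.0, "alcohol": None},
    {"age": 66, "bmi": 30.0, "alcohol": "Yes"},
]
impute_mean(rows, "bmi")                            # quantitative field
impute_knn_mode(rows, "alcohol", ["age", "bmi"])    # qualitative field
```

In practice the numeric columns would usually be scaled before computing distances, so that a wide-ranging variable such as age does not dominate the neighbor search.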
To select the best predictors and reduce the dimension of the dataset, the Chi-square test of independence was used to determine the relationship between each of the 30 independent variables and the dependent variable (BC diagnosis: Yes or No) as the output class. A value of p<0.01 was considered statistically significant. After determining the most important factors affecting BC, we trained a set of RBML algorithms, namely J-48, random forest (RF), random tree (RT), REP-Tree, decision table (DT), J-RIP, and Part, to classify the diagnosis value in the dataset and to elicit knowledge about BC classification in an IF-THEN structure. These techniques are used to discover the knowledge and hidden patterns for diagnosing BC that exist in the dataset. Weka 3.9 software was used for this purpose. In the last phase, the performance of all algorithms was assessed with the following criteria: positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, accuracy, F-score, and the area under the receiver operating characteristic curve (AUC-ROC). The confusion matrix was used to measure the classification capability of each data mining algorithm, and its entries are defined as follows. True Positive (TP) and True Negative (TN) are the numbers of BC and non-BC cases, respectively, that the algorithm correctly classifies as positive and negative. False Positive (FP) and False Negative (FN) are the numbers of non-BC and BC cases, respectively, that the algorithm incorrectly classifies as positive and negative. Ten-fold cross-validation was used to estimate and compare the performance of all data mining algorithms and to account for the error in the performance estimates.
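As a rough illustration of the Chi-square screening step in this section, the sketch below computes the Pearson statistic and its p-value for a single binary factor against the BC outcome. The counts are hypothetical and chosen only for the example, the function name is an assumption, and no continuity correction is applied; the statistical software used in the study may differ on these points.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]: rows = factor present / absent, columns = BC / non-BC."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, not taken from the registry.
chi2, p = chi_square_2x2(30, 10, 20, 40)
keep_feature = p < 0.01   # significance level stated in the text
print(f"chi2={chi2:.3f}, p={p:.2e}, keep={keep_feature}")
```

A variable would be retained for the next step only when its p-value falls below the 0.01 threshold.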
After the best algorithm was determined using the different performance criteria, the knowledge for diagnosing BC was extracted in the last step using the IF-THEN structure, and the rules covering the largest number of classified samples were considered the main knowledge for diagnosing BC.
Findings
After preprocessing, 554 positive and 949 negative BC cases remained and were included in the statistical analysis. The mean age was 48.146±13.074 years in the afflicted women and 43.212±9.70 years in the non-afflicted cases. Table 1 shows the basic data of the two groups. According to the Chi-square test, 18 variables had a significant relationship with the diagnosis of BC at p<0.01. The variables upper in quadrants breast cancer, history of chest radiotherapy, and fatness were the three most important factors for diagnosing BC at p<0.001 (Table 2).
Table 1) The frequency results of demographic variables
Table 2) The most important BC prediction factors at p<0.01
The results of determining the combined correlation between the BC diagnostic factors and the dependent variable, using binary logistic regression (BLR) with the forward method, are presented in the IF-Term Removed table (Table 3). As shown in Table 3, in the 9^th step of the BLR, after entering the nine variables history of breast sampling, history of chest radiotherapy, family history of BC, alcohol consumption, vegetable consumption, diabetes, physical activity, age, and upper in quadrants breast cancer, the average log-likelihood of the model was -61.91 at p<0.01. In conclusion, selecting these nine variables in the BLR model reduced the log-likelihood and increased the performance of the BLR; therefore, these variables had a significant hybrid correlation coefficient with the output class at p<0.01.
Comparing the selected RBML algorithms on their confusion matrices showed that the decision table (DT) was the only algorithm that classified all non-BC samples correctly (FP=0, TN=949), making it the best algorithm in this regard. The J-48 decision tree algorithm, with FP=1 and TN=948, also classified the non-BC cases well. In addition, with FN=12 and TP=542, it classified the positive cases better than the other algorithms. The PPV, NPV, sensitivity, specificity, accuracy, and F-score of the algorithms are shown in Diagram 2. Although the DT rule-based algorithm, with NPV=1, showed the best capability in classifying only the negative BC cases, the J-48 decision tree algorithm, with accuracy=0.991 and F-measure=0.987, achieved the best overall performance in classifying all research samples. The ROC curves of all RBML algorithms are shown in Diagram 3. Overall, the comparison across the different evaluation criteria showed that the J-48 decision tree algorithm, with a PPV of 0.998, NPV of 0.987, sensitivity of 0.978, specificity of 0.998, accuracy of 0.991, F-measure of 0.987, and AUC of 0.9997, yielded the best performance among the algorithms for predicting BC risk. Diagram 4 depicts the J-48 decision tree, and all technical characteristics used in this study are listed below. Finally, the knowledge about diagnosing BC extracted from this algorithm as IF-THEN rules with the largest number of classified samples is presented and interpreted.
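The evaluation criteria can be recomputed from the J-48 confusion-matrix counts reported above (TP=542, FN=12, FP=1, TN=948) using the standard definitions. The sketch below is a worked check, not the Weka output; the function and variable names are assumptions made for the example.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard evaluation criteria derived from a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "ppv": tp / (tp + fp),                      # positive predictive value
        "npv": tn / (tn + fn),                      # negative predictive value
        "sensitivity": tp / (tp + fn),              # true positive rate
        "specificity": tn / (tn + fp),              # true negative rate
        "accuracy": (tp + tn) / total,
        "f_measure": 2 * tp / (2 * tp + fp + fn),   # harmonic mean of PPV and sensitivity
    }

# J-48 counts as reported in the text.
j48 = confusion_metrics(tp=542, fn=12, fp=1, tn=948)
for name, value in j48.items():
    print(f"{name}: {value:.4f}")
```

Each recomputed value differs from the reported figure by less than 0.002; the small differences in the third decimal place may reflect rounding or truncation in the reported values.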
The most important technical settings used to build the best-performing J-48 model were batch size=100, binary split=False, collapse tree=True, confidence factor=0.25, minimum number of objects=2, number of decimal places=2, number of folds=3, reduced error pruning=True, and seed=1. Some of the knowledge extracted from the J-48 decision tree algorithm, covering the largest numbers of classified samples, is as follows:
IF Radio therapy=Yes THEN Diagnosis=breast cancer;
IF Radio therapy=No & Alcohol=Yes THEN Diagnosis=breast cancer;
IF Radio therapy=No & Alcohol=No & Age <=38 THEN Diagnosis=Non-breast cancer.
According to the J-48 decision tree diagram, the history of chest radiotherapy is the most important factor for diagnosing BC. Three rules were obtained as the most important patterns:
1- The first rule uses only the history of chest radiotherapy as its condition. A history of chest radiotherapy was observed in 455 of the positive cases, and if a person has this risk factor, the probability of BC can be 82.1%;
2- Under the second rule, if a person has no history of chest radiotherapy but consumes alcohol, the probability of BC can be 11.3% (63 positive samples were classified correctly);
3- The third rule is very important for identifying non-BC cases: if a person has no history of chest radiotherapy, does not consume alcohol, and is aged 38 years or younger, the probability of not having BC can be 89.5% (850 correctly classified samples / 949 total negative samples).
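The three rules listed above can be written as an ordered rule list. This is a sketch of those three rules only, not of the full pruned tree in Diagram 4, so cases that none of the rules covers are returned as undetermined. The function name and the support ratios are illustrative; dividing the first two counts by the 554 positive cases is an assumption inferred from the reported percentages, while the third ratio is stated in the text.

```python
from typing import Optional

def classify_by_top_rules(radiotherapy: bool, alcohol: bool, age: int) -> Optional[str]:
    """Apply the three reported IF-THEN rules in order; None means not covered."""
    if radiotherapy:                 # Rule 1: Radio therapy=Yes
        return "breast cancer"
    if alcohol:                      # Rule 2: Radio therapy=No & Alcohol=Yes
        return "breast cancer"
    if age <= 38:                    # Rule 3: Radio therapy=No & Alcohol=No & Age<=38
        return "non-breast cancer"
    return None                      # handled by deeper branches of the tree

# Support of each rule as a share of its class (counts from the text).
rule_support = {
    "rule_1": 455 / 554,   # positive cases with a history of chest radiotherapy
    "rule_2": 63 / 554,    # positive cases without radiotherapy, with alcohol
    "rule_3": 850 / 949,   # negative cases covered by the third rule
}
for rule, share in rule_support.items():
    print(f"{rule}: {share:.1%}")
```

The ratios reproduce the reported percentages up to a difference in the last digit, which is consistent with the reported values having been truncated rather than rounded.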
Table 3) IF-Term removed table for BC diagnostic factors (p<0.001)
Table 4) The selected RBML confusion matrix
Diagram 2) Various performance evaluation criteria of different RBML algorithms (The vertical and horizontal axes of the diagram show the True Positive Rate (TPR) and False Positive Rate (FPR), respectively)
Diagram 3) The ROC of different RBML algorithms
Diagram 4) The J-48 pruned decision tree
Discussion
The purpose of the current study was to effectively identify BC cases through intelligent RBML techniques. Multiple RBML-based predictive models were developed for early risk prediction of BC based on the clinical data of 1668 suspected BC cases. We trained seven RBML algorithms, namely J-48, RF, RT, REP-Tree, DT, J-RIP, and Part, on the parameters most closely related to the risk of BC, as derived from a correlation coefficient analysis. The selected algorithms were applied to the pre-processed dataset. This study first selected the most reliable and clinically relevant predictors of BC using the Chi-square test of independence. We then identified nine highly correlated variables that had a meaningful hybrid correlation coefficient with the output class at P<0.05. ML has been shown to be an effective tool in dealing with BC problems ^[30]. To validate the system, the k-fold cross-validation method was used. To compare the performance of the selected RBML classifiers, several evaluation metrics derived from the confusion matrices were used: PPV, NPV, sensitivity, specificity, accuracy, F-score, and AUC-ROC. Several studies have evaluated the application of ML algorithms to BC risk classification and prediction based on clinical variables ^[31]. Momenyan's results showed that J-48 achieved optimum predictive performance with an accuracy of 93.3% ^[26]. The results obtained by Alickovic et al. showed that the J-48 DT method predicted the probability of BC more accurately than other classifiers ^[32]. The best results in Dawngliani's study were obtained from the J-48 model, with an accuracy of 84.21%, while a random tree showed the lowest accuracy (76.49%) ^[33]. Saabith also stated that J-48 was the best ML technique for predicting BC, with an accuracy of 79.97% ^[34]. Solanki et al. reported that J-48, with an accuracy of 98.83% and an AUC of 0.983, achieved the best performance for BC diagnosis and for differentiating benign from malignant patients ^[29]. Similarly, Al-Salihy showed that the J-48 DT algorithm outperformed other algorithms with an accuracy of 97.7% ^[35]. Ortega's study showed that applying the J-48 algorithm to BC risk assessment gave optimum accuracy (95%) in risk classification and disease screening ^[36]. In the work of Silva and of Mohammed, the authors concluded that J-48 yielded better performance than the other algorithms (accuracy of 91% and 98.2%, respectively) ^{[27, 28]}. The results of Solanki's research showed that the model developed with J-48 yielded the best performance in terms of classification accuracy ^[29]. Finally, Salod's results showed that the model developed using J-48, with an AUC of 0.81, was the best-performing model ^[2]. The purpose of these studies was to propose the most effective ML-based predictive models for early BC prognosis by classifying BC risk. In our study, we applied two feature selection methods, the Chi-square test of independence and BLR as a hybrid correlation method, to determine the most important factors affecting BC. The Chi-square test of independence showed that 18 diagnostic variables were significant at p<0.01 and were therefore considered the most important factors determining BC.
In addition, the BLR results showed that nine variables (history of breast sampling, history of chest radiotherapy, family history of breast cancer, alcohol consumption, vegetable consumption, diabetes, physical activity, age, and upper in quadrants breast cancer) had a common hybrid correlation with BC diagnosis at p<0.05 across the nine steps of the BLR, and these variables were therefore used to build the decision trees and for the subsequent knowledge representation. The experimental results of the present work, similar to those of the reviewed studies, showed that the J-48 decision tree, with PPV=0.998, NPV=0.987, sensitivity=0.978, specificity=0.998, accuracy=0.991, F-measure=0.987, and AUC=0.9997, has the best capability for earlier detection of the disease, which can improve diagnostic efficiency and decrease the risk of mortality caused by BC. The results of the present study may help physicians make a correct, accurate, and timely diagnosis of the disease and reduce its severe complications and the resulting mortality. Despite the small amount of data fed into the models and the lack of clinical variables, the selected RBML models, especially the J-48 algorithm, performed well. Moreover, applying this model in real clinical environments will assist physicians owing to its simplicity, user-friendliness, and ease of use. Despite its strength in the timely and accurate prediction of BC risk, this study had some limitations that need to be addressed. First, this is a retrospective study that suffers from low data quantity (missing or duplicate cells) and non-optimal quality (imbalanced, noisy, and meaningless values). Second, we dealt with a single-center dataset of limited sample size, which undoubtedly limits the generalizability of the proposed model. Moreover, we used only seven RBML algorithms for the prediction analyses, based on some clinical features.
Finally, the selected registry dataset lacks some important variables, such as para-clinical indicators. In the future, the accuracy and generalizability of our model could be enhanced by testing more ML techniques on a larger, multicenter, prospective dataset with higher-quality and validated data. The obtained results confirm the positive effect of the nine selected features in predicting the risk of BC; feature selection acted as a powerful optimizer by choosing the best subset of features to include in the RBML algorithms. Prediction models built with different ML algorithms have shown more promising performance than traditional approaches. Hence, ML algorithms can construct complex models and make reliable decisions when fed with appropriate features.
Conclusion
The evaluation of the selected ML techniques demonstrates the suitability of J-48 for predicting BC risk. Our proposed predictive model discriminates between persons at high and elevated risk of BC and non-BC cases based on the most important variables, and it can be used as an essential, non-invasive clinical screening tool for the early identification of BC.
Acknowledgments: We thank the Research Deputy of the Abadan University of Medical Sciences for financially supporting this project.
Ethical Permissions: The Abadan University of Medical Science ethics board approved the study design (Ethics code: IR.ABADANUMS.REC.1400.040).
Conflicts of Interests: This article is extracted from a research project supported by the Abadan University of Medical Sciences.
Authors’ Contributions: Nopour R. (First author), Methodologist/Statistical Analyst (40%); Kazemi Arpanahi H. (Second author), Introduction Writer/Discussion Writer (20%); Shanbehzadeh M. (Third author), Introduction Writer/Methodologist/Assistant researcher (40%).
Funding/Sources: The Abadan University of Medical Sciences supported this project.
References
