Early Breast Cancer Prediction Using Dermatoglyphics: Data Mining Pilot Study in a General Hospital in Iran
Health Education and Health Promotion
Article 15, Volume 9, Issue 3, 2021, Pages 279-285
Document Type: Descriptive & Survey
Authors
S.M. Ayyoubzadeh (1); A. Almasizand (2); Sh. Rostam Niakan Kalhori (1); T. Baniasadi (3); S. Abbasi* (2)
(1) Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran
(2) Department of Laboratory Science, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran
(3) Department of Health Information Technology, Faculty of Para-Medicine, Hormozgan University of Medical Sciences, Bandar Abbas, Iran
Abstract
Aims: Dermatoglyphics is the study of skin patterns on the hands and feet. Some studies have shown that specific fingerprint patterns could be a risk factor for breast cancer. Thus, this study aimed to evaluate fingerprint patterns and other easy-to-obtain features in predicting the risk of breast cancer.
Instrument & Methods: This descriptive study was conducted in 2020. The dataset contained 462 records of female patients at Imam Khomeini Hospital Complex, Tehran, Iran. The weight of each factor was determined by the Information Gain index. Predictive models were built once without fingerprint features and once with fingerprint features using Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine, and Deep Learning classifiers. RapidMiner 9.7.1 software was used.
Findings: The most important factors determining breast cancer were age, having a child, menopause situation, and menopause age. The Random Forest (RF) model performed best, with an accuracy of 84.43% and an area under the receiver operating characteristic curve (AUC) of 0.923. The fingerprint pattern features increased the RF accuracy from 79.44% to 84.43%.
Conclusion: An early breast cancer screening model could be built using data mining methods. Fingerprint patterns could increase the performance of these models. The Random Forest model could be used. The results of such models could be used in designing apps for breast cancer self-screening.
Keywords
Data mining; Risk factor; Breast cancer; Dermatoglyphics
Full Text
Introduction
Breast cancer is the most common type of cancer among women worldwide [1] and the main cause of cancer-related death among them [2]. In most cases, breast cancer is diagnosed too late [3]. Early diagnosis could prevent stage progression and the death of patients [4]. After a tumor has formed, breast cancer can be diagnosed through imaging techniques such as soft tissue mammography, breast ultrasound, and magnetic resonance imaging [5]. Techniques able to predict the risk of breast cancer could help in preventing and successfully treating this disease. Mammography is the gold standard of screening and can effectively detect cancer at this stage [6, 7]. However, it faces barriers such as cost, fear, embarrassment, self-denial, lack of family support, problems at the centers providing screening, culture, and religion [8-13].
Breast cancer is caused by a combination of genetic and environmental factors [3]. Thus, predicting breast cancer even before a mass forms, by analyzing data on cancer-related factors from other patients and building models, could be a solution. Several factors affecting breast cancer have been addressed in the literature. They can be categorized as: 1) socio-demographic characteristics, including age, ethnicity, and education; 2) clinical, exogenous hormonal, menstrual, and reproductive factors, including breast surgery history, first-degree family history of breast cancer, hormone replacement therapy history, menopausal status, and breastfeeding history; and 3) anthropometric and lifestyle factors, including BMI, alcohol intake, intake of soybean-related products, and physical activity [14].
One factor that is less addressed in the literature but easily obtained is the skin pattern of the hands and feet. Skin line patterns reflect congenital abnormalities and might also be used to identify genetic disorders [15]. They could help identify women with breast cancer or at risk of it [16].
A family history of breast cancer may be associated with a specific pattern of fingerprints. Skin line patterns are genetically controlled but may also be influenced by environmental factors during the first three months of pregnancy, so they may reflect both a person's genetics and susceptibility to breast cancer. The fingerprint can be an anatomical, noninvasive, and cheap marker for detecting susceptible individuals [5]. Compared with controls, most patients with breast cancer have a Whorl fingerprint pattern on six or more fingers [5, 16, 17]. However, some studies have not found a significant relationship between fingerprint patterns and breast cancer [18]. Fingerprint patterns can be categorized as Loop, Central Pocket Loop (Whorl), Lateral Pocket Loop (Whorl), Plain Arch, Tented Arch, True Whorl (Whorl), Accidental Loop (Whorl), Twinned Loop (Whorl), and Double Loop (Whorl) [16]. In general, dermatoglyphics is defined as the study of the skin patterns of the hands and feet. These patterns are affected by genetics and could be a risk indicator for diseases [19]. Some studies have also shown that they could be used to predict the risk of breast cancer [5, 16, 17].
Recently, Artificial Intelligence (AI) systems have attracted significant attention, especially in the healthcare field. A recent study showed that an AI system could predict breast cancer risk more accurately than average radiology experts [20]. Data mining, as a subfield of AI, could be used for prediction purposes [21]; it deals with methods and tools for discovering new and hidden knowledge from databases [22]. Because of the importance of breast cancer risk prognosis and detection in the early stages, many studies have investigated data mining models in this regard. These studies used multiple data mining techniques, including decision trees, logistic regression, support vector machines, and Naïve Bayes [21, 23-28].
Although several studies have used AI systems, especially data mining methods, to evaluate the risk of breast cancer [23-25], few have evaluated finger patterns with data mining for breast cancer risk prediction. Thus, this study aims to assess fingerprint patterns and other easy-to-obtain features in predicting the risk of breast cancer.
Instrument & Methods
This descriptive-retrospective study was conducted in 2020 and included female patients in Imam Khomeini Hospital Complex, Tehran, Iran. The case-control dataset contains 308 controls and 154 cases described in a previous study [16]. The sample size was chosen based on previous similar datasets [29, 30]. The samples were randomly selected. Patients were included in the case group if they had a pathological diagnosis of breast cancer at the Cancer Institute (women's wards 1 and 3 and central clinics 1 and 2) of Imam Khomeini Hospital. The control group included healthy women with no history of breast cancer or other neoplastic diseases and with no relatives who had a history of breast cancer. Women with hysterectomy or artificial menopause, or who had been exposed to radiation or chemotherapy during their lifetime, were excluded from the study. This study was approved by the Research Ethics Committee of Tehran University of Medical Sciences. Informed consent for the research was obtained from all participants (or their parents / legal guardians) before entering the study.
After informed consent and permission were obtained, both the patient and control groups were asked to complete a demographic information form covering age, ethnicity, marital status, type of parental marriage, family history of cancer, and smoking. The subject's fingers and thumbs were placed in purple ink, and the fingerprints were taken on white paper. Patterns of all ten fingers were analyzed.
Fingerprints were classified into nine patterns: Loop, Central Pocket Loop (Whorl), Lateral Pocket Loop (Whorl), Plain Arch, Tented Arch, True Whorl (Whorl), Accidental Loop (Whorl), Twinned Loop (Whorl), and Double Loop (Whorl). A dataset containing 462 records was created. This dataset includes 18 input features and one target feature indicating the presence of breast cancer. The features are presented in Table 1. Values were transformed from numerical to categorical for the menopause situation, has child, operation history, family history, finger dermatoglyphic patterns, and has breast cancer features. Missing values were imputed using the K-nearest neighbor (K-NN) method. The age, menstruation age, and first live birth age values were normalized.
Information gain is calculated based on Shannon's information theory [31].
The Naïve Bayes classifier uses Bayes' theorem with the assumption that the features are independent. It predicts the target feature by calculating the most probable value given the input features [32].
A decision tree (DT) is an easy-to-interpret and important classifier in medical research. The tree is shown in a flowchart-like format in which each node except the leaves represents a test on a specific input feature. Each branch leaving a node corresponds to a possible value of that feature. At the bottom of the tree, the leaves show the target feature (class label) reached along that path. The decision tree approach partitions the sample space so that the leaves are as pure as possible. Algorithms such as CHAID, CART, ID3, and C4.5 could be used to build this classifier [33].
Random Forest is an ensemble classifier composed of multiple decision trees, each built on a subset of features. This model has two main hyperparameters: the number of features in the subset and the number of decision trees in the forest [34, 35].
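
The Information Gain weighting mentioned above can be illustrated with a short sketch. This is not the RapidMiner implementation used in the study; the function names (`entropy`, `information_gain`) and the eight toy records are hypothetical and serve only to show how the weight of a categorical feature follows from Shannon entropy: the entropy of the class labels minus the weighted entropy that remains after splitting on the feature.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the records by the value of one categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (NOT taken from the study's dataset).
labels = ["case", "case", "case", "case",
          "control", "control", "control", "control"]
menopause = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
unrelated = ["a", "b", "a", "b", "a", "b", "a", "b"]

gain_menopause = information_gain(menopause, labels)
gain_unrelated = information_gain(unrelated, labels)
print(f"menopause: {gain_menopause:.3f}, unrelated: {gain_unrelated:.3f}")
```

In this toy example the feature that partly separates cases from controls receives a positive weight, while the feature that is distributed identically in both groups receives a weight of zero. Ranking features by this value gives an ordering of the kind reported in the Findings.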
In the Support Vector Machine (SVM) classifier, the boundary samples that separate the classes in the sample space, namely the support vectors, are determined, and an optimal linear decision boundary is placed between them. This boundary is a hyperplane that separates the sample space with maximum margin [31].
In contrast with traditional data mining classifiers, deep learning can automatically extract features from raw data by using multiple layers [36, 37]. Deep learning could be seen as a subset of artificial neural networks [37].
As the previous study shows, the whorl pattern count could be a predictive factor in breast cancer prediction [16]. The models were therefore built three times: once without finger pattern features, once with finger pattern features, and once more after calculating the whorl count and adding it to the dataset as a feature.
The performance of the models was measured by calculating the following indices: accuracy (Equation 1), sensitivity (Equation 2), specificity (Equation 3), and the area under the receiver operating characteristic curve (AUC). These indices are computed from the way a classifier categorizes the test data. For a binary classifier, the categories are True Positives (TP), the number of positive instances that the classifier predicts to be positive; False Positives (FP), the number of instances that are not positive but that the classifier predicts to be positive; False Negatives (FN), the number of positive instances that the classifier predicts to be negative; and True Negatives (TN), the number of negative instances that the classifier predicts to be negative.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (Equation 1)
Sensitivity = TP / (TP + FN) (Equation 2)
Specificity = TN / (TN + FP) (Equation 3)
The models were evaluated using the K-fold cross-validation method (K=10). Cross-validation is a method widely used to facilitate model estimation and variable selection [38]. RapidMiner 9.7.1 software was used to build and evaluate the models.
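
The three count-based indices described above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions of accuracy, sensitivity, and specificity; the function name `classification_metrics` and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: positives predicted positive    fp: negatives predicted positive
    fn: positives predicted negative    tn: negatives predicted negative
    Assumes at least one positive and one negative instance, so that no
    denominator is zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
    }


# Hypothetical counts for one test fold (NOT taken from the study).
metrics = classification_metrics(tp=40, fp=20, fn=10, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under 10-fold cross-validation, such counts would be obtained on each held-out fold and the resulting indices averaged or pooled across folds; which of the two RapidMiner reports depends on its configuration and is not stated in the text.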
Naïve Bayes, Decision Tree, Random Forest, and Support Vector Machine classifiers were built once without fingerprint features and once with fingerprint features.
Findings
The most important factors determining breast cancer were age, pregnancy experience (or has a child), menopause situation, and menopause age (Figure 1).
Figure 1) Factors' weight affecting breast cancer classification by Information Gain
The best AUC in the first experiment belonged to the deep learning model, and the best AUC in the two other experiments belonged to the Random Forest classifier (Figure 2; Table 2). The fingerprint pattern features increased the RF accuracy from 79.44% to 84.43%. The highest accuracy and sensitivity in these two experiments also belonged to this classifier. The Naïve Bayes classifier showed the best specificity (82.85%).
Figure 2) Random Forest classifier ROC curve (with Whorl count feature)
Table 2) Results of the comparison of classifiers' performance by the effect of finger pattern related features (numbers are in percentage)
Discussion
This study used data mining techniques to predict the early risk of breast cancer in Iranian women. The prediction relies on easy-to-obtain features and could therefore be performed anywhere and inexpensively. The most important factors predicting breast cancer in our dataset were, in order: age, pregnancy experience, menopause situation, and menopause age.
Some studies that used data mining methods for predicting breast cancer are reviewed in this section. Senturk et al. applied seven algorithms (discriminant analysis, multi-layer perceptron, decision trees, logistic regression, support vector machines, Naïve Bayes, and K-NN) to the early diagnosis of breast cancer. They concluded that the best algorithm for breast cancer prediction was the Support Vector Machine [23].
Padmavathi used RBFNN, MLP, and logistic regression methods to study the performance of neural networks on breast cancer data. The accuracy was 97.0% for RBFNN, 91.3% for MLP, and 73.7% for logistic regression. Their specificity was 96.8%, 91.1%, and 72.6%, and their sensitivity was 97.3%, 92.1%, and 75.5%, respectively [24]. Rajesh & Anand attempted to classify breast cancer data using the C4.5 algorithm to diagnose cancer and distinguish between its different stages. Five hundred random records were used. They found an accuracy of ~94% in the training phase and ~93% in the testing phase [21]. Delen et al. compared three methods to find a better prediction model for breast cancer survivability, using a dataset of more than 200,000 cases. The results indicated that the decision tree (C5) was the best predictor, with 93.6% accuracy on the holdout sample, 90.66% specificity, and 96.02% sensitivity. Artificial neural networks were the second-best classifier, with 91.2% accuracy, 87.48% specificity, and 94.37% sensitivity, followed by the logistic regression models, with 89.2% accuracy, 87.86% specificity, and 90.17% sensitivity [26]. Alshammari et al. compared C4.5, Naïve Bayes, Multilayer Perceptron, and RBF Network for predicting breast cancer survivability. They reported that the decision tree (C4.5) algorithm, with an accuracy of 89.3%, a sensitivity of 89.1%, and a specificity of 98.5%, showed the best performance among them; in that study, high accuracy was considered very important [27]. In another study, by Kuo et al., a decision tree model was used to increase the level of diagnostic confidence in classifying breast tumors and was shown to perform with 96% accuracy, 96.67% specificity, and 93.33% sensitivity [25]. Aruna et al. compared data mining methods to find the best classifier according to accuracy, sensitivity, and specificity in breast cancer diagnosis. They used methods such as Naïve Bayes, SVM with an RBF kernel, RBF neural networks, decision trees (J48), and simple CART, and found that the SVM with an RBF kernel is a good classifier for this task [28].
In the present study, the Random Forest model showed the best performance, with an accuracy of 84.43% and a sensitivity of 88.92%, which suggests that it might be useful for breast cancer screening purposes. Random Forest has proved useful in other studies as well [39-41]. For instance, Nguyen et al. [41] showed that a method combining Random Forest with a feature selection technique could reach more than 99% accuracy. Comparing the performance of the models proposed in other studies with this study could be misleading; most of the mentioned studies employed a public dataset (the Wisconsin breast cancer dataset [29]) and evaluated their models on it. In this public dataset, the features describe the breast and the breast mass, while in this study, we did not use any features directly pertaining to the breast.
In this study, the deep learning method did not perform as well as in other studies [42-44]. This could be explained by three reasons: 1) as those studies note, deep learning techniques should be applied to raw data (e.g., images); 2) deep learning is suitable when the dataset contains many records, and with a small training dataset traditional classifiers could perform better than deep learning methods; 3) as mentioned earlier, the classifiers in those studies were applied to breast images, which contain direct information about the breast but entail imaging costs and the other barriers mentioned in the Introduction. In another study, the Decision Tree classifier showed high performance (from 96% to 99%) [45].
That study used relatively easy-to-obtain features, including gender, age, benign or malignant, past disease history, breast change, occupation, vegetarianism, height, and weight. However, the small sample size (220 records) used in that research might limit how far its performance can be generalized.
We examined the effect of finger pattern-related features on the classifiers' performance. The results showed that the finger pattern features could enhance the performance of the classifiers, and that adding the whorl count to the features could slightly improve it further. Some previous studies suggest that Whorl and loop finger skin patterns could indicate a risk of breast cancer. Sani et al., in research on 100 women with breast cancer in Dhrmais Cancer Hospital, India, showed that most women with breast cancer had Whorl finger skin patterns [46]. In another study, on 100 women with breast cancer in Bosnia and Herzegovina, the authors showed that the loop pattern count is related to breast cancer [47]. Manhas et al., in their review, concluded that dermatoglyphics could be a low-cost factor for predicting breast cancer [48], which is aligned with the findings of this study. It is recommended to conduct similar studies with a larger sample size and to use deep learning methods to detect skin patterns automatically.
Conclusion
An early breast cancer screening model could be built using data mining methods. Fingerprint patterns could increase the performance of these models. The Random Forest model could be used. The results of such models could be used in designing apps for breast cancer self-screening.
Acknowledgments: We would like to thank the Vice Chancellor for Research of Tehran University of Medical Sciences for supporting this research.
Ethical Permissions: The present study is the result of a research project with the ethical code of IR.TUMS.SPH.REC.1399.007 approved on 2020-04-11.
Conflicts of Interests: The authors declare that there is no conflict of interest.
Authors' Contribution: Ayyoubzadeh S.M. (First Author), Introduction Writer/Statistical Analyst/Discussion Writer (25%); Almasizand A. (Second Author), Introduction Writer (15%); Rostam Niakan Kalhori Sh. (Third Author), Methodologist/Assistant Researcher (15%); Baniasadi T. (Fourth Author), Methodologist/Assistant Researcher (15%); Abbasi S. (Fifth Author), Methodologist/Main Researcher (30%).
Funding/Support: This study has been funded and supported by Tehran University of Medical Sciences (TUMS); Grant no. 98-3-102-45631.