Abstract
Cervical cancer has become the third most common form of cancer in the world, after the widespread breast cancer. Human papillomavirus (HPV) infection is linked to the majority of cases. Preventive care, the most expensive way of fighting cancer, can prevent about 37% of cancer cases. The Pap smear examination is a standard procedure for the initial screening of cervical cancer. However, this manual test generates many false-positive outcomes because of human error. Various researchers have extensively investigated machine learning (ML) methods for classifying cervical Pap cells to improve on manual testing. The random forest method is the most popular method for identifying features in a high-dimensional cancer image dataset. However, it can become too slow and inefficient for real-time prediction when too many decision trees are used. To deal with this challenge, this research proposes an efficient feature selection and prediction model for cervical cancer datasets using Boruta analysis and the SVM method. Boruta analysis builds on the random forest method and identifies the feature subsets in the data source that are significant to the assigned classification task. The primary aim of the proposed model is to determine the importance of cervical cancer screening factors for classifying high-risk patients. This research work analyses cervical cancer and its various risk factors to help detect the disease. The proposed Boruta with SVM model and various popular ML models are implemented in Python and compared on several performance measures, i.e., accuracy, precision, F1-score, and recall. The proposed Boruta analysis with SVM outperforms the existing methods.
1. Introduction
According to a WHO survey, cervical cancer is probably the leading cancer affecting women in underdeveloped nations [1]. Despite the availability of medical centers, thousands of new cases were reported in the USA in 2016, compared with more than 20K deaths in 2014. The cervical cancer dataset comprises more than 800 samples, 32 attributes, and four target variables, and was reported in 2016-17. Essential features include demographic characteristics, smoking habits, and past health records. The many screening and diagnostic procedures, with their wide variety of outcomes, add to the complexity of the data. As a result, the key problem is to predict each patient's individual risk and to determine the optimum screening strategy. Various investigators have examined cervical cancer data collected from different sources [2]. The primary risk factors for cervical cancer are poor menstrual hygiene, adolescent pregnancy, smoking, and oral contraceptives. Healthcare datasets have more attributes and more incomplete data than nonmedical datasets. It is therefore essential to identify the significant and necessary attributes before constructing a quantitative model. ML techniques are strong at prediction and performance tuning, and they have been widely used in cancer research, including breast cancer research [3]. According to a study [4], long-term HPV infection is the primary cause of cervical cancer.
On the other hand, if diagnosed early and treated correctly, cervical cancer is the most curable type of cancer. The techniques mentioned above require considerable effort to process the information, and the low-level features they obtain cannot deliver optimal classification performance, which highlights the limitations of such learning approaches. An ML-based feature extraction approach offers significant advantages over other cancer detection algorithms in building an improved computer-aided diagnosis (CAD) framework. ML-based techniques achieve state-of-the-art results on complex computer vision tasks [5]. According to existing studies, most investigations of cervical precancerous disease classification focus on individual colposcopy images taken during acetic acid tests, which makes it challenging to diagnose cervical cancer. This article focuses on machine learning techniques that can predict the occurrence of cervical cancer as precisely as possible from a fixed set of potential risk factors for each woman. However, balancing recall and precision is a challenge when developing a prediction model. This research presents a prediction model that uses machine learning methods to detect cervical cancer. To deal with this challenge, it proposes an efficient feature selection and prediction model for cervical cancer datasets using Boruta analysis and the SVM method. SVM, random forest, decision tree, and Boruta methods are used to analyze the cervical cancer dataset. These methods support feature classification, regression, clustering, and survival analysis, as well as further modeling methods.
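The idea behind Boruta-style selection can be illustrated with a minimal sketch: each real feature is compared against shuffled "shadow" copies, and a feature is kept only if it beats the best shadow in most rounds. The sketch below is not the implementation used in this work. It assumes, for brevity, that the absolute Pearson correlation with the label stands in for the random forest importance that Boruta actually uses, and the column names, toy data, and function names are hypothetical.

```python
import random


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_feature_selection(columns, labels, n_rounds=20, min_hits=15, seed=0):
    """Keep features that beat the best shuffled (shadow) copy in most rounds.

    columns: dict mapping a feature name to a list of numeric values.
    Importance is |Pearson r| with the label (an assumption of this sketch);
    Boruta itself relies on random forest importances.
    """
    rng = random.Random(seed)
    importance = {name: abs_pearson(col, labels) for name, col in columns.items()}
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        shadow_scores = []
        for col in columns.values():
            shadow = col[:]          # copy, then destroy any link to the label
            rng.shuffle(shadow)
            shadow_scores.append(abs_pearson(shadow, labels))
        threshold = max(shadow_scores)
        for name in columns:
            if importance[name] > threshold:
                hits[name] += 1
    return [name for name in columns if hits[name] >= min_hits]


# Hypothetical toy data: one informative column and one uncorrelated column.
labels = [0, 0, 1, 1] * 10
columns = {
    "hpv_positive": [0, 0, 1, 1] * 10,
    "noise": [0, 1, 0, 1] * 10,
}
selected = shadow_feature_selection(columns, labels)
```

In a full pipeline, the selected columns would then be passed to the classifier (SVM in the proposed model); that step is omitted here.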
The research work [6] identifies accurate indicators in the UCI dataset that can act as powerful predictors of cervical cancer, together with a dependent variable that may be a function of these predictors, for visualizing and analyzing cancer trends. Multiple models may be built to find the indicators that help explain the dynamics of the various variables. The performance of the proposed model and of existing ML models is verified on an online cervical cancer dataset using Python and several performance measures, i.e., accuracy, precision, F1 score, and recall. This research aims to develop mathematical equations and apply Boruta analysis to distinguish two types of cervical cancer: (a) low-risk and (b) high-risk cancer. First, the cervical cancer dataset is identified and preprocessed; correlation analysis and Boruta analysis follow. Causal analysis is then carried out to identify the factors that contribute to cervical cancer. The workflow includes formulating hypotheses that are subsequently verified and validated against the results.
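The four performance measures named above follow directly from the counts of true and false positives and negatives. The following is a minimal sketch using only the Python standard library; the function name and the example labels are hypothetical and are not taken from the experiments in this work.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = high-risk, 0 = low-risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced screening data, accuracy alone can look high even when most positive cases are missed, which is why precision, recall, and F1 are reported alongside it.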
The rest of this paper is organized as follows: Section 1 introduces the cancer-related background. Section 2 reviews existing research and presents a comparative analysis of various methods for cancer research. Section 3 covers the materials and techniques, Section 4 covers the experiments and the analysis of results, and Section 5 presents the conclusion and future directions of the research.
2. Literature Review
This research presents a machine learning-based model for predicting cervical cancer at an early stage. This section reviews various machine learning models for earlier and more accurate cervical cancer detection. The review is divided into three subsections, based on risk factors, mathematical models, and machine learning methods.
2.1. Based on Risk Factors for Cervical Cancer
The “National Comprehensive Cancer Network” has emphasized the benefits of early detection of cervical cancer. In contrast, delayed treatment is the leading cause of the increasing mortality among women globally. As a result, numerous scientific and medical studies have examined the causes, symptoms, and methods of detecting and preventing cervical cancer. Researchers have also attempted to evaluate the risks that contribute to the pathogenesis and progression of this particular cancer. The selected research works are as follows.
According to the research article [7], cancer treatment has taken numerous forms over the years; total elimination may not even be possible, but the probability of occurrence can be reduced through prediction. Any disorder can be cured if identified in its early stages, and cancer likewise can be treated successfully if detected early. On the other hand, cervical cancer is hard to predict in its early stages because it produces no symptoms. Frequent testing is carried out to detect cancer cells because testing has been the only way to predict the disease [8]. According to [9], screening outcomes may at times be false positives, or they may be delayed. To reduce such uncertainties, machine learning has been adopted in health care services, and numerous methods, techniques, and technologies have been used to detect cancer cells more quickly and with a lower false-positive rate.
Mathematical modeling aids the understanding of an observed phenomenon. In health care [10], the observed phenomenon could be a set of symptoms or a disease, and the technique yields a workable characterization of complex systems. In the medical sciences, mathematical formulations have also been used in various ways to solve, reproduce, study, and explain biological mechanisms [11]. The research [12] proposes probabilistic mathematical models for cases in which sample sizes are limited and the parameters can be examined thoroughly. According to the researchers, if a healthcare system can be understood through comparisons, then that procedure should inform the mathematical framework [13]. For example, a model composed of three separate structures might be used to understand the amount of carbohydrates stored in the human body. Other researchers prefer descriptive computational methods. These models use a workable description of factors in statistical testing to describe realistic circumstances [14]. In social and epidemiological investigations, descriptive methods are essential. In most cases, the mean, median, standard deviation, variance, and other statistics are computed, and a report of the phenomenon is written up. Table 1 summarizes existing research work based on cancer risk factors.
| Article | Risk factors discussed | Important feature (age group) | Possible cancer types |
|---|---|---|---|
| [15] | Human papillomavirus (HPV) infection | 18-35 | Cervical cancer, breast cancer |
| [16] | Sexual history | Under 18 and above | Carcinoma, cervical cancer |
| [17] | Smoking | All age groups | Lung, cervical, and breast cancer |
| [18] | Weakened immune system | 30-60 | Carcinoma, cervical cancer |
| [19] | Chlamydia infection | All age groups | Carcinoma, cervical cancer |
| [20] | Long-term use of oral contraceptives (birth control pills) | 18-50 | Cervical cancer, lung |
| [21] | Several full-term pregnancies | 18-40 | Cervical cancer, lung |
| [22] | First full-term pregnancy at a young age | 25-60 | Cervical cancer, lung |
| [23] | A diet deficient in fruits and veggies | 22-56 | Cervical cancer, lung |
| [24] | Smoking and HPV | 11-60 | Cervical cancer, lung |
| [25] | Use of pills (pregnancy) | 22-45 | Cervical cancer, lung |
| [26] | Early pregnancy, HPV | 13-18 | Cervical cancer, lung |
| [27] | HPV and weaker immunity | 18-50 | Cervical cancer, lung |
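The descriptive measures mentioned in this subsection (mean, median, standard deviation, and variance) can be computed with Python's standard `statistics` module. The snippet below is only an illustrative sketch; the age values are hypothetical and do not come from any of the reviewed studies.

```python
import statistics

# Hypothetical patient ages, used only to illustrate the calculations.
ages = [18, 22, 25, 25, 31, 36, 42, 51]

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),        # sample standard deviation
    "variance": statistics.variance(ages),  # sample variance (n - 1 denominator)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

`stdev` and `variance` use the sample (n - 1) definitions; `pstdev` and `pvariance` give the population versions when the data cover the whole group of interest.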
2.2. Based on Mathematical Models
Further examination of cervical cancer research based on mathematical models indicates that large groups of investigators in the medical sciences concentrate on diagnostic models [28]. Experts in clinical prediction use a variety of strategies to construct models; statistical analysis and supervised learning are two examples. Specific healthcare computer models are referred to as “forms of modern.” These frameworks are built from basic logical reasoning, hypotheses, concepts, and descriptive analysis. Many researchers refer to such algorithms as medical condition recognition systems [29]. The researchers also used ML algorithms to predict serious health issues. Enzyme kinetics and pharmacokinetics are two important fields of medical research [30]. Machine learning algorithms and automated analyses are frequently used in several areas of medicine. Physiological responses and parameters such as stress level and heart rate must be recorded and modeled with time-series techniques to track medical conditions [31]. Modeling that aims to capture dynamic interactions uses an approach called transfer characteristics for a detailed view. This type of procedure keeps track of feedback and of the processes in between. Many researchers have examined the principal causes of such medical conditions while identifying and establishing the mathematical determinants.
Nevertheless, the main difficulty lies in identifying suitable factors that can describe the specific clinical paradigm or phenomenon, and in determining which independent variables may act as potential predictors and which variables can describe the entire mathematical formulation [32]. All of the clinical models presented so far depend on a fundamental grasp of mathematical model development. Based on the concerns and obstacles described in the present research, the next section sets out the framework of this work. Table 2 reviews cancer types by number of selected features and age group.
| Article | No. of features selected | Important feature (age group) | Possible cancer types |
|---|---|---|---|
| [33] | 13 parameters | 18-40 | Cancer type 1 and type 2 |
| [34] | 10 parameters | 20-50 | Cervical cancer, lung cancer |
| [35] | 12 parameters | 18-55 | Cervical cancer, skin cancer |
| [36] | 10 parameters | 18-45 | Cervical cancer type 3 |
| [37] | 15 parameters | 20-50 | Cervical cancer, breast cancer |
| [38] | 7 parameters | 18-30 | Cervical cancer, breast cancer |
| [39] | 10 parameters | 17-30 | Cervical cancer, lung cancer |
| [40] | 18 parameters | 14-60 | Cervical cancer, type 2 and 3 |
| [41] | 12 parameters | 15-55 | Cervical cancer, type 1 |
2.3. Based on Machine Learning Models
In this research, machine learning techniques are employed to detect cervical cancer accurately by constructing a framework informed by previous research in the same domain. Research [42] shows that oversampling can improve the performance of existing approaches. That work used random forest (RF) to build a classifier based on cervical cancer cases. The analysis indicates that, after applying SMOTE, RF significantly outperformed the same model without oversampling across all cervical cancer variables in terms of accuracy, specificity, precision, and true positive rate. The research [43] used the online UCI dataset with several strategies for cervical cancer diagnosis: (a) SVM, (b) SVM with PCA, and (c) SVM with RFE. That article concluded that SVM performs well and achieves better precision and diagnostic accuracy than multiple other classifiers.
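The oversampling step discussed above can be illustrated with a small SMOTE-style sketch: a synthetic minority sample is created by interpolating between a minority point and one of its nearest minority neighbours. This is a simplified illustration written with the standard library only, not the procedure used in [42]; the function name, parameters, and example points are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples: (age, number of pregnancies).
minority = [(25.0, 1.0), (30.0, 2.0), (28.0, 0.0), (35.0, 3.0)]
new_points = smote_like_oversample(minority, n_new=6)
```

Oversampling of this kind should be applied only to the training split; generating synthetic points before splitting leaks information into the test set and inflates the reported scores.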
Research [44] used three types of machine learning models to classify the UCI cervical cancer data. The proposed model used “border row hierarchical clustering” (BRHC) to deal with class imbalance in the dataset. That study observed that the XGBoost and random forest methods perform outstandingly in terms of cancer prediction accuracy. Because this cancer dataset contains many missing values, the missing attributes must be handled carefully. Research [45] offers four distinct methods to deal with missing values in the cancer dataset. These techniques are NOCB, LOCF, FVM, and NOCB. To predict the biopsy variable, the researchers used six algorithms: LR, RF, SVM, DT, NB, and NN [46]. They concluded that, when combined with the NOCB preprocessing step, SVM and LR reached the best accuracy, F1 measure, and precision. A private database was created using 472 survey questions from a health center in China, and each cancer patient who took the survey had a corresponding gene sequence dataset. This research collected its data from “Mexico’s Maggiore de Caracas health center.” This dataset contains data on 592 cancer patients with various attributes. The study applied pooling and discussed the difficulties associated with conventional cervical cancer diagnostics. Table 3 compares the reviewed research by ML method.
| Article | Technique used | Type of cancer | Important features discussed | Dataset used | Validation technique |
|---|---|---|---|---|---|
| [47] | Artificial neural network | Breast cancer | Age and mammography results | Diagnostic and pathological data | 10-fold cross-validation |
| [48] | Support vector machine | Multiple myeloma | STAT1, BRCA1, CCND1, and CCNB1 | Online UCI | 20-fold cross-validation |
| [49] | Random forest | Cervical cancer | Diet, eating habits, and BME | Clinical data | 10-fold cross-validation |
| [50] | BN methods | Lung cancer | BP, age, and other parameters | Kaggle online dataset | 10-fold cross-validation |
| [51] | SVM | Cervical cancer, breast cancer | Skin type, breast size, and skin color | Dataset from the hospital (China) | Clinical survey data |
| [52] | Boruta | Cervical cancer, lung and breast | Age, infection type | Clinical survey data | Cross-validation |
| [53] | SVM with random forest | Cervical cancer, lung cancer | BME | UCI online dataset | 10-fold cross-validation |
| [54] | K-NN, SVM | Cervical cancer | Age and mammography results | UCI dataset | 10-fold cross-validation |
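Two of the missing-value strategies named in this subsection, LOCF (last observation carried forward) and NOCB (next observation carried backward), can be sketched in a few lines. The code below is a minimal illustration with hypothetical function names and values, assuming that missing entries are represented by `None` and that the order of the values is meaningful.

```python
def locf(values):
    """Last observation carried forward: fill None with the previous known value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first known value appears
    return filled


def nocb(values):
    """Next observation carried backward: fill None with the next known value."""
    return locf(values[::-1])[::-1]


# Hypothetical column with missing entries.
series = [None, 2, None, None, 5, None]
forward = locf(series)
backward = nocb(series)
```

Note that LOCF cannot fill a gap at the start of a column and NOCB cannot fill one at the end, so a fallback such as a column-wide default is still needed for those positions.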
Machine learning approaches are used in this investigation to identify cervical cancer correctly by developing a framework informed by prior research methodologies in the same field. The publicly available UCI dataset on cervical cancer does not have pre-annotated rows that give a confirmatory signal about the presence or absence of cervical cancer. The dataset is intended to help understand the factors that influence a cervical cancer diagnosis.