COMPOSITE CLASSIFIER MODEL BASED ON EXTREME GRADIENT BOOSTING FOR EARLY DETECTION OF CERVICAL CANCER
Keywords:
Cervical Cancer, Composite Classifier, XGBoost, Pipeline, Machine Learning
Abstract:
Over the past few decades, cervical cancer has become one of the most common health issues affecting women worldwide, particularly in developing countries. Cervical cancer starts in the cervix, usually in the cells on its surface. It shows no noticeable signs when it first appears and can be deadly. However, if detected at an early stage, cervical cancer is treatable and curable. The sharp rise in cervical cancer cases and in deaths resulting from late detection is the motivation behind this study. This study aims to develop a model that helps women estimate their risk of cervical cancer based on their demographic information and medical history. The dataset used in this study was collected at the Hospital Universitario de Caracas and obtained from the UCI (UC Irvine) Machine Learning Repository. It consists of the demographic information, living habits, and medical history of 858 patients. An object-oriented analysis and design methodology was adopted for modularity and system design. A pipeline was implemented to build a composite classifier as a chain of transformer, sampler, and estimator, aiming to prevent overfitting and data leakage and to enhance the proposed model's performance. To build the composite classifier, StandardScaler was used as the transformer, SMOTETomek as the sampler, and the XGBoost ensemble learning classifier as the estimator. A conventional XGBoost classifier was trained to identify the top 12 features that most influenced the performance of the classification model. Accuracy, precision, recall, and F1-score were used to evaluate the model’s performance. The proposed model correctly identifies 100% of women at risk of developing cervical cancer, with an accuracy, precision, recall, and F1-score of 99%, 89%, 100%, and 94%, respectively.
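As a quick consistency check on the four reported figures, the sketch below assumes that 89% is the precision, 100% the recall, and 94% the F1-score, and recomputes the F1-score as the harmonic mean of precision and recall. The function names are illustrative and are not taken from the study's code.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_negative_rate(recall: float) -> float:
    """Return the share of truly at-risk cases the model misses (type II errors)."""
    return 1.0 - recall


# Values as reported in the abstract (assumed mapping: precision, recall).
reported_precision = 0.89
reported_recall = 1.00

f1 = f1_from_precision_recall(reported_precision, reported_recall)
fnr = false_negative_rate(reported_recall)

print(f"F1-score: {f1:.2f}")
print(f"False-negative (type II) rate: {fnr:.2f}")
```

A recall of 100% corresponds to a false-negative rate of zero on the evaluation data, which is what the claim about type II error refers to.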
Based on these results, the proposed model, which combines the SMOTETomek hybrid sampling technique with the XGBoost ensemble learning classifier as the estimator in a pipeline, demonstrates a significant improvement in identifying women at risk of developing cervical cancer, thereby greatly reducing type II errors (false negatives), in which at-risk patients are misclassified as not at risk.