LUNG CANCER RELAPSE PREDICTION USING PARALLEL XGBOOST

Lung cancer has been the most common form of cancer for decades. Surgery offers non-small cell lung cancer (NSCLC) patients the best hope of a cure if the cancer is diagnosed at an early stage. However, many patients eventually die of their disease due to relapse after surgery. Because lung cancer shows no symptoms in its early stage, many researchers have tried to improve methods for predicting lung cancer relapse early. This study proposes a method to predict lung cancer relapse more accurately. The method has three stages: feature selection, parallel eXtreme Gradient Boosting (XGBoost) classification with different hyperparameters, and a selection stage. It uses two gene expression microarray datasets for different lung cancer types, together with their clinical information. The accuracy of the proposed model is 0.88 and 0.83 on the two datasets, which is higher than that of the compared machine learning models. The multi-model construction of the parallel XGBoost gives the system the flexibility to deal with a broader range of datasets without hyperparameter tuning and within a short time.

Mary Adline Priya [17] proposed an automatic approach to classifying a lung image as a normal or cancer case by removing noise from the CT lung image. Histogram analysis is then paired with morphological analysis, and lung regions are derived using thresholding operations. The study of Adeola O. [18] used a clinical database to classify whether a patient has chronic kidney disease using XGBoost. In a previous study [19], we compared several current machine learning models and found that XGBoost is the most accurate on both balanced and imbalanced datasets. This study improves XGBoost by applying a Parallel XGBoost (PXGB) with different hyperparameters to increase the system's variety and decrease overfitting. The PXGB shows more accurate prediction values for the relapse and no-relapse states.

II. XGBOOST ALGORITHM
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses the gradient boosting framework; see Fig. 1. It was developed by Tianqi Chen and Carlos Guestrin, who introduced their work at the SIGKDD conference in 2016 [20]. It provides parallel tree boosting that solves many data science problems quickly and accurately. In addition, it offers a range of hyperparameters that give fine-grained control over the model training procedure.

Figure 1: XGBoost model structure [21]
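As a brief, hedged illustration of the algorithm (not the exact configuration used in this study; the data, split, and hyperparameter values below are placeholders), a minimal XGBoost classification sketch with the xgboost Python package:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Toy stand-in data: 200 cases, 50 features, binary labels
# (0 = no-relapse, 1 = relapse)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A single XGBoost classifier; these hyperparameter values are
# illustrative, not the settings of Table I
model = XGBClassifier(n_estimators=100, max_depth=3,
                      learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```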
III. LUNG CANCER DATASETS
The datasets used in this study are microarray files. The data gathered through microarrays represent gene expression profiles, which display changes in the expression of several genes simultaneously in response to a given disease or therapy. Thus, they represent the molecular-level state of the cell [6]. This study applied the proposed model to two microarray datasets. Both datasets were downloaded from the National Center for Biotechnology Information (NCBI) website.

A. Dataset Information
This study used two gene expression microarray datasets with clinical information. The first dataset (GSE8894) covers 138 NSCLC cases; 3 cases lack complete clinical information, leaving 135 cases: 67 relapse cases and 68 non-relapse cases, each with clinical information [22]. The second dataset (GSE68465) also contains clinical information and gene expression data for 442 cases; after removing the cases with incomplete information, 362 cases remain: 205 relapse cases and 157 non-relapse cases [14].
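Both series are publicly available from NCBI GEO. As a hedged illustration (the paper does not state which tool it used to retrieve them), one way to download a series with the third-party GEOparse Python package:

```python
import GEOparse

# Download the GSE8894 series from NCBI GEO (cached in ./data)
gse = GEOparse.get_GEO(geo="GSE8894", destdir="./data")

# Expression matrix: probes as rows, samples as columns
expression = gse.pivot_samples("VALUE")

# Clinical/phenotype annotations for each sample
clinical = gse.phenotype_data
print(expression.shape, clinical.shape)
```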

B. Data Pre-Processing
In biological data, it is crucial to clean the data to improve its quality for searching and analysis. To do this, a process detects and removes corrupt or inaccurate records from the database. Each record with missing data must be deleted because it is regarded as an irrelevant case and causes inappropriate learning results. The XGBoost classifier expects a numeric representation of the decision class, whereas the classes in the lung cancer datasets have a nominal representation (non-relapse / relapse). Therefore, they must be converted to a numeric representation (0/1).
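A minimal sketch of this pre-processing step, assuming the clinical information is held in a pandas DataFrame with a hypothetical `relapse` column (the column names and values are placeholders):

```python
import pandas as pd

# Hypothetical clinical table; column and label names are placeholders
df = pd.DataFrame({
    "age":     [63, 58, None, 71],
    "relapse": ["relapse", "non-relapse", "relapse", "non-relapse"],
})

# Remove corrupt/incomplete records: drop any row with missing values
df = df.dropna()

# Convert the nominal decision class to a numeric representation (0/1)
df["relapse"] = df["relapse"].map({"non-relapse": 0, "relapse": 1})
print(df)
```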

IV. THE PROPOSED METHOD
Decision-tree-based algorithms are preferred for small to medium-sized structured/tabular data [19]. In our case, XGBoost learned some datasets with high accuracy but was less accurate on others because of its firm reliance on its hyperparameter settings. This study develops the XGBoost structure to accommodate a broader range of datasets without changing its hyperparameter tuning. This method is called the PXGB. It has three stages: the feature selection stage, the parallel XGBoost stage, and the selection stage, as shown in Fig. 2.

The feature selection stage: this stage uses the XGBoost module to rank the features by their importance to the prediction. In XGBoost, after the boosted trees are constructed, each feature's importance is calculated based on how valuable the feature was in constructing the boosted trees. The more often a feature is used in the construction, the higher its importance score. Thus, the importance score reflects how valuable or helpful a feature is in constructing the trees. The importance-score algorithm calculates the importance within each decision tree by counting how often each feature is used as a splitting point that improves the performance measure, weighted by the number of observations the node is responsible for. The feature importances are then averaged across all decision trees within the model [17]. In this paper, the importance-score threshold was set to zero, so every feature with a score above zero, i.e., every feature affecting tree construction, is kept. Each attribute at or below this threshold is discarded because it has no importance score. The GSE68465 dataset had 22283 features before the selection stage and 356 features after it; likewise, the GSE8894 dataset had 54675 features, reduced to 114 features after selection.

The parallel XGBoost stage: after the feature selection stage, a New Dataset (NDS) containing only the effective features is used. The NDS is split into 70% training data and 30% testing data, and the training data is then fed into every XGBoost simultaneously. Each XGBoost has its own hyperparameter set, different from the others (shown in Table I); these hyperparameter sets range from the most common values that may cause overfitting to the most common values that may cause underfitting. This choice yields different XGBoost structures, making the method flexible enough to deal with various cases and datasets. All the XGBoosts work in parallel to avoid extra overhead delays in learning time. The testing data is then applied to all the XGBoosts simultaneously, so that each XGBoost model produces its own probability predictions for the lung cancer relapse and no-relapse classes.

The selection stage: the final prediction for each case is taken from the XGBoost output with the maximum probability.
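Putting the stages together, the following is a minimal sketch of the PXGB pipeline. The importance threshold (> 0), the 70/30 split, and the maximum-probability selection follow the description above; the toy data, the hyperparameter sets (stand-ins for Table I, which is not reproduced here), and the exact form of the selection rule are assumptions:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Placeholder microarray data: rows = cases, columns = gene expression features
rng = np.random.default_rng(0)
X = rng.normal(size=(135, 2000))
y = rng.integers(0, 2, size=135)   # 0 = no-relapse, 1 = relapse

# Stage 1 - feature selection: keep every feature with importance score > 0
ranker = XGBClassifier(n_estimators=100, eval_metric="logloss").fit(X, y)
keep = ranker.feature_importances_ > 0
nds = X[:, keep]                   # the New Dataset (NDS) of effective features

# Stage 2 - parallel XGBoost: 70/30 split, one model per hyperparameter set
X_tr, X_te, y_tr, y_te = train_test_split(nds, y, test_size=0.3, random_state=0)
param_sets = [                     # illustrative stand-ins for Table I
    {"max_depth": 10, "learning_rate": 0.3},   # prone to overfitting
    {"max_depth": 4,  "learning_rate": 0.1},
    {"max_depth": 2,  "learning_rate": 0.01},  # prone to underfitting
]

def fit_one(params):
    return XGBClassifier(eval_metric="logloss", **params).fit(X_tr, y_tr)

models = Parallel(n_jobs=-1)(delayed(fit_one)(p) for p in param_sets)

# Stage 3 - selection: per case, follow the model whose class probability is
# highest across all models (one plausible reading of the maximum-probability
# rule described above)
probs = np.stack([m.predict_proba(X_te) for m in models])  # (model, case, class)
best_model = probs.max(axis=2).argmax(axis=0)              # most confident model
y_pred = probs.argmax(axis=2)[best_model, np.arange(len(X_te))]
```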

A. XGBoost Hyperparameters Setting
The PXGB sets the hyperparameters of all the XGBoosts as shown in Table I, while the original XGBoost, SVM, gcForest, KNN, and Naive Bayes models each have their own particular settings, as shown in Table II.

B. The Comparison of Different Classifiers
Applying the PXGB and the other machine learning models to the lung cancer datasets yields different prediction results for lung cancer relapse probability. The predictions fall into four categories:
TP: True Positive, which in this study means the correct prediction of lung cancer relapse.
TN: True Negative, which means the correct prediction of lung cancer no-relapse.
FP: False Positive, which means a false prediction of lung cancer relapse, while it is a no-relapse case.
FN: False Negative, which means a false prediction of lung cancer no-relapse, while it is a relapse case.
The metrics used in this research to compare and analyze the efficiency of the machine learning models are:
• Sensitivity: the true positive rate, also called recall. Sensitivity = TP / (TP + FN)
• Specificity: the true negative rate. Specificity = TN / (TN + FP)
• Precision: the truly detected lung cancer relapses divided by all true and false detections of lung cancer relapse. Precision = TP / (TP + FP)
• F1-score: the harmonic mean of precision and sensitivity; it can be used as a measure of the test's performance on the positive class. F1-score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
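A minimal sketch of computing these metrics from a confusion matrix with scikit-learn (the label vectors below are placeholders):

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels: 1 = relapse, 0 = no-relapse
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
precision   = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} F1={f1:.2f}")
```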

C. Analyzing Metrics
This study computes the sensitivity, specificity, precision, F1-score, AUC, ROC, accuracy, standard deviation, and learning time for each machine learning model used in this study to evaluate its effectiveness. When the PXGB is compared with the original XGBoost, all of its metrics are better. That is because the PXGB bases its prediction on different XGBoost structures that yield different probabilities for each case; in the selection stage, it then chooses the maximum probability, which makes this algorithm more accurate than the original one and flexible in dealing with different datasets. The learning times of the PXGB are 9 and 16 seconds for the two datasets, which is better than the original XGBoost (10 and 17 seconds, respectively). That is because the feature selection stage minimizes the number of features, which speeds up the learning stage, and although the PXGB contains multiple XGBoosts, they run in parallel and therefore add no extra overhead to the learning time.

Next, the results of the PXGB are analyzed against the other machine learning models. The PXGB achieves the highest metric values among the compared models when applied to the GSE8894 and GSE68465 datasets (as shown in Table III), except for the learning time. Naive Bayes completed the learning stage in 1 second for both datasets, while the PXGB completed it in 9 seconds for GSE8894 and 16 seconds for GSE68465, which is still an acceptable value. The standard deviation values reported in Table III were taken from five runs of all the models with different data in the learning stage at each run. The PXGB's standard deviation is lower than that of most of the other models, which indicates that the PXGB model is reliable even when dealing with different data. Furthermore, the results are illustrated as histograms in Figs. 5 to 12.

Each XGBoost in the PXGB has different hyperparameters to obtain various tree structures. This variance in the hyperparameter settings makes one or more of the XGBoosts well suited to a wide range of datasets, which gives the model the flexibility to be applied to different datasets with good prediction. Using the XGBoost algorithm for feature selection admits only the active features into the learning process, which improves the accuracy and speeds up the learning time.
The results showed that the PXGB achieved better prediction accuracy than the other comparative machine learning models. It also provided higher accuracy than the original XGBoost because it builds multiple XGBoost structures in its learning stage, allowing it to deal with different datasets without tuning the hyperparameters, by letting them produce different probability values and choosing the highest one. Moreover, the system learned within a shorter time than the original. Furthermore, its small standard deviation value means it has stable, reliable accuracy even when dealing with different datasets.