Prediction of Undergraduate Student’s Completion Status using MissForest Imputation in Random Forest and XGBoost Models
Date
2021
Author
Nirmala, Intan
Wijayanto, Hari
Notodiputro, Khairil Anwar
Abstract
All tertiary education institutions in Indonesia are required to report data on their education process to the Higher Education Database (PDDikti) of the Ministry of Education, Culture, Research, and Technology. One of the reported items is each student's completion status: graduated or dropped out. The number of higher education graduates is calculated from this completion status. However, many undergraduate students have reached the maximum length of study while their completion status remains unknown, which makes it difficult to calculate the actual number of graduates.
Student completion status can be predicted with classification models. In this study, the unknown completion status of undergraduate students was predicted with two tree ensemble methods, Random Forest and XGBoost. The data were incomplete: 20.9% of the total data was missing. Missing data can bias the parameter estimates of an analysis, so this study used MissForest imputation, which is based on the Random Forest algorithm, to handle the missing values. For comparison, Mean/Mode and Median/Mode imputation were also carried out.
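The two baseline imputations can be illustrated with a minimal sketch. It assumes the usual reading of the names: numeric variables are filled with the mean (or median) of the observed values, and categorical variables with the mode. The function `impute_column` and the toy columns `gpa` and `entry_path` are hypothetical and are not taken from the study's data. MissForest differs from these baselines in that it iteratively predicts each incomplete variable from the other variables with a Random Forest, which is not reproduced here.

```python
from statistics import mean, median, mode


def impute_column(values, strategy="mean"):
    """Fill None entries in a single column.

    Numeric columns are filled with the mean or median of the observed
    values; any other column is treated as categorical and filled with
    the mode, whatever the strategy.
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in observed
    )
    if numeric:
        fill = mean(observed) if strategy == "mean" else median(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns; None marks a missing value.
gpa = [2.0, None, 2.5, 4.5, None]
entry_path = ["regular", "transfer", None, "regular", "regular"]

gpa_mean = impute_column(gpa, "mean")      # Mean/Mode, numeric part
gpa_median = impute_column(gpa, "median")  # Median/Mode, numeric part
path_mode = impute_column(entry_path)      # categorical part uses the mode
```

Both baselines fill every gap in a column with the same constant, so they ignore relationships between variables. That is one plausible reason a model-based imputation could perform better.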
This study aims to evaluate the performance of MissForest, Mean/Mode, and Median/Mode imputation in handling missing data when predicting the completion status of undergraduate students who have reached the maximum length of study. The best model was also selected between Random Forest and XGBoost. Five performance measures were used: accuracy, sensitivity, specificity, G-Mean, and AUC. The classification models built on MissForest-imputed data were significantly superior to those built on the other two imputations. Regardless of the imputation method used, the performance of Random Forest and XGBoost differed significantly. On the data imputed by MissForest, the average performance of XGBoost was better than that of Random Forest.
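Four of the five measures can be computed directly from a confusion matrix, as in the sketch below. It assumes the standard definitions, with G-Mean taken as the geometric mean of sensitivity and specificity. The abstract does not state which class is treated as positive, and the counts are made up for illustration. AUC is omitted because it requires predicted scores rather than class counts.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity, and G-Mean.

    tp/fn count the positive-class cases predicted correctly/incorrectly;
    tn/fp do the same for the negative class.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": sqrt(sensitivity * specificity),
    }


# Hypothetical counts, not results from the study.
metrics = classification_metrics(tp=80, fn=20, tn=180, fp=20)
```

G-Mean is low whenever either class is predicted poorly, so it is informative when the two classes are of unequal size and accuracy alone could be misleading.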
The best model was XGBoost with MissForest imputation, which had an average accuracy of 94.42%, sensitivity of 90.21%, specificity of 95.67%, G-Mean of 92.90%, and AUC of 97.77%. This model was used to predict the unknown completion status of undergraduate students who had reached the maximum length of study. The predictions showed that 62.1% of the 1,502 students were dropouts, and the remainder were graduates.
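As a quick consistency check on these figures, assuming G-Mean is defined as the geometric mean of sensitivity and specificity, the reported averages agree with one another. An average of per-run G-Means need not equal the G-Mean of the averaged rates, so the agreement is approximate. The head counts below are approximations as well, because the reported percentage is rounded.

```latex
\[
\text{G-Mean} = \sqrt{\text{sensitivity} \times \text{specificity}}
             = \sqrt{0.9021 \times 0.9567} \approx 0.9290
\]
\[
0.621 \times 1502 \approx 933 \ \text{predicted dropouts}, \qquad
1502 - 933 = 569 \ \text{predicted graduates}
\]
```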