Mohammad Alnassar, Fatema, Blackwell, Tim, Homayounvala, Elaheh and Yee-king, Matthew (2026) Addressing class imbalance in predicting student performance using SMOTE and GAN techniques. Applied Sciences. ISSN 2076-3417 (In Press)
Virtual Learning Environments (VLEs) have become increasingly popular in education, particularly with the rise of remote learning during the COVID-19 pandemic. Assessing student performance in VLEs is challenging, and accurate prediction of final results is of great interest to educational institutions. Machine learning classification models have been shown to be effective in predicting student performance, but their accuracy depends on the dataset’s size, diversity, quality, and feature types. Class imbalance is common in educational datasets, yet little research addresses it in the context of predicting student performance. In this paper, we present an experimental design that addresses class imbalance in predicting student performance by using the Synthetic Minority Over-sampling Technique (SMOTE) and Generative Adversarial Networks (GANs). We compared the classification performance of seven machine learning models (Multi-Layer Perceptron (MLP), Decision Trees (DT), Random Forests (RF), Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC)) across different dataset combinations. Our results show that SMOTE techniques can improve model performance and that GAN models can generate useful simulated data for classification tasks. Among the SMOTE resampling methods, SMOTE NN produced the strongest performance for the RF model, achieving a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) of 0.96 and a Type II error rate of 8%. In the generative data experiments, the XGBoost model performed best when trained on the GAN-generated dataset balanced using SMOTE NN, attaining a ROC AUC of 0.97 and a reduced Type II error rate of 3%. These results indicate that combining class balancing techniques with generative synthetic data augmentation can enhance the prediction of student outcomes.
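As a rough illustration of the two ideas the abstract relies on, the following minimal sketch shows the basic SMOTE interpolation step and the computation of a Type II error rate (false negatives divided by all actual positives). It is written in plain Python under stated assumptions and is not the authors' implementation: the function names `smote_oversample` and `type_ii_error_rate`, the toy data, and the choice of which class counts as "positive" are all hypothetical, and neither the SMOTE NN variant nor the GAN-based generation reported above is covered.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points (basic SMOTE idea).

    Assumes `minority` holds at least two distinct numeric tuples.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def type_ii_error_rate(y_true, y_pred, positive=1):
    """False negative rate: FN / (FN + TP)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return fn / (fn + tp) if (fn + tp) else 0.0


# Toy data (hypothetical): 4 minority points versus 10 majority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority_count = 10
new_points = smote_oversample(minority, majority_count - len(minority), k=2)
balanced_minority = minority + new_points  # now matches the majority size
```

Each synthetic point lies on a line segment between two existing minority samples, which is why SMOTE adds variety without simply duplicating rows. In practice a maintained library implementation would normally be preferred over a hand-written one like this.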