Evaluating Academic Performance of Students Learning in Open University
MSSP 608 Practical Machine Learning Methods Final Project
By Jia Xu, Yamei Lu, and Yufei Qin
December 7, 2021
Our code is based on the Google Colaboratory platform and written in Python. (The linked Colab notebook may not be accessible to viewers in China.)
Introduction
Predicting students' academic performance at school using regression methods is not a new area of interest. Machine learning methods, however, are relatively new in this field, which has been flourishing in recent years. According to Ghorbani and Ghousi (2020), due to technological advancements, predicting students' performance at school is among the most beneficial and significant research topics nowadays. We therefore believe this is a meaningful area to focus on, and we decided to analyze the Open University Learning Analytics Dataset to study students' academic performance.
Rastrollo-Guerrero et al. (2020) examined nearly 70 papers in their review and concluded that the most widely used technique for predicting students' performance was supervised learning, as it gives accurate and credible results. In particular, SVM (Support Vector Machine), DT (Decision Tree), NB (Naïve Bayes), and RF (Random Forest) are well-studied algorithms that have produced good outcomes (Rastrollo-Guerrero et al., 2020). In this study, we therefore examine the Open University dataset using SVM, DT, Gradient Boost, and RF regression.
Primary Task
For the primary task, we use the Open University Learning Analytics Dataset for our analysis. This open dataset collection contains data from Open University's presented courses, student demographic data, and aggregated clickstream data of students' interactions with the Virtual Learning Environment (VLE). The collection consists of 7 separate tables covering 22 courses, 32,593 students with their individual assessment results, and their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries) (Kuzilek, Hlosta, & Zdrahal, 2017).
For our analysis, we focus mainly on students' performance in different courses (more specifically, the scores they earned in different classes) and forecast their future scores based on course features and student demographic features. Since we are predicting numeric scores, we apply several different regression methods and compare their prediction performance. Before applying the regression models, we pre-processed the data (deleting unused columns, calculating new features, replacing null values with reasonable numbers, and using techniques such as one-hot encoding to transform nominal features into numeric ones).
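As an illustration of this kind of pre-processing, below is a minimal pandas sketch. The column names come from the dataset, but the DataFrame name `students`, the input file, and the exact fill rules (e.g., filling `imd_band` with the overall mode rather than the per-region mode we describe later) are simplified placeholders for our actual pipeline.

```python
import pandas as pd

# Hypothetical merged table of student and course features (placeholder file name).
students = pd.read_csv("merged_student_data.csv")

# Replace null values with reasonable defaults (simplified here).
students["imd_band"] = students["imd_band"].fillna(students["imd_band"].mode()[0])
students["total_click"] = students["total_click"].fillna(0)

# One-hot encode nominal features so the regression models can use them.
nominal_cols = ["code_module", "code_presentation", "gender", "region",
                "highest_education", "imd_band", "age_band", "disability"]
students = pd.get_dummies(students, columns=nominal_cols)
```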
After pre-processing and merging the datasets, we chose the following features for regression:
Feature Name | Description |
---|---|
code_module | Code name of the module, which serves as the identifier. |
code_presentation | Code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October. |
gender | The student’s gender. |
region | Identifies the geographic region where the student lived while taking the module-presentation. |
highest_education | Highest student education level on entry to the module presentation. |
imd_band | Specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation. |
age_band | Band of the student’s age. |
num_of_prev_attempts | The number of times the student has attempted this module. |
studied_credits | The total number of credits for the modules the student is currently studying. |
disability | Indicates whether the student has declared a disability. |
date_registration | The date of the student's registration on the module presentation, measured as the number of days relative to the start of the module-presentation (e.g., -30 means the student registered 30 days before it started). |
module_presentation_length (length) | Length of the module-presentation in days. |
total_click (sum_click) | The number of times a student interacted with the VLE material on a given day. |
As we are applying regression methods, we judge model performance using the evaluation indexes below (Wu, 2021); a short sketch of how we compute them follows the table:
Evaluation Metric | Description |
---|---|
Adjusted R2 | Measures how much of the variability in the dependent variable can be explained by the model. Adjusted R2 also penalizes additional independent variables added to the model, adjusting the metric to prevent over-fitting. |
MSE | An absolute measure of goodness of fit; it quantifies how much the predicted results deviate from the actual values. |
RMSE | The square root of MSE. It brings the error back to the same units as the target, which makes it easier to interpret. |
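A minimal sketch (not our exact notebook code) of how these three metrics can be computed with scikit-learn and NumPy; the helper name `evaluate` and its arguments are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred, n_features):
    """Return MSE, RMSE, and adjusted R2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    # Adjusted R2 penalizes additional independent variables.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return mse, rmse, adj_r2
```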
Support Vector Regression
The multiple linear regression using all of our selected features performs poorly, as expected, with an R-square of around 0.35. We next apply machine-learning regression models to see whether our predictions improve.
Support Vector Regression does not minimize the squared error directly as linear regression does; instead it minimizes the magnitude of the coefficients while tolerating errors within a margin. Compared with the original multiple linear regression model, Support Vector Regression has an adjusted R-square of around 0.39. The training error (RMSE) of the two models is similar: approximately 23.18 for Support Vector Regression and about 23.91 for the multiple linear regression. Overall this is only a slight improvement, and we need to investigate better models.
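A sketch of fitting SVR with scikit-learn; `X_train`/`y_train` stand for the pre-processed features and scores, the kernel choice is an assumption (not necessarily the one in our notebook), and `evaluate` is the metric helper sketched earlier.

```python
from sklearn.svm import SVR

svr = SVR(kernel="rbf")  # assumed kernel; our notebook's settings may differ
svr.fit(X_train, y_train)
mse, rmse, adj_r2 = evaluate(y_train, svr.predict(X_train), X_train.shape[1])
```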
Decision Tree
Decision Trees are divided into classification and regression trees. Regression trees are needed when the response variable is numeric or continuous. In our case, we use regression trees because the target variable, the score, is numeric.
Unlike Support Vector Regression, our Decision Tree regression shows a significantly better result than the multiple linear regression. The training error (RMSE) for the Decision Tree model is around 17 and the adjusted R-square is around 0.67. Since the multiple linear regression explains only about 23% of the variation, this amounts to an increase of roughly 191% in adjusted R-square. The decision tree model is therefore doing a relatively good job.
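A sketch of the decision tree regression; the `max_depth` value is an illustrative assumption to limit over-fitting, not necessarily the notebook's setting.

```python
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=10, random_state=0)  # assumed depth limit
tree.fit(X_train, y_train)
mse, rmse, adj_r2 = evaluate(y_train, tree.predict(X_train), X_train.shape[1])
```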
Gradient Boost
Gradient Boost is a popular machine learning algorithm that builds its prediction step by step from an ensemble of weak predictive models; the ensembled model can perform better than any individual one. In our analysis, Gradient Boost can be considered the best model, with a relatively small training RMSE of 17.69 and a relatively small validation error (mean RMSE = 18.34, SD = 0.25). The adjusted R2 shows that the model can explain 64.5% of the total variance.
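A sketch of the gradient boosting fit; the hyper-parameters shown are scikit-learn defaults made explicit, which may differ from the values used in our notebook.

```python
from sklearn.ensemble import GradientBoostingRegressor

gboost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
gboost.fit(X_train, y_train)
mse, rmse, adj_r2 = evaluate(y_train, gboost.predict(X_train), X_train.shape[1])
```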
Random Forest
A random forest is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. The Random Forest regression results look strong: the training error (RMSE = 16.94) is not much lower than the validation error (mean RMSE = 18.67, SD = 0.34).
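A sketch of the random forest fit; `n_estimators` is an illustrative value, and averaging over many trees is what reduces variance relative to a single tree.

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, random_state=0)  # assumed tree count
forest.fit(X_train, y_train)
mse, rmse, adj_r2 = evaluate(y_train, forest.predict(X_train), X_train.shape[1])
```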
Ensemble Learning
To better determine which regression model performs best, we also applied an ensemble learning technique, which is widely accepted as a way of combining predictions from different regression models. To improve precision we chose the stacking ensemble method: a new linear regression model is fitted on the predictions of the four regression models, so that each individual model effectively receives its own weight. The resulting stacked linear regression model's intercept and coefficients are as below:
Intercept | Coefficient 1 (SVR) | Coefficient 2 (Decision Tree) | Coefficient 3 (Gradient Boost) | Coefficient 4 (Random Forest) |
---|---|---|---|---|
0.6063 | -0.1361 | 0.0377 | 0.7066 | 0.3873 |
We can therefore see that the predictions from the Gradient Boost model received the largest weight and those from the Random Forest model the second largest, which supports the conclusion from the primary task that Gradient Boost gives the best predictions.
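A minimal sketch of the stacking step described above; the variable names refer to the models fitted in the earlier sketches and are placeholders for our notebook's objects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each model's predictions become one meta-feature column.
meta_X = np.column_stack([
    svr.predict(X_train),
    tree.predict(X_train),
    gboost.predict(X_train),
    forest.predict(X_train),
])

# The meta-model's coefficients act as the weights assigned to the individual models.
stacker = LinearRegression().fit(meta_X, y_train)
print(stacker.intercept_, stacker.coef_)
```

In practice the meta-features are often generated from held-out (out-of-fold) predictions rather than from the training set itself, to avoid leaking the base models' training error into the meta-model.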
Best Regression Model Evaluation
To choose the best regression model, we look at how each model performed on the training set and in cross-validation.
Model Performance | Support Vector Regression | Decision Tree | Gradient Boost | Random Forest |
---|---|---|---|---|
Training RMSE | 23.176 | 17.048 | 17.687 | 16.944 |
Adjusted R2 | 0.391 | 0.670 | 0.645 | 0.674 |
Mean CV RMSE | 23.478 | 20.151 | 18.343 | 18.668 |
CV RMSE SD | 0.232 | 0.459 | 0.250 | 0.339 |
Based on the selected features and the ensemble learning result, the table shows that Gradient Boost is our best model, even though its training error without cross-validation is higher than that of the Random Forest. Random Forest had the lowest training RMSE without cross-validation, but it shows signs of over-fitting: its performance degraded in cross-validation and its SD is relatively high.
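The cross-validation figures in the table can be reproduced with a loop like the following sketch; the choice of 5 folds is an assumption and may differ from our notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

models = {"SVR": svr, "Decision Tree": tree, "Gradient Boost": gboost, "Random Forest": forest}
for name, model in models.items():
    # cross_val_score returns negative MSE per fold; flip the sign and take the root for RMSE.
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=5)
    rmse_folds = np.sqrt(-scores)
    print(f"{name}: mean RMSE = {rmse_folds.mean():.3f}, SD = {rmse_folds.std():.3f}")
```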
Extension Task
Task 1: Annotation Task
We summarize our annotation tasks in the table below; a pandas sketch of the weighted-score and late-submission computations follows the table.
No. | Annotation Task | Description |
---|---|---|
1 | Variable: Score | If the score is below 40 it is a fail; if it is between 40 and 100 (inclusive) it is a pass; if the student does not submit the assessment, no result is recorded. All null scores can be interpreted as non-submissions, so we fill them with zeros rather than treating them as passes, even though most submitted assessments do not fail. |
2 | New Variable: Weighted score | How it is calculated: multiply the weight of each assignment by its score; aggregate weight × score per module presentation with the sum function; calculate the total recorded weight of the module; then divide the summed weight × score by the total recorded weight of the module. |
3 | New Variable: Late submission | The rate of late submission among the assignments the student did submit. How it is calculated: compute the difference between the deadline and the actual submission date; add a new column marking a submission as late if it was submitted after the deadline; then aggregate by student ID, module, and module presentation. |
4 | Merge dataframes | VLE + VLE materials = total_click_per_student. We merge these two tables with an inner merge, since resources with no activity for any student provide zero information; we drop the week_from, week_to, and date columns. Registration Info + Courses + Student Info = regCoursesInfo. We inner merge these three tables on code_module, code_presentation, and id_student. Assessments + Results = assessments. We inner merge these two tables on id_assessment. |
5 | Missing values | **IMD band**: fill according to the most frequent band for that region. **Date registration**: for withdrawn students, fill by subtracting the median value from the registration date. **total_click**: replace with 0 (the student did not interact with the material). **weighted_score**: replace NaN values with 0 (the student made no submissions). **late_rate**: replace NaN values with 1.00 (a 100% late rate). **fail_rate**: replace with 1.0 (100%, meaning no submissions were made). |
6 | Drop columns | The is_banked column is dropped along with date_submitted and assessment_type. |
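Below is a pandas sketch of the weighted-score and late-submission computations from rows 2 and 3 above. The DataFrame name `assessments` stands for the merged assessments/results table; column names such as `weight`, `score`, `date` (the deadline), and `date_submitted` come from the dataset, while the derived column names are illustrative.

```python
import pandas as pd

# assessments: merged assessments + results table (placeholder name).
assessments["weight_x_score"] = assessments["weight"] * assessments["score"]
# A submission is late if it arrived after the assessment deadline ("date").
assessments["is_late"] = assessments["date_submitted"] > assessments["date"]

grouped = assessments.groupby(["id_student", "code_module", "code_presentation"])
weighted_score = grouped["weight_x_score"].sum() / grouped["weight"].sum()
late_rate = grouped["is_late"].mean()

summary = pd.DataFrame({"weighted_score": weighted_score,
                        "late_rate": late_rate}).reset_index()
```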
Task 2: Fairness Audit
Fairness has received much attention recently, and features with sensitive characteristics may also affect a model's predictions. According to reports from UK universities, the degree-awarding gap for Black, Asian, and Minority Ethnic (BAME) students is 13 percent, with similar effects seen when comparing students across other protected attributes such as gender or disability (Bayer et al., 2021). Demographic information such as age, ethnicity, nationality, and religious belief is typically treated as sensitive in most circumstances.
In the Open University dataset, we found the following features suspicious and worth additional analysis: gender, region, highest_education, imd_band, age_band, and disability. Based on the previous regression results, we separated our features into three groups: group 1 contains only the fairness-related features listed above, group 2 contains only the fairness-unrelated features, and group 3 is the original feature set containing all features. We then fitted three separate Gradient Boost models, one per feature group (we chose Gradient Boost because it proved to be the best-performing regression model in our analysis). The results are shown below, and a short code sketch follows the discussion:
 | Group 1 (Fairness-related) | Group 2 (Fairness-unrelated) | Group 3 (Complete) |
---|---|---|---|
RMSE | 29.462 | 18.119 | 18.382 |
Adjusted R2 | 0.023 | 0.630 | 0.619 |
Our results show that removing fairness-related features from the dataset very slightly improves model performance (the adjusted R2 increases slightly while the RMSE decreases slightly). Because the change is so subtle, we cannot conclude that the fairness-related features have a meaningful impact on our model. Nevertheless, it is always rigorous to consider these aspects and adjust the model so that it is more realistic and predicts more precisely.
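A sketch of the fairness comparison under the assumption that the feature matrices `X_fair`, `X_other`, and `X_train` (fairness-related only, fairness-unrelated only, and all features) have already been prepared; `evaluate` is the metric helper sketched earlier.

```python
from sklearn.ensemble import GradientBoostingRegressor

feature_groups = {
    "Group 1 (fairness-related)": X_fair,
    "Group 2 (fairness-unrelated)": X_other,
    "Group 3 (complete)": X_train,
}
for name, X in feature_groups.items():
    model = GradientBoostingRegressor(random_state=0).fit(X, y_train)
    mse, rmse, adj_r2 = evaluate(y_train, model.predict(X), X.shape[1])
    print(f"{name}: RMSE = {rmse:.3f}, adjusted R2 = {adj_r2:.3f}")
```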
Task 3: Literature Review
Using machine learning models to predict students' academic performance has profound implications. Dabhade et al. (2021) maintained that these machine learning algorithms can help provide quality advice to educational institutions so that students can enhance their academic performance.
Various techniques have been used worldwide in this area. In India, Haridas et al. (2020) adopted machine learning models to predict student performance in an intelligent tutoring system called AmritaITS, and concluded that the prediction models for summative assessments were considerably improved by formative assessment scores and AmritaITS logs (Haridas et al., 2020). In addition, Chui et al. (2020) adopted an improved conditional generative adversarial network-based deep support vector machine (ICGAN-DSVM) algorithm to predict students' performance under supportive learning via school and family tutoring.
In our models, the Decision Tree and Random Forest prediction models produced relatively good results for predicting students' performance. This finding is consistent with others in the field. In their research on a machine learning model to predict the performance of university students, Canagareddy et al. (2019) concluded that Random Forest was their best prediction model. Decision tree models have also ranked among the first-tier models for predicting students' final performance at an early stage (Tanuar et al., 2018).
References
Bayer, V., Hlosta, M., & Fernandez, M. (2021, June). Learning Analytics and Fairness: Do Existing Algorithms Serve Everyone Equally?. In International Conference on Artificial Intelligence in Education (pp. 71-75). Springer, Cham.
Canagareddy, D., Subarayadu, K., & Hurbungs, V. (2019). A machine learning model to predict the performance of university students. In P. Fleming, B. Lacquet, S. Sanei, K. Deb, & A. Jakobsson (Eds.), Smart and Sustainable Engineering for Next Generation Applications. ELECOM 2018. Lecture Notes in Electrical Engineering, vol. 561. Springer, Cham. https://doi.org/10.1007/978-3-030-18240-3_29
Chui, K. T., Liu, R. W., Zhao, M., & De Pablos, P. O. (2020). Predicting students’ performance with school and family tutoring using generative adversarial network-based deep support vector machine. IEEE Access, 8, 86745–86752. https://doi.org/10.1109/access.2020.2992869
Dabhade, P., Agarwal, R., Alameen, K. P., Fathima, A. T., Sridharan, R., & Gopakumar, G. (2021). Educational data mining for predicting students’ academic performance using machine learning algorithms. Materials Today: Proceedings, 47, 5260–5267. https://doi.org/10.1016/j.matpr.2021.05.646
Ghorbani, R., & Ghousi, R. (2020). Comparing different resampling methods in predicting students’ performance using Machine Learning Techniques. IEEE Access, 8, 67899– 67911. https://doi.org/10.1109/access.2020.2986809
Haridas, M., Gutjahr, G., Raman, R., Ramaraju, R., & Nedungadi, P. (2020). Predicting school performance and early risk of failure from an intelligent tutoring system. Education and Information Technologies, 25(5), 3995–4013. https://doi.org/10.1007/s10639-020-10144-0
Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4, 170171. https://doi.org/10.1038/sdata.2017.171
Rastrollo-Guerrero, J. L., Gómez-Pulido, J. A., & Durán-Domínguez, A. (2020). Analyzing and predicting students’ performance by means of Machine Learning: A Review. Applied Sciences, 10(3), 1042. https://doi.org/10.3390/app10031042
Tanuar, E., Heryadi, Y., Lukas, Abbas, B. S., & Gaol, F. L. (2018). Using machine learning techniques to earlier predict student’s performance. 2018 Indonesian Association for Pattern Recognition International Conference (INAPR). https://doi.org/10.1109/inapr.2018.8626856
Wikipedia contributors. (2021, September 12). Protected group. In Wikipedia, The Free Encyclopedia. Retrieved 00:28, November 30, 2021, from https://en.wikipedia.org/w/index.php?title=Protected_group&oldid=1043946666
Wu, S. (2021, June 5). What are the best metrics to evaluate your regression model? Medium. Retrieved December 17, 2021, from https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b