Enhancing Multivariate Linear Mixed Model with Two-Level Random Effects for Longitudinal Data and Its Application to PISA and National Examination Scores
Date
2024
Author
Santi, Vera Maya
Notodiputro, Khairil Anwar
Indahwati
Sartono, Bagus
Abstract
Extending cross-sectional data analysis to multivariate responses becomes essential when examining interconnected and multidimensional phenomena. Including multivariate responses enables researchers to examine relationships and dependencies among several variables concurrently. The need to analyze multivariate responses in cross-sectional data arises when researchers seek to understand complex patterns, dependencies, and interactions among various factors within a specific population.
In practice, observations are frequently taken multiple times on the same subject, a design commonly known as repeated measurements. Repeated measurements entail evaluating a research subject at different points in time. Several researchers have devised methods for analyzing such longitudinal data. The linear mixed model (LMM) is the most prevalent approach.
Another challenge in longitudinal data analysis arises from the presence of multivariate response variables, where multiple responses are observed in the same survey. One approach is to model them simultaneously using a multivariate linear mixed model (MLMM). The complexity of real-world data sets increases further when the random effects are multilevel or hierarchical. Substantive research questions in longitudinal data structures with multivariate responses and multilevel random effects often concern issues such as interrelated changes between response variables.
An example of hierarchical data with multiple observations is the PISA scores and national exam (UN) data for high school (SMA) students. These assessments are conducted repeatedly, which makes it possible to observe the trend of average PISA scores of Indonesian students and of average national exam scores at the high school level. PISA score data come from a survey conducted by the OECD and are designed to include a representative sample of the target population. There are three test subjects in the OECD survey: reading literacy, mathematics, and science. Meanwhile, the average SMA national exam score data are sourced from the Basic Education Data (DAPODIK).
Cross-sectional studies have been conducted to analyze PISA score data. In this case, it is crucial to capture whether there is any association between students' PISA scores across the three test subjects, and whether the school contributes as a random effect. Longitudinal studies have also been conducted, applied to the average national exam (UN) scores of high school students majoring in science in West Java province. In this longitudinal study, it is essential to capture trends over time in the average UN scores for six subjects: Indonesian language, English language, mathematics, physics, chemistry, and biology. Another important aspect is to ascertain whether schools contribute as a random effect to the response variables.
Empirical analysis using MLMM has been conducted for both cross-sectional and longitudinal data structures. Although PISA score data have been the subject of several studies in Indonesia, simultaneous analysis of all three PISA scores using MLMM has not been done before. Similarly, simultaneous analysis of average UN scores for high school students majoring in science in West Java, with a longitudinal data structure and added random effects, has not previously been conducted using MLMM. The results indicate that adding random effects to the model reduces the standard errors of the fixed parameter estimates. Additionally, for PISA score data, schools as random effects contribute significantly to the variance. This finding is consistent with the empirical study of the average UN scores for high school students majoring in science in West Java, where the estimated variance components show significant variation between schools in the average UN scores for the six subjects.
To handle complex data structures involving multiple outcomes and two-level random effects in longitudinal data, a Multivariate Multilevel Linear Mixed Model (MMLMM) has been developed. In this model, the random effects are assumed to follow a normal distribution, and the measurement time is treated as neither a fixed effect nor a random effect. The fixed effects are estimated by Maximum Likelihood Estimation (MLE), while the variance components are estimated by Restricted Maximum Likelihood (REML).
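As a rough illustration of the kind of structure described above, one generic way to write a multivariate model with two levels of normally distributed random intercepts is sketched below. This form is not taken from the dissertation. The indices (i for a higher-level group, j for a unit such as a school within that group, t for the measurement occasion, k for the response variable) and all symbols are assumptions chosen for illustration, and the actual MMLMM specification may differ.

```latex
% Illustrative only: a generic two-level random-intercept form.
% All indices and symbols are assumptions, not the dissertation's notation.
y_{ijtk} = \mathbf{x}_{ijt}^{\top}\boldsymbol{\beta}_k + u_{ik} + v_{ijk} + \varepsilon_{ijtk},
\qquad
\mathbf{u}_i \sim N(\mathbf{0}, \boldsymbol{\Sigma}_u), \quad
\mathbf{v}_{ij} \sim N(\mathbf{0}, \boldsymbol{\Sigma}_v), \quad
\boldsymbol{\varepsilon}_{ijt} \sim N(\mathbf{0}, \boldsymbol{\Sigma}_{\varepsilon})
```

Here the vectors u_i, v_ij, and ε_ijt stack the K response-specific terms, so the off-diagonal entries of each covariance matrix capture dependence between responses. Time enters only through the index t, in line with the statement that measurement time is treated as neither a fixed nor a random effect.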
The performance of MMLMM is assessed through simulation studies using two properties of the estimators: relative bias and Root Mean Square Error prediction (RMSEp). The results show only a slight relative bias in all of the parameter estimators. The relative bias decreases and approaches 0 as the sample size increases, and the RMSEp value also becomes smaller. Thus, the proposed model performs better with large sample sizes. This indicates that the proposed model produces fixed parameter estimates that are unbiased, consistent, and of minimum variance. Based on the simulation study results, the performance of MMLMM is superior to that of MLMM.
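The two evaluation measures can be sketched in a few lines of Python. This is a minimal illustration using the usual textbook definitions of relative bias and root mean square error over simulation replicates; the dissertation's exact RMSEp formula may differ. The toy estimator (a sample mean of normal data) and the function names are assumptions that stand in for a fitted mixed model.

```python
import numpy as np


def relative_bias(estimates, true_value):
    """Relative bias: (mean of the estimates - true value) / true value."""
    estimates = np.asarray(estimates, dtype=float)
    return (estimates.mean() - true_value) / true_value


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))


def simulate_mean_estimates(n, true_value=2.0, sd=1.0, replicates=2000, seed=0):
    """Toy stand-in for a model fit (hypothetical helper).

    Each replicate draws n normal observations and returns their sample
    mean as the 'parameter estimate'.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(true_value, sd, size=(replicates, n))
    return samples.mean(axis=1)


if __name__ == "__main__":
    # Both measures are reported for increasing sample sizes.
    for n in (25, 100, 400):
        est = simulate_mean_estimates(n)
        print(n, relative_bias(est, 2.0), rmse(est, 2.0))
```

In a real study, `simulate_mean_estimates` would be replaced by repeated data generation and model fitting, with one estimate per replicate for each parameter.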
The subsequent empirical study applies MMLMM to the average national exam scores of senior high schools specializing in natural sciences in West Java Province. The results indicate that the standard errors of nearly all parameter estimates from the MMLMM analysis are smaller than those from MLMM. The proposed model also yields much lower AIC and BIC values than the common model. This is strong evidence that MMLMM outperforms MLMM.
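For reference, the AIC and BIC comparison rests on the standard definitions below, where lower values indicate a better trade-off between fit and complexity. This is a minimal sketch; the function names are illustrative, and the log-likelihood values in the usage lines are hypothetical placeholders, not results from the study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


if __name__ == "__main__":
    # Hypothetical values for two competing models fitted to the same data.
    simpler = {"log_likelihood": -1200.0, "n_params": 10}
    richer = {"log_likelihood": -1150.0, "n_params": 14}
    n_obs = 500
    for name, m in (("simpler", simpler), ("richer", richer)):
        print(name, aic(**m), bic(**m, n_obs=n_obs))
```

Both criteria are only comparable between models fitted to the same data, and under REML only when the models share the same fixed-effects structure.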
In this dissertation, we have partially estimated the fixed effects of the model and conducted analytical parameter estimation only for the fixed parameters. A challenge for future studies is to develop computational algorithms and simulations that enable simultaneous estimation of both fixed and random effects, and to complete the analytical estimation of the variance components.