Kajian Kinerja Metode Regresi LAD-LASSO dan WLAD-LASSO pada Data Dimensi Tinggi yang Mengandung Pencilan

Cahya, Septa Dwi

dc.contributor.advisor	Sartono, Bagus
dc.contributor.advisor	Indahwati
dc.contributor.author	Cahya, Septa Dwi
dc.date.accessioned	2022-07-20T07:02:36Z
dc.date.available	2022-07-20T07:02:36Z
dc.date.issued	2022
dc.identifier.uri	http://repository.ipb.ac.id/handle/123456789/112678
dc.description.abstract	Least Absolute Deviation (LAD) is an alternative method for handling outliers, but it is very sensitive to outliers in the explanatory variables. Weighted LAD (WLAD) was introduced to handle outliers in the explanatory variables. In some research fields, data sets contain far more explanatory variables than observations, which is known as high-dimensional data (p≫n). This condition can cause multicollinearity, which inflates the variance of the parameter estimates. Best Subset, Stepwise, and Ridge Regression can reduce the variance of the estimates at the cost of a small bias, but they are unstable in estimating the model, and the model becomes harder to interpret when very many explanatory variables are used. The Least Absolute Shrinkage and Selection Operator (LASSO) can overcome these problems, but it is sensitive to outliers. A variable selection method that is robust to outliers is therefore needed, such as LAD-LASSO and WLAD-LASSO, which are based on the LAD approach. This study aims to evaluate the performance of LAD-LASSO and WLAD-LASSO on high-dimensional and low-dimensional data containing outliers, and then to apply the best method from the simulation to identify active compounds that serve as antioxidant markers in sembung leaf extract. The identified compounds can be developed as natural antioxidants and potentially as medicinal ingredients. To evaluate the performance of LAD-LASSO and WLAD-LASSO, a simulation study is conducted on high-dimensional data (n=50, p=100) and low-dimensional data (n=50, p=10). Both methods are also compared with LASSO. The simulation uses three scenarios: no outliers, outliers in the response variable (5%, 10%, 15%), and outliers in both the response variable (5%, 10%, 15%) and the explanatory variables (20%).
This study uses the Minimum Regularized Covariance Determinant (MRCD) estimator to compute the weights in WLAD-LASSO because this estimator can be applied to both low-dimensional and high-dimensional data. The best method from the simulation is applied to the sembung leaf extract data to identify antioxidant marker compounds in the extract. For this purpose, the data are split into 80% training data and 20% testing data, and the optimum λ is chosen by cross-validation. The simulation results show that, when the response variable contains outliers, WLAD-LASSO and LASSO tend to select more significant variables than LAD-LASSO. The prediction error of WLAD-LASSO and LASSO is lower than that of LAD-LASSO. The number of insignificant variables selected follows the order LASSO > WLAD-LASSO > LAD-LASSO. When outliers are also present in the explanatory variables, LASSO selects more insignificant variables than WLAD-LASSO and LAD-LASSO. This is because LASSO relies on ordinary least squares (OLS), so in the presence of outliers it tends to misidentify the zero coefficients, whereas LAD-LASSO and WLAD-LASSO minimize the sum of absolute residuals and are therefore more robust to outliers in the response variable. Moreover, the weights w_i make WLAD-LASSO more robust than LAD-LASSO to outliers in both the response and the explanatory variables. The performance of LASSO, LAD-LASSO, and WLAD-LASSO on high-dimensional data is lower than on low-dimensional data. The performance of the three methods also tends to decrease as the correlation among the explanatory variables and the outlier rate increase. Exploration of the sembung leaf extract data shows that it is high-dimensional data containing about 14.28% outliers in the response variable and about 25.71% in the explanatory variables.
WLAD-LASSO is applied to these data because the simulation results show that it performs well in correctly identifying significant variables. Cross-validation gives an optimum λ of 0.008 for WLAD-LASSO, and the selected compounds/formulas are umbelliferone, 12-hydroxyjasmonic acid, C22H14N8O2, and acetyleugenol. From this study it can be concluded that WLAD-LASSO falls between LASSO and LAD-LASSO and is the best at correctly identifying significant variables (it overcomes the weakness of LAD-LASSO, which selects very few significant variables, and the weakness of LASSO, which selects very many insignificant variables). The performance of the methods on high-dimensional data is lower than on low-dimensional data. Applying WLAD-LASSO yields the antioxidant marker compounds/formulas in sembung leaf extract, namely umbelliferone, 12-hydroxyjasmonic acid, C22H14N8O2, and acetyleugenol, with prediction error MAD = 0.1331.	id
dc.description.abstract	Least Absolute Deviation (LAD) is an alternative method for handling outliers, but it is very sensitive to outliers in the explanatory variables. Weighted LAD (WLAD) was introduced to deal with outliers in the explanatory variables. In several research areas, it is common to have a data set with far more explanatory variables than observations, called high-dimensional data (p≫n). This condition can lead to multicollinearity, so the variance of the parameter estimates becomes large. Best Subset, Stepwise, and Ridge Regression can reduce the variance of the estimates at the cost of a small bias, but they are unstable in estimating the model, and the model becomes harder to interpret when the number of explanatory variables is very large. The Least Absolute Shrinkage and Selection Operator (LASSO) can overcome these problems, but it is sensitive to outliers. Robust variable selection methods are therefore needed, such as LAD-LASSO and WLAD-LASSO, which are based on the LAD approach. This research aims to evaluate the performance of LAD-LASSO and WLAD-LASSO on high-dimensional and low-dimensional data containing outliers. The best method is then applied to identify antioxidant marker compounds in sembung leaf extract. The identified compounds can be developed as natural antioxidants and potentially as medicinal ingredients. To evaluate the performance of LAD-LASSO and WLAD-LASSO, a simulation study is conducted on high-dimensional data (n=50, p=100) and low-dimensional data (n=50, p=10). The methods are also compared with LASSO. The simulation study uses three scenarios: no outliers, outliers in the response variable (5%, 10%, 15%), and outliers in both the response variable (5%, 10%, 15%) and the explanatory variables (20%).
This research uses the Minimum Regularized Covariance Determinant (MRCD) estimator to calculate the weights in WLAD-LASSO because this estimator can be applied to both low-dimensional and high-dimensional data. The best method is then applied to sembung leaf extract data to identify antioxidant marker compounds in the extract. For this purpose, the data are divided into 80% training data and 20% testing data, and the optimum λ is chosen by cross-validation. The simulation results show that, when outliers exist in the response variable, WLAD-LASSO and LASSO tend to select more significant variables than LAD-LASSO. The prediction error of WLAD-LASSO and LASSO is lower than that of LAD-LASSO. WLAD-LASSO selects fewer insignificant variables than LASSO but more than LAD-LASSO. When outliers are also present in the explanatory variables, LASSO selects more insignificant variables than WLAD-LASSO and LAD-LASSO. This is because LASSO is based on ordinary least squares (OLS), so outliers cause it to misidentify the zero coefficients, whereas LAD-LASSO and WLAD-LASSO minimize the sum of absolute residuals and are therefore more robust to outliers in the response variable. Moreover, the weights make WLAD-LASSO more robust than LAD-LASSO to outliers in both the response and the explanatory variables. The performance of LASSO, LAD-LASSO, and WLAD-LASSO on high-dimensional data is lower than on low-dimensional data. The performance of these methods also tends to decrease as the correlation among the explanatory variables and the outlier rate increase. Exploration of the sembung leaf extract data shows that it is high-dimensional data containing about 14.28% outliers in the response variable and about 25.71% in the explanatory variables.
WLAD-LASSO is applied to these data because the simulation results show that it performs well in correctly identifying significant variables. Cross-validation gives an optimum λ of 0.008 for WLAD-LASSO, and the selected compounds/formulas are umbelliferone, 12-hydroxyjasmonic acid, C22H14N8O2, and acetyleugenol. From this research it can be concluded that WLAD-LASSO falls between LASSO and LAD-LASSO and performs best in correctly identifying the significant variables (it overcomes the weakness of LAD-LASSO, which selects very few significant variables, and the weakness of LASSO, which selects many insignificant variables). The performance of these methods on high-dimensional data is lower than on low-dimensional data. WLAD-LASSO was then applied to identify antioxidant marker compounds in sembung leaf extract. The compounds/formulas obtained are umbelliferone, 12-hydroxyjasmonic acid, C22H14N8O2, and acetyleugenol, with prediction error MAD = 0.1331.	id
dc.language.iso	id	id
dc.publisher	IPB University	id
dc.title	Kajian Kinerja Metode Regresi LAD-LASSO dan WLAD-LASSO pada Data Dimensi Tinggi yang Mengandung Pencilan	id
dc.title.alternative	Performance Study of Regression Methods LAD-LASSO and WLAD-LASSO in High Dimensional Data Containing Outliers	id
dc.type	Thesis	id
dc.subject.keyword	High dimensional data	id
dc.subject.keyword	LAD-LASSO	id
dc.subject.keyword	multicollinearity	id
dc.subject.keyword	outliers	id
dc.subject.keyword	WLAD-LASSO	id
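
The abstract describes WLAD-LASSO as a LAD-based regression with observation weights and a LASSO penalty. The following is a minimal sketch of that idea, not the thesis implementation. It assumes the objective sum_i w_i*|y_i - b0 - x_i'b| + lam*sum_j |b_j| (the exact scaling of the penalty in the thesis is not given in this record) and solves it as a linear program with SciPy. The function name `wlad_lasso`, the simulated data, the value of `lam`, and the hand-set weights are all hypothetical; in particular, the MRCD-based weights used in the thesis are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog


def wlad_lasso(X, y, lam, weights=None):
    """Minimise sum_i w_i*|y_i - b0 - x_i'b| + lam*sum_j |b_j| as a linear program.

    With weights=None all w_i equal 1, which gives plain LAD-LASSO.
    """
    n, p = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    # Variable order: b_plus (p), b_minus (p), b0_plus, b0_minus, u (n), v (n).
    # All variables are >= 0; b = b_plus - b_minus and residual = u - v.
    c = np.concatenate([np.full(2 * p, lam), [0.0, 0.0], w, w])
    ones = np.ones((n, 1))
    A_eq = np.hstack([X, -X, ones, -ones, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    beta = res.x[:p] - res.x[p:2 * p]
    intercept = res.x[2 * p] - res.x[2 * p + 1]
    return intercept, beta


# Illustrative data only: the contamination scheme below is an assumption,
# not the simulation design of the thesis.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, 1.5, 2.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y[:5] += 20.0    # 10% outliers in the response
X[5:15] += 10.0  # 20% of rows shifted after y was generated (leverage points)

# Hypothetical weights: down-weight the rows we contaminated by hand.
# The thesis derives the weights from MRCD estimates instead.
weights = np.ones(n)
weights[5:15] = 0.1

lam = 5.0  # arbitrary value; the thesis tunes lambda by cross-validation
b0_lad, beta_lad = wlad_lasso(X, y, lam)
b0_wlad, beta_wlad = wlad_lasso(X, y, lam, weights)
print("LAD-LASSO coefficients: ", np.round(beta_lad, 3))
print("WLAD-LASSO coefficients:", np.round(beta_wlad, 3))
```

Splitting each coefficient and residual into non-negative parts is what turns the absolute values into linear terms; the dense constraint matrix is acceptable at the sizes mentioned in the abstract (n=50, p up to 100) but would need a sparse formulation for much larger problems.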

Files in this item

Name:: Cover, Lembar pernyataan, ...
Size:: 37.48Mb
Format:: PDF
Description:: Cover

Name:: G1501201037_Septa Dwi Cahya.pdf
Size:: 13.77Mb
Format:: PDF
Description:: Full text

Name:: Lampiran.pdf
Size:: 6.895Mb
Format:: PDF
Description:: Appendix

This item appears in the following Collection(s)

MT - Mathematics and Natural Science [3884]
