Performa Model Ensemble Berbasis Pohon pada Klasifikasi Penyakit Retinopati Diabetik

Nurizki, Anisa

dc.contributor.advisor	Fitrianto, Anwar
dc.contributor.advisor	Soleh, Agus Mohamad
dc.contributor.author	Nurizki, Anisa
dc.date.accessioned	2024-05-19T23:59:21Z
dc.date.available	2024-05-19T23:59:21Z
dc.date.issued	2024
dc.identifier.uri	http://repository.ipb.ac.id/handle/123456789/150468
dc.description.abstract	Classification is a stage of model identification that separates data into classes. One commonly used classification model is the classification tree. Although easy to use, a classification tree is sensitive to changes in the training data, which can make the model unstable. Ensemble models address this by combining several base models to obtain better predictions. Random forest (RF), for example, combines trees built with the bagging technique. Other ensemble models, such as double random forest (DRF) and extremely randomized trees (ET), also improve accuracy by growing larger trees. Extreme gradient boosting (XGBoost), an extension of gradient boosting, is a boosting-based classification model with fast computation time and good performance compared with its predecessor. The classification ability of ensemble-tree models motivates a closer study of the characteristics of each model. These characteristics were studied using fifteen datasets from the UCI Machine Learning Repository dated 2012 to 2022. The datasets vary in the number of variables and observations. Six of them have a high class imbalance ratio and produced unsatisfactory model performance, so the imbalance was handled with the SMOTE method. The next step was to identify possible underfitting in the RF model, that is, to test RF performance when the trees it grows are not large enough. The RF model showed possible underfitting on eight datasets. Its performance was then compared with that of the DRF, ET, and XGBoost models. The same comparison was made on the seven remaining datasets, where the RF model did not underfit.
Underfitting identification, model training, and performance evaluation were repeated 100 times to measure stability and to observe patterns in the performance of the ensemble-tree models. Based on balanced accuracy, the RF model still performed well despite underfitting on two datasets. As a comparison model, ET improved on the underfitted RF model on four datasets, while DRF and XGBoost each improved on it on one dataset. Where the RF model did not underfit, RF performed well on two datasets, ET on three, and DRF on one. Mean balanced accuracy also reveals characteristics of the ensemble-tree models. The most prominent characteristic in this study is the computational speed of the ET model during training. Data dominated by categorical variables are also better modeled with ET. In addition, ET and RF excel when the classes of the response variable have a fairly large imbalance ratio. When the RF model may be underfitting, ET and DRF are good alternatives for the classification task. These characteristics can serve as guidelines for applying ensemble-tree models to a given dataset. Nevertheless, the differences in mean balanced accuracy between models are relatively small, and the mean and standard deviation of the F1 score likewise show similar performance across models. Both metrics therefore suggest that the four models perform about equally well. A Kruskal-Wallis test was conducted to check this assumption, and its results showed that the four ensemble-tree models have relatively similar performance.
Following the analysis of characteristics and performance on the UCI data, the four models were evaluated on a diabetic retinopathy dataset obtained from the Padang Eye Center eye hospital (Rumah Sakit Khusus Mata Padang Eye Center). The models were applied to data from macular Optical Coherence Tomography (OCT) scans. The strength of ensemble-tree models in classification was the main reason for applying them to the macular OCT data. Before classification, the data were divided into four scenarios in order to find the best data condition and its fit with the ensemble-tree models. The results showed that ensemble-tree models are effective in classifying diabetic retinopathy. Scenario III, which compares diabetic retinopathy patients with undiagnosed (healthy) subjects, was considered the optimal data condition because it yielded the best sensitivity and accuracy values. Ensemble-tree models used with scenario III therefore offer an effective solution to the diabetic retinopathy classification problem.	en
dc.language.iso	id	id
dc.publisher	IPB University	id
dc.title	Performa Model Ensemble Berbasis Pohon pada Klasifikasi Penyakit Retinopati Diabetik	id
dc.title.alternative	PERFORMANCE OF TREE-BASED ENSEMBLE MODELS ON DIABETIC RETINOPATHY DISEASE CLASSIFICATION	id
dc.type	Thesis	id
dc.subject.keyword	ensemble-tree	id
dc.subject.keyword	classification	id
dc.subject.keyword	diabetic retinopathy	id
dc.subject.keyword	random forest	id
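
Illustrative sketch (not part of the original record): the abstract evaluates the models with balanced accuracy and compares them with a Kruskal-Wallis test. The Python below is a minimal, standard-library sketch of those two calculations, assuming one balanced-accuracy score per repetition for each model. The function names, the example scores, and the use of the 5% chi-square critical value are assumptions for illustration only, not the thesis code or its results.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the mean of the per-class recall values."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def average_ranks(values):
    """Return 1-based ranks (ties share their mean rank) and tie group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:  # every observation identical: no group differences
        return 0.0
    return h / correction


# Hypothetical balanced-accuracy scores from repeated runs (made-up numbers).
scores = {
    "RF": [0.81, 0.80, 0.82, 0.79, 0.81],
    "DRF": [0.80, 0.82, 0.81, 0.80, 0.79],
    "ET": [0.82, 0.81, 0.83, 0.80, 0.82],
    "XGBoost": [0.80, 0.79, 0.81, 0.82, 0.80],
}
h_stat = kruskal_wallis_h(list(scores.values()))
# Chi-square critical value for 3 degrees of freedom (four groups) at 5%.
CRITICAL_DF3_05 = 7.815
print(f"H = {h_stat:.3f}; differs at 5% level: {h_stat > CRITICAL_DF3_05}")
```

With four models the statistic is compared against a chi-square distribution with three degrees of freedom; an H value below the critical value is consistent with the abstract's finding that the models perform similarly. In practice a library routine such as SciPy's `scipy.stats.kruskal` would also return the p-value.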

Files in this item

Name:: Cover, Approval Sheet, Preface, ...
Size:: 471.2Kb
Format:: PDF
Description:: Cover

Name:: G1501211024_Anisa Nurizki.pdf
Size:: 1015.Kb
Format:: PDF
Description:: Fulltext

Name:: Lampiran.pdf
Size:: 303.0Kb
Format:: PDF
Description:: Appendix

This item appears in the following Collection(s)

MT - Mathematics and Natural Science [3906]