Please use this identifier to cite or link to this item: http://repository.ipb.ac.id/handle/123456789/166507
Title: Transformer-Based Model Optimization of IndoBERT and XLNet for Indonesian Hoax News Detection
Other Titles: Optimalisasi Model Berbasis Transformer IndoBERT dan XLNet untuk Deteksi Berita Hoaks Berbahasa Indonesia
Authors: Saefuddin, Asep; Rahardiantoro, Septian; Gilbran, Farrel
Issue Date: 2025
Publisher: IPB University
Abstract: The spread of hoax news on the internet has become a serious threat that can influence public opinion, create social tension, and damage the reputation of individuals or groups. Masyarakat Anti-Fitnah Indonesia (MAFINDO) identifies hoax news manually through the platform turnbackhoax.id, but this approach becomes inefficient as the volume of information grows. A machine learning-based solution is therefore needed to automate and improve hoax news detection. This study compares the performance of two transformer-based models, IndoBERT and XLNet, in classifying Indonesian-language hoax news. The dataset consists of 18,689 hoax articles from turnbackhoax.id and 10,229 non-hoax articles scraped from Detik.com. Preprocessing includes filtering, case folding, and tokenization before the data is used for modeling. To address potential class imbalance and ensure robust evaluation, stratified 3-fold cross-validation was employed. Hyperparameters were optimized with Optuna, a hyperparameter optimization framework. The results show that Indonesian-language transformer models, particularly IndoBERT, perform exceptionally well in hoax news detection. IndoBERT achieved the best performance, with an accuracy of 0.9924, precision of 0.9968, recall of 0.9914, and F1 score of 0.9941. XLNet also performed strongly but was slightly less effective than IndoBERT, suggesting it may be less well suited to Indonesian-language tasks without specific adaptation. Overall, the study shows that stratified cross-validation and proper hyperparameter tuning can effectively mitigate class imbalance. Moreover, differences in token length between the hoax and non-hoax classes influence model learning, highlighting the importance of accounting for both sample size and textual complexity.
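The stratified 3-fold split mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name `stratified_folds`, the seed, and the label encoding (1 = hoax, 0 = non-hoax) are assumptions made for illustration, and only the class counts come from the abstract. In practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used instead.

```python
import random


def stratified_folds(labels, k=3, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios roughly equal."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Shuffle within each class, then deal indices round-robin into the folds
    # so every fold receives about 1/k of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class counts from the abstract (label encoding is an assumption).
labels = [1] * 18689 + [0] * 10229
folds = stratified_folds(labels, k=3)

for fold_id, test_idx in enumerate(folds):
    hoax_share = sum(labels[i] for i in test_idx) / len(test_idx)
    print(f"fold {fold_id}: {len(test_idx)} held-out samples, hoax share {hoax_share:.4f}")
```

Each fold serves once as the held-out set while the other two are used for training, so every fold sees roughly the same hoax to non-hoax ratio as the full dataset (about 65% hoax).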
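As a quick consistency check on the IndoBERT figures reported in the abstract, the F1 score can be recomputed from the stated precision and recall, and the accuracy can be compared against a majority-class baseline derived from the class counts. The helper name `f1_from` is an assumption for illustration; the numbers are taken from the abstract.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# IndoBERT precision and recall as reported in the abstract.
precision, recall = 0.9968, 0.9914
f1 = f1_from(precision, recall)

# Accuracy of always predicting the larger class (hoax), from the dataset counts.
majority_baseline = 18689 / (18689 + 10229)

print(round(f1, 4))                 # 0.9941
print(round(majority_baseline, 4))  # 0.6463
```

The recomputed F1 matches the reported 0.9941, and the reported accuracy of 0.9924 sits well above the roughly 0.65 that a trivial majority-class predictor would reach on this dataset.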
URI: http://repository.ipb.ac.id/handle/123456789/166507
Appears in Collections: UT - Statistics and Data Sciences

Files in This Item:
File | Description | Size | Format
cover_G1401211057_4af4cc58747b4cf791b5456ac5708ac3.pdf | Cover | 626.06 kB | Adobe PDF
fulltext_G1401211057_59ae85ce342c4fad8d081b373d850f38.pdf (Restricted Access) | Fulltext | 5.12 MB | Adobe PDF
lampiran_G1401211057_13f204ce933142dd8d2ff16a078fa5e5.pdf (Restricted Access) | Appendix (Lampiran) | 351.09 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.