Text Clustering Opini Pembelajaran Daring di Indonesia Selama Masa Pandemi COVID-19 pada Media Sosial Twitter

Tyas, Maulida Fajrining

Please use this identifier to cite or link to this item: http://repository.ipb.ac.id/handle/123456789/115019

Title:	Text Clustering Opini Pembelajaran Daring di Indonesia Selama Masa Pandemi COVID-19 pada Media Sosial Twitter
Other Titles:	Text Clustering Online Learning Opinion in Indonesia during COVID-19 Pandemic using Twitter Data
Authors:	Kurnia, Anang; Soleh, Agus Mohamad; Tyas, Maulida Fajrining
Issue Date:	Oct-2022
Publisher:	IPB University
Abstract:	Circular Letter Number 36962/MPK.A/HK/2020 of the Minister of Education and Culture of the Republic of Indonesia, dated March 17, 2020, urged that learning be conducted online and that people work from home to prevent the spread of coronavirus disease (COVID-19). The social restrictions, which extended to teaching and learning activities in schools, drew both support and opposition from the public. Opinions about online learning circulated widely, especially on Twitter, where tweets can be mined for the topics people discussed about online learning during the pandemic in Indonesia. Such a collection of tweets can be analyzed with text clustering, a branch of text mining that applies unsupervised machine learning algorithms to group textual data (tweets) into clusters with shared characteristics. K-Means is widely used and performs well in text clustering. However, text clustering often runs into trouble because the available textual data is usually very large (big data) and has high-dimensional variables (features), which makes computation difficult and slow. The resulting clusters are inefficient to obtain and hard to interpret, so sampling is applied to reduce the dimensions of the data.
Sampling is intended to reduce the data in both objects and variables without lowering the semantic level of the collected tweets, so that the simplified data yields clusters that are more meaningful and equally accurate without using the entire data set. The purpose of this study is to evaluate sampling for building clusters against using the entire set of tweets, with K-Means as the clustering method. The study explores how the high-dimensionality problem of a tweet collection can be addressed with a sampling method that still describes the spread of opinion among Indonesians about online learning during the COVID-19 pandemic.
Samples were drawn from 28300 tweets at six sample sizes (250, 500, 2500, 10000, 15000 and 20000) and then pre-processed through tweet cleaning, tokenization, case-folding, non-standard word handling and stopword removal. The pre-processed tweets were converted into a document-term matrix holding the TF-IDF weight of each word per tweet, which served as input to the clustering algorithm. The number of clusters k was optimized over k = 2 to k = 10, with 10 runs for each value, based on the silhouette value that produced the smallest standard deviation. The optimum clusters were then visualized with word clouds to identify the topics formed. This process was repeated 10 times for each sample size, and a topic that appeared in more than 50% of the 10 repetitions was considered a representative cluster.
Over the 10 repetitions, sample sizes 250 and 500 captured only 1 of 10 topics, sample sizes 2500 and 10000 captured 4 of 10 topics, sample size 15000 captured 8 of 10 topics and sample size 20000 captured 7 of 10 topics. Sample sizes below 50% of the total data tended to have a lower percentage of topic occurrence, around 26%-39%. The percentage of topic occurrence did not change for sample sizes of 50% and above, so the smaller of those sample sizes is the more efficient choice. In terms of execution time, sample size 15000 was faster than both sample size 20000 and the entire set of tweets: a sample of roughly 50% took about 5 minutes to run, while the entire set of tweets took up to 10 minutes.
Sampling can therefore serve as a solution for reducing the object and variable dimensions of textual data while still obtaining optimal clustering results. A sample of about 50% of the total tweets is enough to produce representative clusters and is efficient in execution time, running twice as fast as clustering the entire set of tweets.
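The sampling and weighting steps described in the abstract can be sketched as follows. This is a minimal illustration only, not the thesis code: the abstract does not state the sampling scheme or the TF-IDF variant, so simple random sampling without replacement and the weight "raw term count times log(N / document frequency)" are assumptions. The function names draw_sample and tfidf_matrix and the toy tweets are hypothetical.

```python
import math
import random
from collections import Counter


def draw_sample(tweets, size, seed=0):
    """Draw a simple random sample without replacement (assumed scheme)."""
    rng = random.Random(seed)
    return rng.sample(tweets, size)


def tfidf_matrix(docs):
    """Build a document-term matrix of TF-IDF weights from tokenized docs.

    Assumed variant: raw term count * log(N / document frequency).
    Returns the sorted vocabulary and one row of weights per document.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, matrix


# Toy tweets, already pre-processed and tokenized (hypothetical data).
tweets = [
    ["kuliah", "daring", "sinyal", "susah"],
    ["belajar", "daring", "tugas", "banyak"],
    ["kuota", "internet", "daring", "habis"],
    ["tugas", "daring", "numpuk"],
]

sample = draw_sample(tweets, size=3, seed=42)
vocab, dtm = tfidf_matrix(sample)
print(len(dtm), "documents x", len(vocab), "terms")
```

Under this weighting, a word that occurs in every sampled tweet (here "daring") gets weight 0, which illustrates why the vocabulary, and so the number of variables, depends on which tweets are sampled. The rows of the matrix would then be passed to a clustering routine such as K-Means.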
URI:	http://repository.ipb.ac.id/handle/123456789/115019
Appears in Collections:	MT - Mathematics and Natural Science

Files in This Item:

File	Access	Description	Size	Format
Cover, Lembar Pengesahan, Prakata, Daftar Isi.pdf	Restricted	Cover	3.31 MB	Adobe PDF
G1501202041_Maulida Fajrining Tyas.pdf	Restricted	Full text	8.42 MB	Adobe PDF
Lampiran.pdf	Restricted	Appendix	1.66 MB	Adobe PDF
