Analisis Sentimen Bahasa Indonesia pada Twitter Menggunakan Struktur Tree Berbasis Leksikon

Saputra, Feby Tri

dc.contributor.advisor	Nurhadryani, Yani
dc.contributor.advisor	Wijaya, Sony Hartono
dc.contributor.advisor	Defina
dc.contributor.author	Saputra, Feby Tri
dc.date.accessioned	2021-02-01T01:15:50Z
dc.date.available	2021-02-01T01:15:50Z
dc.date.issued	2020
dc.identifier.uri	http://repository.ipb.ac.id/handle/123456789/105583
dc.description.abstract	Jumlah opini publik yang tersebar di berbagai media semakin meningkat seiring berkembangnya teknologi, informasi, dan komunikasi. Melihat hal tersebut, tidak mungkin untuk membaca setiap opini yang disampaikan satu per satu untuk mendapatkan sentimen yang disampaikan oleh masyarakat. Analisis sentimen atau biasa disebut opinion mining dapat dijadikan suatu solusi untuk mendapatkan sentimen publik. Analisis sentimen dapat mengklasifikasikan teks yang mengandung opini sebagai kelas positif, negatif, atau netral secara otomatis. Analisis sentimen berbasis leksikon adalah salah satu pendekatan yang dapat menghasilkan performa yang baik pada lintas topik pembicaraan, dapat dengan mudah ditingkatkan dengan berbagai sumber pengetahuan, dan tidak memerlukan pelatihan lebih lanjut. Secara umum, pendekatan dalam analisis sentimen terbagi ke dalam dua hal, yaitu pendekatan berbasis corpus dan pendekatan berbasis leksikon. Pendekatan berbasis corpus memiliki akurasi yang lebih tinggi dibandingkan dengan pendekatan berbasis leksikon. Namun, pendekatan berbasis corpus sangat bergantung pada kualitas dan jumlah data pelatihan. Sebaliknya, metode berbasis leksikon menghasilkan performa yang baik pada lintas topik pembicaraan, dapat dengan mudah ditingkatkan dengan berbagai sumber pengetahuan, dan tidak memerlukan pelatihan lebih lanjut. Penelitian analisis sentimen berbasis leksikon pada bahasa Indonesia telah banyak dilakukan. Penelitian yang telah dilakukan sebelumnya menjelaskan bahwa sentimen leksikon Indonesia lebih cocok digunakan pada analisis sentimen bahasa Indonesia dibandingkan sentimen leksikon Inggris seperti SentiWordNet. Pada penelitian-penelitian sebelumnya, analisis sentimen berbasis leksikon hanya menghitung frekuensi kemunculan kata berdasarkan sentimen leksikon saja tanpa memperhatikan makna dalam hubungan antarkata. 
Selain itu, kekurangan lainnya pada penelitian sebelumnya adalah belum dapat menangani bahasa tidak baku dan sentimen leksikon Indonesia yang digunakan masih relatif sedikit. Hubungan antarkata merupakan suatu hal yang penting dalam analisis sentimen. Jika hubungan antarkata tidak diperhatikan, dapat menyebabkan hilangnya informasi pada hubungan setiap kata. Adapun hubungan antarkata dapat mengubah polaritas sentimen (positif, negatif, atau netral) pada teks. Hubungan antarkata dalam suatu teks dapat direpresentasikan dengan baik menggunakan struktur tree sebagai suatu hierarki pembentukan kalimat. Penelitian analisis sentimen berbasis tree sebelumnya hanya memanfaatkan struktur tree sebagai ekstraksi fitur yang hasilnya kemudian digunakan pada pelatihan berbasis machine learning dan deep learning. Namun, penelitian tersebut memiliki biaya komputasi yang lebih besar dibandingkan analisis sentimen berbasis leksikon karena melakukan konstruksi tree sekaligus pelatihan data. Oleh karena itu, struktur tree digunakan pada pendekatan berbasis leksikon untuk memperkecil biaya komputasi. Berdasarkan permasalahan pada penelitian sebelumnya, penelitian ini bertujuan untuk meningkatkan performa analisis sentimen berbasis leksikon sebelumnya dengan menangani kata tidak baku, menambahkan kata pada sentimen leksikon, dan menggunakan struktur tree sebagai interpretasi hubungan antarkata dalam suatu teks. Hasil penelitian ini diharapkan dapat menghasilkan model analisis sentimen berbasis leksikon yang memiliki performa tinggi dan stabil pada berbagai lintas topik tanpa pelatihan. Penelitian ini menggunakan dua metode yaitu metode tanpa tree dan berbasis tree. Metode tanpa tree hanya menghitung frekuensi kemunculan kata dalam penentuan hasil klasifikasi. Adapun hasil klasifikasinya ditentukan berdasarkan perbandingan frekuensi kemunculan kata positif dan negatif. Jika frekuensi kata positif lebih besar, klasifikasinya bernilai positif. Hal tersebut berlaku sebaliknya. 
Namun, jika frekuensi kata positif dan negatif bernilai sama, hasil klasifikasinya bernilai netral. Sedangkan metode berbasis tree, menggunakan struktur tree sebagai interpretasi kata dalam suatu teks. Pembentukan tree diawali dengan membuat subtree pada klausa. Jika teks termasuk ke dalam kalimat majemuk, dilakukan pemisahan klausa berdasarkan kata konjungsi yang menghubungkannya. Setiap subtree yang telah dibentuk disatukan kembali oleh node yang bernilai kata konjungsi yang menghubungkannya. Hasil klasifikasi pada metode berbasis tree ditentukan oleh aturan-aturan yang diusulkan seperti menggunakan sifat perkalian kedua bilangan positif dan negatif, sifat logika matematika seperti konjungsi dan disjungsi, dll. Penentuan hasil klasifikasi dilakukan dengan melakukan penelusuran tree dari level terendah (bottom-up). Penelitian ini membandingkan performa kedua metode tersebut menggunakan pengujian akurasi dan weighted f1-measure. Analisis sentimen yang diujikan merupakan data lintas topik seperti data twit Pilgub Jabar 2018, Pilpres 2019, dan pandemik COVID-19. Penelitian ini menunjukkan bahwa analisis sentimen berbasis leksikon sangat bergantung pada jumlah dan keberagaman kata di dalamnya. Maka dari itu sangat penting untuk memiliki sentimen leksikon yang memiliki jumlah kata yang banyak dan beragam. Penelitian ini menghasilkan sentimen leksikon Indonesia dengan jumlah kata positif dari 1 075 menjadi 2 777 dan kata negatif dari 1 825 menjadi 3 107. Selain itu, data pembakuan kata yang lebih besar terbukti dapat menjaring lebih banyak fitur kata yang tidak dapat dijaring sebelumnya. Penelitian ini menghasilkan data pembakuan kata dengan jumlah dari 3 720 menjadi 5 198. Hasil uji akurasi menunjukkan bahwa penambahan sentimen leksikon dapat meningkatkan akurasi sebesar 4.07-7.24% pada metode tanpa tree dan 7.86-13.45% pada metode berbasis tree. 
Pengujian akurasi setelah penambahan sentimen leksikon menunjukkan metode berbasis tree menghasilkan performa yang stabil pada beberapa lintas topik. Hal ini terbukti dari nilai standar deviasi akurasi metode berbasis tree lebih rendah dibandingkan metode tanpa tree pada sebelum penambahan leksikon (2.18% dibandingkan 5.36%) maupun setelah penambahan leksikon (0.97% dibandingkan 3.78%). Perbandingan nilai akurasi dan weighted f1-measure menggunakan rentang proporsi menunjukkan bahwa akurasi kedua metode umumnya valid di seluruh data uji. Namun, akurasi metode tanpa tree pada data Pilpres 2019 tidak valid. Hal ini menjelaskan bahwa metode tanpa tree tidak dapat menangani data tidak seimbang yang berukuran besar. Berdasarkan seluruh pengujian, metode berbasis tree terbukti dapat menghasilkan performa yang lebih baik dan stabil pada lintas topik.	id
dc.description.abstract	The volume of public opinion spread across various media keeps growing along with the development of technology, information, and communication. At this scale, it is not feasible to read every opinion one by one to learn the sentiment the public is expressing. Sentiment analysis, also called opinion mining, offers a way to obtain public sentiment by automatically classifying opinionated text as positive, negative, or neutral. In general, approaches to sentiment analysis fall into two groups: corpus-based and lexicon-based. The corpus-based approach achieves higher accuracy than the lexicon-based approach, but it depends heavily on the quality and quantity of training data. In contrast, the lexicon-based approach performs well across topics of conversation, can easily be enhanced with additional knowledge sources, and requires no further training. Lexicon-based sentiment analysis for Indonesian has been studied extensively. Previous research found that an Indonesian sentiment lexicon is better suited to Indonesian sentiment analysis than English sentiment lexicons such as SentiWordNet. However, previous lexicon-based studies only counted the frequency of words found in the sentiment lexicon, without considering the meaning carried by the relationships between words. They also could not handle non-standard language, and the Indonesian sentiment lexicons they used were still relatively small. The relationship between words is important in sentiment analysis. 
If the relationship between words is not considered, the information it carries is lost, and that relationship can change the sentiment polarity (positive, negative, or neutral) of a text. Relationships between words in a text can be represented well by a tree structure that serves as a hierarchy of sentence formation. Previous tree-based sentiment analysis research used the tree structure only for feature extraction, with the extracted features then used to train machine learning and deep learning models. That research has a higher computational cost than lexicon-based sentiment analysis because it performs both tree construction and data training. This study therefore uses the tree structure within a lexicon-based approach to reduce computational cost. Based on the problems in previous research, this study aims to improve the performance of earlier lexicon-based sentiment analysis by handling non-standard words, adding words to the sentiment lexicon, and using a tree structure to interpret the relationships between words in a text. The study is expected to produce a lexicon-based sentiment analysis model with high and stable performance across topics without training. This study uses two methods: a non-tree method and a tree-based method. The non-tree method only counts word occurrences to determine the classification result, by comparing the frequency of positive and negative words. If positive words are more frequent, the text is classified as positive; if negative words are more frequent, it is classified as negative. If the two frequencies are equal, the text is classified as neutral. The tree-based method, in contrast, uses a tree structure to interpret the words in a text. Tree formation begins by building a subtree for each clause. 
If the text is a compound sentence, its clauses are separated at the conjunctions that connect them. The resulting subtrees are then joined by a node whose value is the connecting conjunction. Classification in the tree-based method is determined by proposed rules, such as the sign rules for multiplying positive and negative numbers and logical operations such as conjunction and disjunction. The classification result is obtained by traversing the tree bottom-up, starting from the lowest level. This study compares the performance of the two methods using accuracy and the weighted F1-measure. The test data span several topics: tweets about the 2018 West Java Governor Election, the 2019 Presidential Election, and the COVID-19 pandemic. The study shows that lexicon-based sentiment analysis depends heavily on the number and variety of words in the lexicon, so it is very important to have a large and varied sentiment lexicon. This study expanded the Indonesian sentiment lexicon from 1 075 to 2 777 positive words and from 1 825 to 3 107 negative words. Larger word standardization data also proved able to capture word features that could not be captured before; this study expanded the word standardization data from 3 720 to 5 198 entries. Accuracy tests show that expanding the sentiment lexicon increased accuracy by 4.07-7.24% for the non-tree method and by 7.86-13.45% for the tree-based method. Accuracy testing after the lexicon expansion shows that the tree-based method performs stably across topics. This is shown by the standard deviation of accuracy, which is lower for the tree-based method than for the non-tree method both before the lexicon expansion (2.18% compared to 5.36%) and after it (0.97% compared to 3.78%). 
Comparing the accuracy and weighted F1-measure values using a proportion range shows that the accuracy of both methods is generally valid across all test data. However, the accuracy of the non-tree method on the 2019 Presidential Election data is not valid, which indicates that the non-tree method cannot handle large imbalanced data. Across all tests, the tree-based method shows better and more stable performance across topics than the non-tree method.	id
dc.language.iso	id	id
dc.publisher	IPB University	id
dc.title	Analisis Sentimen Bahasa Indonesia pada Twitter Menggunakan Struktur Tree Berbasis Leksikon	id
dc.title.alternative	Indonesian Sentiment Analysis on Twitter Using a Lexicon-Based Tree Structure	id
dc.type	Thesis	id
dc.subject.keyword	lexicon-based	id
dc.subject.keyword	lexicon addition	id
dc.subject.keyword	sentiment analysis	id
dc.subject.keyword	tree structure	id
dc.subject.keyword	Twitter	id
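
The non-tree method described in the abstract (compare the counts of positive and negative lexicon words; a tie is neutral) can be illustrated with a minimal Python sketch. The tiny lexicon, the whitespace tokenizer, and the function name classify_non_tree are assumptions made for illustration only; they are not taken from the thesis, whose lexicon and preprocessing are far larger and are not reproduced here.

```python
# Hypothetical toy lexicon for illustration; the thesis lexicon contains
# thousands of entries and is not reproduced here.
POSITIVE = {"bagus", "baik", "hebat"}
NEGATIVE = {"buruk", "jelek", "kecewa"}


def classify_non_tree(text: str) -> str:
    """Classify text by comparing positive and negative word counts."""
    # Assumption: simple lowercase + whitespace tokenization.
    tokens = text.lower().split()
    positive_count = sum(1 for token in tokens if token in POSITIVE)
    negative_count = sum(1 for token in tokens if token in NEGATIVE)
    if positive_count > negative_count:
        return "positive"
    if negative_count > positive_count:
        return "negative"
    return "neutral"  # equal counts, including no lexicon words at all


if __name__ == "__main__":
    print(classify_non_tree("pelayanan bagus dan hebat tapi jelek"))
    print(classify_non_tree("tidak bagus"))
```

In this sketch, "tidak bagus" ("not good") is classified as positive because the negator is ignored. That is the kind of lost word relationship the abstract gives as the motivation for the tree-based method.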

Files in this item

Name:: Cover, Lembar Pernyataan, Abstrak, ...
Size:: 1.539Mb
Format:: PDF
Description:: Cover

Name:: G651180411_Feby Tri Saputra.pdf
Size:: 1.391Mb
Format:: PDF
Description:: Fullteks

Name:: Lampiran.pdf
Size:: 877.5Kb
Format:: PDF
Description:: Lampiran

This item appears in the following Collection(s)

MT - Mathematics and Natural Science [3983]
