A Study of Sentiment Analysis Using Statistical Machine Learning Approach

Amalia, Rahmatin Nur

Please use this identifier to cite or link to this item: http://repository.ipb.ac.id/handle/123456789/132520

Title:	A Study of Sentiment Analysis Using Statistical Machine Learning Approach
Authors:	Sadik, Kusman; Notodiputro, Khairil Anwar; Amalia, Rahmatin Nur
Issue Date:	2023
Publisher:	IPB University
Abstract:	Sentiment analysis, which was introduced in the early 2000s, is a method used to analyze opinions and feelings. The goal of sentiment analysis is to determine whether a document expresses a positive or negative emotion. As Covid-19 cases spread, news related to Covid-19 often became a trending topic in the mass media. News released by the mass media can influence public opinion, which, in turn, may influence public behavior. This research is essential for understanding the public response to Covid-19 so that governments can make the right decisions in handling it. Data acquisition is the initial stage of the analysis. Web scraping is a data acquisition technique that obtains data from public sources by converting unstructured data into structured data that can be stored and analyzed in a database or spreadsheet. Conducting sentiment analysis on all available news is challenging because of the time and cost involved. Therefore, a sampling method is needed to obtain representative news for the analysis. Sampling was carried out using two-stage random sampling. The first stage selected a sample of news portals using stratified random sampling, and the second stage selected news articles using systematic random sampling. In general, available data tend to come without annotation. This is a problem because sentiment modelling requires annotated training data, with the annotation serving as the response variable. As a result, pre-analysis is needed to find representative news for the labelling process. In this research, descriptive analysis and topic modelling were carried out as the basis for selecting news to be labelled. This research used news articles related to Covid-19 in Indonesia. The news data were extracted from Indonesian news portals using web scraping methods. Only articles published between March 2020 and March 2021 were extracted.
The explanatory variables in this research result from the word embedding and Bag-of-Words processes, while the response variable is the sentiment category (positive, negative, or neutral). The research was conducted in three stages: data acquisition, pre-processing, and modelling. In the data acquisition stage, sampling and web scraping were employed to obtain news article data. Data pre-processing consisted of data cleansing, n-gram tokenization, term weighting using TF-IDF, word embedding (word2vec), and data labelling. The models used in this research are multinomial logistic regression with L1 and L2 regularization, Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel, and Convolutional Neural Network (CNN). The models were then evaluated using precision, recall, and F1-scores. To determine the best model, hypothesis testing was employed using two-way MANOVA and the Tukey test. Based on the results of topic modelling and the counting of positive and negative words, the news was grouped into nine groups. Based on the grouping results, the researcher decided to label 1,500 news articles. Articles were selected randomly within each group. The labelling was done by linguistic experts. Each article was labelled three times by three different people, and the sentiment was then determined by the vote of the three annotators. If all three annotators labelled an article differently, a different annotator re-labelled it. Based on the labelling results, news sentiment about Covid-19 in the March 2020 to March 2021 period tends to be neutral. This shows that most news portals only provide information, in the form of updates on Covid-19 cases or information related to government policies. In the modelling process, 10-fold cross-validation was employed.
Based on the modelling results, each combination of model and explanatory-variable type showed varying recall, precision, and F1-scores, so further analysis was needed to evaluate model performance. Hypothesis testing was performed using two-way MANOVA to determine the best model. To account for the effect of the fold on model performance, two-way MANOVA with group design was used to evaluate model performance. Based on the hypothesis test results, the p-values for model and for explanatory-variable type are less than α = 0.05. It can be concluded that the model and the explanatory-variable type affect model performance (recall, precision, and F1-scores). Meanwhile, their interaction has a p-value greater than α = 0.05, which indicates insufficient evidence to conclude that the interaction influences model performance. According to the Tukey test results, CNN's recall, precision, and F1-score values are higher than those of the other models. It can be concluded that CNN is the best model for sentiment modelling in this research. Then, based on the confidence interval for the difference in mean goodness of fit between models with word embedding and models with TF-IDF weighting, it can be concluded that modelling with word embedding produces a better goodness of fit than modelling with TF-IDF weighting.
URI:	http://repository.ipb.ac.id/handle/123456789/132520
Appears in Collections:	MT - Mathematics and Natural Science
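
Illustrative note on the labelling rule described in the abstract: each article receives three annotator labels, the majority label is taken, and a different annotator re-labels the article when all three disagree. The following is a minimal sketch of that rule, not the author's implementation. The function name vote_label and the tie_breaker parameter are hypothetical, and it assumes the re-labelling annotator's label is accepted as final.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_label(labels: Sequence[str], tie_breaker: Optional[str] = None) -> str:
    """Resolve one article's sentiment from three annotator labels.

    labels: the three labels given by the three annotators.
    tie_breaker: label from an additional annotator (hypothetical parameter),
        used only when all three annotators disagree.
    """
    if len(labels) != 3:
        raise ValueError("expected exactly three annotator labels")
    # most_common(1) returns the label with the highest count and that count
    label, count = Counter(labels).most_common(1)[0]
    if count >= 2:
        # At least two annotators agree: take the majority label
        return label
    # All three labels differ: assume the extra annotator's label is final
    if tie_breaker is None:
        raise ValueError("no majority; a label from another annotator is needed")
    return tie_breaker


votes = [
    vote_label(["neutral", "neutral", "positive"]),
    vote_label(["positive", "neutral", "negative"], tie_breaker="negative"),
]
```

With three categories (positive, negative, neutral) and three annotators, a three-way disagreement is the only case with no majority, which is why a single fallback label is enough in this sketch.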

Files in This Item:

File	Description	Size	Format
Cover, Lembar Pengesahan, Prakata, Daftar Isi.pdf (Restricted Access)	Cover	414.26 kB	Adobe PDF
G1501202063_Rahmatin Nur Amalia.pdf (Restricted Access)	Full Text	2.36 MB	Adobe PDF
Lampiran.pdf (Restricted Access)	Lampiran	361.36 kB	Adobe PDF

