Ekstraksi Informasi Interaksi Senyawa Protein pada Kumpulan Dokumen Abstrak yang Mengandung Kalimat Majemuk

Afriza, Aulia

Please use this identifier to cite or link to this item: http://repository.ipb.ac.id/handle/123456789/105825

Title:	Ekstraksi Informasi Interaksi Senyawa Protein pada Kumpulan Dokumen Abstrak yang Mengandung Kalimat Majemuk
Other Titles:	Information Extraction of Compound-Protein Interaction in A Collection of Abstract Documents Contains Compound Sentences
Authors:	Kusuma, Wisnu Ananta; Annisa; Afriza, Aulia
Issue Date:	Feb-2021
Publisher:	IPB University
Abstract:	Drug-target interaction (DTI) identification is an important step in drug discovery that aims to find compounds useful for treating or curing diseases. Information from DTI research is found mostly in public domain databases and in the research literature. DTI information in public domain databases remains limited because few researchers deposit their results there. Obtaining it therefore requires another route: extracting DTI-related information from the literature. Many abstracts of research papers contain compound sentences, and regular expressions can be used to identify them. This study uses regular expressions to identify compound sentences and a machine learning approach to classify text that carries compound-protein interaction information. It uses the collection of abstracts from Muztahid (2019), which comprises 3,000 abstracts parsed into 29,363 sentences. The parsed sentences then undergo Named Entity Recognition (NER). Of the 29,363 sentences, NER detected 7,653 that contain compounds and/or proteins. In the next stage, regular-expression pattern matching produced 17,451 simple (single-clause) sentences, which were passed to text preprocessing. The preprocessed sentences were divided into training and test data using 10-fold cross-validation. Interactions between compounds and proteins in abstracts containing compound sentences were identified using Bernoulli Naive Bayes. The evaluation showed an average accuracy of 76.44%.
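The two core steps named in the abstract, regular-expression splitting of compound sentences and Bernoulli Naive Bayes classification, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the boundary pattern `CLAUSE_BOUNDARY`, the function names, the toy training sentences, the labels (1 = interaction, 0 = no interaction), and the Laplace smoothing value. NER, the full text preprocessing, and the 10-fold cross-validation are omitted.

```python
import math
import re
from collections import defaultdict

# ASSUMPTION: a hypothetical clause-boundary pattern (comma or semicolon plus a
# coordinating conjunction, or a bare semicolon). The patterns actually used in
# the thesis are not given in the abstract.
CLAUSE_BOUNDARY = re.compile(
    r"\s*(?:[,;]\s*(?:and|but|or|whereas|while)\s+|;\s*)", re.IGNORECASE
)


def split_compound(sentence):
    """Split a compound sentence into simple clauses at the assumed boundaries."""
    parts = CLAUSE_BOUNDARY.split(sentence.strip().rstrip("."))
    return [p.strip() for p in parts if p.strip()]


def tokenize(text):
    """Lowercase the text and return its set of tokens (binary presence features)."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))


class BernoulliNB:
    """From-scratch Bernoulli Naive Bayes over token-presence features."""

    def fit(self, docs, labels, alpha=1.0):
        self.vocab = sorted(set().union(*docs))
        self.classes = sorted(set(labels))
        n_docs = defaultdict(int)                      # documents per class
        doc_freq = defaultdict(lambda: defaultdict(int))  # per-class document frequency
        for tokens, label in zip(docs, labels):
            n_docs[label] += 1
            for token in tokens:
                doc_freq[label][token] += 1
        self.log_prior = {c: math.log(n_docs[c] / len(docs)) for c in self.classes}
        # P(token present | class) with Laplace smoothing (alpha is an assumption).
        self.p_present = {
            c: {t: (doc_freq[c][t] + alpha) / (n_docs[c] + 2 * alpha) for t in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, tokens):
        def log_score(c):
            score = self.log_prior[c]
            # Bernoulli model: absent vocabulary terms contribute log(1 - p) too.
            for t in self.vocab:
                p = self.p_present[c][t]
                score += math.log(p if t in tokens else 1.0 - p)
            return score

        return max(self.classes, key=log_score)


# Toy training data, invented for illustration only.
TRAIN = [
    ("Aspirin inhibits COX-1", 1),
    ("Ibuprofen binds COX-2", 1),
    ("Gefitinib inhibits EGFR", 1),
    ("The study enrolled 40 patients", 0),
    ("Samples were stored at room temperature", 0),
    ("The results were published", 0),
]
model = BernoulliNB().fit([tokenize(s) for s, _ in TRAIN], [y for _, y in TRAIN])


def classify_sentence(sentence):
    """Split a sentence into clauses and label each clause with the toy model."""
    return [(c, model.predict(tokenize(c))) for c in split_compound(sentence)]
```

Calling `classify_sentence("Imatinib inhibits BCR-ABL, and patients were enrolled.")` splits the input into two clauses and labels each one separately, which is the reason for splitting first: a single label for the whole compound sentence would mix an interaction clause with an unrelated one.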
Description:	Please process promptly for library clearance (bebas pustaka)
URI:	http://repository.ipb.ac.id/handle/123456789/105825
Appears in Collections:	MT - Mathematics and Natural Science

Files in This Item:

File	Description	Size	Format
G651180504-Aulia Afriza_.pdf (Restricted Access)	Full Text	10.32 MB	Adobe PDF
G651180504-Aulia Afriza - Cover.pdf (Restricted Access)	Cover	3.77 MB	Adobe PDF
G651180504-Aulia Afriza_-Lampiran.pdf (Restricted Access)	Appendix	586.47 kB	Adobe PDF
