Please use this identifier to cite or link to this item: https://zone.biblio.laurentian.ca/handle/10219/3684
Title: Opinion mining of online users’ comments using Natural Language Processing and machine learning
Authors: Pazooki, Anahita (Elham)
Keywords: Data mining;opinion mining;Natural Language Processing (NLP);data pre-processing;word tokenization;stemming;term frequency-inverse document frequency;supervised machine learning;random forest;gradient boosting;decision trees;SVMs;gini-index;artificial neural network
Issue Date: 28-Aug-2020
Abstract: With the widespread popularity of the World Wide Web, an increasing number of people are active on social media and websites, posting their opinions about products or special events, or making decisions based on the opinions and experiences of others. These online opinions are unstructured or structured textual data containing both significant and insignificant information, which has attracted the attention of researchers seeking to extract knowledge from such data. Opinion mining and Natural Language Processing (NLP) techniques help to find information in the huge number of reviews that take the form of unstructured comments. In this research, a model is proposed for classifying online users' feedback and opinions, with the aim of improving the accuracy and precision of classification compared with existing research on the same dataset. More precisely, NLP techniques and various supervised machine learning techniques are used to classify users' opinions, and the performance of all the classifiers is evaluated to find the best one. The dataset contains 689 comments extracted from users' comments on Amazon.com, collected and annotated by Minqing Hu and Bing Liu. The selected comments are about the product “Speakers” on Amazon.com. Each comment is written by one user and carries a label reflecting the attitude the author expresses in it. This label is "positive", "negative" or "neutral". The data is provided as an XML file, a semi-structured format. The opinions are pre-processed using NLP techniques, for instance by removing punctuation, URLs, numbers, extra spaces and stop-words, and their features are extracted using techniques such as word tokenization, stemming, bag of words, bag of n-grams and Term Frequency-Inverse Document Frequency (TF-IDF).
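The pre-processing and feature-extraction steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the thesis code: the stop-word list, the function names, the sample comments and the particular TF-IDF variant (tf = count / length, idf = ln(N / df)) are all assumptions, and stemming is left out.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller
# list, for example the one distributed with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "this", "and", "or", "of", "to", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(comment):
    """Lowercase, strip URLs, punctuation and digits, tokenize, drop stop-words."""
    text = URL_RE.sub(" ", comment.lower())
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()  # whitespace tokenization also collapses extra spaces
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the word n-grams of a token list (n=1 gives a plain bag of words)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs_terms):
    """Per-document TF-IDF weights, with tf = count / length and idf = ln(N / df)."""
    n_docs = len(docs_terms)
    df = Counter(term for terms in docs_terms for term in set(terms))
    weights = []
    for terms in docs_terms:
        counts = Counter(terms)
        total = len(terms)
        weights.append({t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights


# Invented sample comments, not taken from the dataset.
comments = [
    "The bass is great!!! See http://example.com for 2 photos.",
    "The bass is weak and the remote broke.",
]
docs = [preprocess(c) for c in comments]
bigram_docs = [ngrams(d, 2) for d in docs]
unigram_weights = tfidf(docs)
```

With this variant, a term that occurs in every comment (here "bass") gets weight zero, while terms specific to one comment are weighted up. Passing `bigram_docs` to `tfidf` instead gives weights over a bigram feature space.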
The proposed method was implemented in the Python programming language using the Natural Language Toolkit (NLTK) and other Python libraries. The proposed model reached a peak precision of 88% using Random Forest with 140 trees and a bigram feature space. Random Forest, Gradient Boosting, Artificial Neural Network, and SVM also gave 87% precision for the trigram feature space.
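For readers unfamiliar with the metric reported above, precision for a class is the share of comments predicted as that class that truly belong to it. The sketch below computes it per class and as an unweighted (macro) average over the classes that were predicted at least once. The abstract does not say how precision was averaged across the three labels, so the macro average, the function names and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Precision per predicted label: correct predictions / all predictions of that label."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in predicted}


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precisions (labels never predicted are skipped)."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented toy labels, not results from the thesis.
y_true = ["positive", "positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "neutral", "negative", "positive"]

scores = per_class_precision(y_true, y_pred)
average = macro_precision(y_true, y_pred)
```

In this toy example one "positive" comment is predicted as "negative", which lowers the precision of the "negative" class to 2/3 while the other two classes stay at 1.0.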
URI: https://zone.biblio.laurentian.ca/handle/10219/3684
Appears in Collections: Computational Sciences - Master's theses
Master's Theses

Files in This Item:
File: Anahita(Elham) Pazooki_FINAL Thesis_4Sep2020.pdf
Size: 2.55 MB
Format: Adobe PDF


Items in LU|ZONE|UL are protected by copyright, with all rights reserved, unless otherwise indicated.