Please use this identifier to cite or link to this item: https://zone.biblio.laurentian.ca/handle/10219/3847
Title: Distinguishing fake and real news of twitter data with the help of machine learning techniques
Authors: Shah, Aanan
Keywords: twitter;fake news;real news;data;machine learning techniques
Issue Date: 30-Mar-2021
Abstract: News articles influence people's beliefs and views about various circumstances. Some news publishers with a political or ideological bias therefore try to spread news that is distorted or entirely false. This thesis develops a machine learning model that distinguishes fake news from real news with the aid of natural language processing. Natural language processing was used to preprocess the text. General features such as the number of words, sentences, stopwords, non-alphabetic words, verbs, nouns, and adjectives were identified. Stopwords and hyperlinks were removed to clean the text data. In the preprocessing step, after the data had been cleaned and the stopwords removed, the position of each word was concatenated with the word itself. This procedure helps to distinguish whether a word acts as a noun, a pronoun, an adjective, or a verb in a sentence. After preprocessing, feature extraction methods were used to convert the news text into analyzable data. The frequency of the words in each article was used to filter out non-informative words. Three feature extraction methods were used in this study: count vectorizer, Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, and word2vec embedding. The TF-IDF feature extraction method produced better results than the other two methods. After feature extraction, several machine learning models were trained: Naive Bayes, Logistic Regression, Random Forest, K-nearest neighbors (KNN), and Support Vector Machine (SVM). A Recurrent Neural Network (RNN) was also used as a deep learning model. The models were tested on two datasets. On the first dataset, SVM achieved an accuracy of 98.5% and RNN achieved 98.03%, a substantial improvement over the best result of Agarwalla et al., 2019 (83.16% accuracy). On the second dataset, SVM achieved an accuracy of 97.76%, RNN achieved 97.1%, and Logistic Regression achieved 97.50%, an improvement over the best result of Vijayraghavan et al., 2020 (94.88% accuracy).
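The cleaning and TF-IDF steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the stopword list, the regular expressions, the sample sentences, and the function names `clean` and `tfidf` are all assumptions made for illustration, and the weighting shown is the plain tf * log(N / df) form, whereas library vectorizers often apply smoothing and normalization on top of it.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the thesis).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # hyperlinks to strip
TOKEN_RE = re.compile(r"[a-z]+")               # alphabetic tokens only


def clean(text):
    """Lowercase the text, drop hyperlinks, and remove stopwords."""
    text = URL_RE.sub(" ", text.lower())
    return [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [clean(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Made-up sample texts, for illustration only.
docs = [
    "Breaking: the moon is made of cheese https://example.com/moon",
    "The council approved the budget on Monday",
    "Scientists confirm the moon landing photos are real",
]
vectors = tfidf(docs)
```

In this sketch a term that occurs in fewer documents, such as "cheese", receives a higher weight than one shared across documents, such as "moon". The resulting weight dictionaries would then be turned into fixed-length vectors and passed to a classifier such as SVM or Logistic Regression.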
URI: https://zone.biblio.laurentian.ca/handle/10219/3847
Appears in Collections: Computational Sciences - Master's theses

Files in This Item:
File: Thesis FINAL - Aanan Shah - 03-Mar-2021.pdf
Size: 1.49 MB
Format: Adobe PDF


Items in LU|ZONE|UL are protected by copyright, with all rights reserved, unless otherwise indicated.