Fake News Detection Intro using Machine Learning (ML) Models and Natural Language Processing (NLP)

Fake news is all around us, whether we can identify it or not. Individuals and organizations publish fake news all the time, whether as a persuasion tactic or simply to drown out unfavorable truths. Take the search for a Covid-19 vaccine, an issue that is especially relevant today. Before a vaccine came out, some sources stated that a fully effective vaccine was already available, some stated that one was coming very soon, and others stated that it would take decades for a safe and functional one to be released. Trusting and following the wrong source can do more harm than good.

Now the question becomes: which websites do we trust, and which do we ignore? It is not always clear which sites to trust or reject, or which are real and which are fake.

Fortunately, Big Data can save the day! In today's world of ever-growing data streams, one can imagine crunching through the volumes of data to detect patterns, which can then be analyzed to separate real news from fake.

That is exactly the project I executed: a fake news detection machine learning model that uses natural language processing techniques to classify news websites as either fake or real.

This machine learning model uses binary classification to identify whether a news site is fake or real: an output of '1' indicates that the website is most likely fake, and '0' indicates that the site is most likely trustworthy. It takes a list of website URLs and the corresponding raw HTML as input data and trains a logistic regression model to output a label of either 0 or 1.
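The classification setup above can be sketched in a few lines. This is a minimal illustration, not the author's actual code: the toy numeric features and values stand in for the outputs of the real URL/HTML featurizers described later.

```python
from sklearn.linear_model import LogisticRegression

# Toy numeric feature vectors (in the real pipeline these come from
# URL/HTML featurizers). Label 1 = likely fake, 0 = likely real.
X_train = [[0.9, 120], [0.8, 95], [0.1, 10], [0.2, 15]]
y_train = [1, 1, 0, 0]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict a label for a new, unseen site's feature vector.
pred = clf.predict([[0.85, 100]])[0]
```

The model's output is always 0 or 1, matching the binary labeling scheme described above.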

The core of this model is the set of natural language processing techniques used to transform the input data, originally in the form of words, into numbers that the machine can understand and learn from. I transformed this data with several functions generally referred to as featurizers. The purpose of these featurizers is to extract key features of the URL and HTML that may help predict the trustworthiness of the site, and to convert them into numerical values to feed into the logistic regression model.

To obtain the data necessary for my model, I scraped the web for news websites and compiled a set of 2557 sites, roughly 50% fake and 50% real. I then split the data into a training set, a cross-validation set, and a test set.
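A three-way split like this is commonly done with two successive calls to `train_test_split`. The 70/15/15 proportions below are an assumption for illustration; the source does not state the exact split:

```python
from sklearn.model_selection import train_test_split

sites = list(range(2557))  # stand-ins for (url, html, label) records

# First carve off 30%, then split that half-and-half into
# cross-validation and test sets (assumed 70/15/15 proportions).
train, rest = train_test_split(sites, test_size=0.3, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)
```

Fixing `random_state` makes the split reproducible across runs.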

My first baseline featurizer was a domain featurizer that extracts basic features from the domain name extension of each website. It takes a URL and its raw HTML and returns a dictionary mapping feature descriptions to numerical features. The accuracy of this model was only 55%, which was not surprising: the domain extension, while it might provide some clues, cannot be a deterministic predictor of a website's trustworthiness.
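A domain featurizer of this kind can one-hot encode the top-level domain. The list of extensions below is an assumed example set; the HTML argument is accepted but unused, matching the (url, html) featurizer signature:

```python
from urllib.parse import urlparse

COMMON_TLDS = ["com", "org", "net", "gov", "edu"]  # assumed feature set

def domain_featurizer(url, html):
    """One-hot the domain extension. A sketch of the baseline, not the author's exact code."""
    hostname = urlparse(url).hostname or ""
    tld = hostname.rsplit(".", 1)[-1]
    return {f"domain is .{t}": int(tld == t) for t in COMMON_TLDS}

feats = domain_featurizer("https://news.example.org/story", "<html></html>")
```

With only a handful of binary features, it is unsurprising that such a model barely beats chance.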

The key problem with this model is that there is simply not enough information. To address this, my next step was to feed specific (and potentially predictive) keywords from the HTML into the logistic regression model, in addition to the domain extension. After a series of steps, this keyword-based logistic regression model reached an accuracy of 73%.
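A keyword featurizer can count occurrences of chosen words in the page HTML. The keyword list below is hypothetical; the source does not name the keywords it used:

```python
KEYWORDS = ["shocking", "exclusive", "breaking", "miracle"]  # assumed keyword list

def keyword_featurizer(url, html):
    """Count occurrences of each (assumed) keyword in the lowercased HTML."""
    text = html.lower()
    return {f"count of '{kw}'": text.count(kw) for kw in KEYWORDS}

feats = keyword_featurizer(
    "http://example.com",
    "<p>SHOCKING! Miracle cure, truly shocking.</p>",
)
```

Lowercasing first means "SHOCKING" and "shocking" are counted as the same keyword.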

The model performed considerably better than the domain method, but as this is still a relatively simple approach, I started to think of more nuanced ones. The meta descriptions in a website's HTML are a great source of information conveying the core content of that site. As an improvement to my keyword featurizer, I used the bag-of-words NLP model on these descriptions. Once I obtained the score reports for this model, I observed that all of the metrics yielded much higher percentages than before.
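Bag-of-words turns each description into a vector of word counts over a shared vocabulary. A common way to do this is scikit-learn's `CountVectorizer`; the two descriptions below are invented examples standing in for text pulled from sites' `<meta>` tags:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical meta descriptions extracted from each site's HTML.
descriptions = [
    "miracle cure doctors hate",
    "city council approves budget",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(descriptions)  # one count vector per site
vocab_size = len(vectorizer.vocabulary_)
```

Each row of `bow` is a sparse count vector whose columns correspond to the learned vocabulary, ready to feed into the logistic regression model.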

A shortcoming of the bag-of-words model is that it only looks at the counts of words in each website's description. I wondered whether there was a way to somehow capture the meaning of the words in the descriptions. This is where word vectors come in: I used a pretrained model called GloVe to accomplish this task. It yielded an accuracy of about 87%.
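A standard way to turn GloVe word vectors into one vector per description is to average them. The tiny 3-dimensional embedding table below is made up purely for illustration; real GloVe vectors would be loaded from a pretrained file (for example, 50- or 300-dimensional vectors from the Stanford NLP release):

```python
# Toy stand-in for a GloVe embedding table; the numbers are invented.
glove = {
    "miracle": [0.9, 0.1, 0.0],
    "cure":    [0.8, 0.2, 0.1],
    "budget":  [0.0, 0.9, 0.3],
}

def description_vector(description):
    """Average the word vectors of known words; zeros if no word is known."""
    vecs = [glove[w] for w in description.lower().split() if w in glove]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

vec = description_vector("Miracle cure found")
```

Words missing from the vocabulary (here, "found") are simply skipped, a common and simple handling of out-of-vocabulary tokens.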

Having tried several different featurizers and observed the score reports for each, I was curious whether I would obtain improved results by combining all of the featurization approaches. I passed the concatenated feature vector into my logistic regression model and obtained an accuracy of 91%, the highest yet.
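Combining featurizers amounts to concatenating their outputs into one flat vector with a fixed feature order. A sketch, using hypothetical feature names from the earlier examples:

```python
def combine_features(*feature_dicts):
    """Merge several featurizers' outputs into one flat vector (sketch).

    Returns the vector and the feature names in a fixed (sorted) order,
    so every site yields the same vector layout.
    """
    combined = {}
    for d in feature_dicts:
        combined.update(d)
    names = sorted(combined)
    return [combined[k] for k in names], names

vector, names = combine_features(
    {"domain is .com": 1},        # hypothetical domain features
    {"count of 'shocking'": 2},   # hypothetical keyword features
)
```

Sorting the keys guarantees that the same feature always lands in the same position of the vector across all sites, which the logistic regression model relies on.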

It was then time to evaluate the model on the unseen test data to obtain its real accuracy. The score reports showed that my model predicted the trustworthiness of news websites with 91% accuracy.
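Score reports like the ones mentioned throughout can be produced with scikit-learn's metrics. The labels below are invented to illustrate the computation (one error in eleven predictions happens to land near the reported 91%):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical test labels and predictions (1 = fake, 0 = real).
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]

acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # precision/recall/F1 per class
```

`classification_report` gives the per-class precision, recall, and F1 breakdown that a bare accuracy number hides.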

As with any machine learning model, there is room to improve the score metrics even further, for example by obtaining a larger dataset or developing more featurization approaches.