Fake News Detection Intro using Machine Learning (ML) Models and Natural Language Processing (NLP)

Fake news is all around us, whether we can identify it or not. Individuals and organizations publish fake news all the time, whether as a persuasion tactic or simply to drown out unfavorable truths. Take the search for a Covid-19 vaccine, an issue that is especially relevant today. Before a vaccine came out, some sources stated that a fully effective vaccine was already available, some stated that one was coming very soon, and others stated that it would take decades for a safe and functional one to be released. Trusting and following the wrong source can do more harm than good.

Now the question becomes: which websites do we trust, and which do we ignore? It is not always clear which sites to trust or reject, or which are real and which are fake.

Fortunately, Big Data can save the day! In today's world of ever-growing data streams, one can imagine crunching through the volumes of data to detect patterns, which can then be analyzed to separate real news from fake.

That is exactly the project I executed: a fake news detection machine learning model that uses natural language processing techniques to classify news websites as either fake or real.

This machine learning model uses binary classification to identify whether a news site is fake or real: an output of '1' indicates that the website is most likely fake, and '0' indicates that the site is most likely trustworthy. It takes a list of website URLs and the corresponding raw HTML as input data and trains a logistic regression model to output a label of either 0 or 1.
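The classification setup above can be sketched in a few lines. This is a minimal illustration, not the author's actual code: the toy numeric features and values stand in for the outputs of the real URL/HTML featurizers described later.

```python
from sklearn.linear_model import LogisticRegression

# Toy numeric feature vectors (in the real pipeline these come from
# URL/HTML featurizers). Label 1 = likely fake, 0 = likely real.
X_train = [[0.9, 120], [0.8, 95], [0.1, 10], [0.2, 15]]
y_train = [1, 1, 0, 0]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict a label for a new, unseen site's feature vector.
pred = clf.predict([[0.85, 100]])[0]
```

The model's output is always 0 or 1, matching the binary labeling scheme described above.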

The core of this model is the set of natural language processing techniques used to transform the input data, originally in the form of words, into numbers that the machine can understand and learn from. I transformed this data with several functions generally referred to as featurizers. The purpose of these featurizers is to extract key features of the URL and HTML that may help predict the trustworthiness of the site, and to convert them into numerical values to feed into the logistic regression model.

To obtain the data necessary for my model, I scraped the web for news websites and compiled a set of 2557 sites, roughly 50% fake and 50% real. I then split the data into a training set, a cross-validation set, and a test set.
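A three-way split like this is commonly done with two successive calls to `train_test_split`. The 70/15/15 proportions below are an assumption for illustration; the source does not state the exact split:

```python
from sklearn.model_selection import train_test_split

sites = list(range(2557))  # stand-ins for (url, html, label) records

# First carve off 30%, then split that half-and-half into
# cross-validation and test sets (assumed 70/15/15 proportions).
train, rest = train_test_split(sites, test_size=0.3, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)
```

Fixing `random_state` makes the split reproducible across runs.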

My first baseline featurizer was a domain featurizer that extracts basic features from the domain name extension of each website. It takes a URL and its raw HTML and returns a dictionary mapping feature descriptions to numerical features. The accuracy of this model was only 55%, which was not surprising: the domain extension, while it might provide some clues, cannot be a deterministic predictor of a website's trustworthiness.
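A domain featurizer of this kind can one-hot encode the top-level domain. The list of extensions below is an assumed example set; the HTML argument is accepted but unused, matching the (url, html) featurizer signature:

```python
from urllib.parse import urlparse

COMMON_TLDS = ["com", "org", "net", "gov", "edu"]  # assumed feature set

def domain_featurizer(url, html):
    """One-hot the domain extension. A sketch of the baseline, not the author's exact code."""
    hostname = urlparse(url).hostname or ""
    tld = hostname.rsplit(".", 1)[-1]
    return {f"domain is .{t}": int(tld == t) for t in COMMON_TLDS}

feats = domain_featurizer("https://news.example.org/story", "<html></html>")
```

With only a handful of binary features, it is unsurprising that such a model barely beats chance.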

The key problem with this model is that there is simply not enough information. To address this, my next step was to feed specific (and potentially predictive) keywords from the HTML into the logistic regression model, in addition to the domain extension. After a series of steps, this keyword-based logistic regression model reached an accuracy of 73%.
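A keyword featurizer can count occurrences of chosen words in the page HTML. The keyword list below is hypothetical; the source does not name the keywords it used:

```python
KEYWORDS = ["shocking", "exclusive", "breaking", "miracle"]  # assumed keyword list

def keyword_featurizer(url, html):
    """Count occurrences of each (assumed) keyword in the lowercased HTML."""
    text = html.lower()
    return {f"count of '{kw}'": text.count(kw) for kw in KEYWORDS}

feats = keyword_featurizer(
    "http://example.com",
    "<p>SHOCKING! Miracle cure, truly shocking.</p>",
)
```

Lowercasing first means "SHOCKING" and "shocking" are counted as the same keyword.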

The model performed considerably better than the domain method, but as this is still a relatively simple approach, I started to think of more nuanced ones. The meta descriptions in a website's HTML are a great source of information conveying the core content of that site. As an improvement to my keyword featurizer, I used the bag-of-words NLP model on these descriptions. Once I obtained the score reports for this model, I observed that all of the metrics yielded much higher percentages than before.
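Bag-of-words turns each description into a vector of word counts over a shared vocabulary. A common way to do this is scikit-learn's `CountVectorizer`; the two descriptions below are invented examples standing in for text pulled from sites' `<meta>` tags:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical meta descriptions extracted from each site's HTML.
descriptions = [
    "miracle cure doctors hate",
    "city council approves budget",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(descriptions)  # one count vector per site
vocab_size = len(vectorizer.vocabulary_)
```

Each row of `bow` is a sparse count vector whose columns correspond to the learned vocabulary, ready to feed into the logistic regression model.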

A shortcoming of the bag-of-words model is that it only looks at the counts of words in each website's description. I wondered whether there was a way to somehow capture the meaning of the words in the descriptions. This is where word vectors come in: I used a pretrained model called GloVe to accomplish this task. It yielded an accuracy of about 87%.
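A standard way to turn GloVe word vectors into one vector per description is to average them. The tiny 3-dimensional embedding table below is made up purely for illustration; real GloVe vectors would be loaded from a pretrained file (for example, 50- or 300-dimensional vectors from the Stanford NLP release):

```python
# Toy stand-in for a GloVe embedding table; the numbers are invented.
glove = {
    "miracle": [0.9, 0.1, 0.0],
    "cure":    [0.8, 0.2, 0.1],
    "budget":  [0.0, 0.9, 0.3],
}

def description_vector(description):
    """Average the word vectors of known words; zeros if no word is known."""
    vecs = [glove[w] for w in description.lower().split() if w in glove]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

vec = description_vector("Miracle cure found")
```

Words missing from the vocabulary (here, "found") are simply skipped, a common and simple handling of out-of-vocabulary tokens.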

Having tried several different featurizers and observed the score reports for each, I was curious whether I would obtain improved results by combining all of the featurization approaches. I passed the concatenated feature vector into my logistic regression model and obtained an accuracy of 91%, the highest yet.
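Combining featurizers amounts to concatenating their outputs into one flat vector with a fixed feature order. A sketch, using hypothetical feature names from the earlier examples:

```python
def combine_features(*feature_dicts):
    """Merge several featurizers' outputs into one flat vector (sketch).

    Returns the vector and the feature names in a fixed (sorted) order,
    so every site yields the same vector layout.
    """
    combined = {}
    for d in feature_dicts:
        combined.update(d)
    names = sorted(combined)
    return [combined[k] for k in names], names

vector, names = combine_features(
    {"domain is .com": 1},        # hypothetical domain features
    {"count of 'shocking'": 2},   # hypothetical keyword features
)
```

Sorting the keys guarantees that the same feature always lands in the same position of the vector across all sites, which the logistic regression model relies on.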

It was then time to evaluate the model on the unseen test data to obtain its real accuracy. The score reports showed that my model predicted the trustworthiness of news websites with 91% accuracy.
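Score reports like the ones mentioned throughout can be produced with scikit-learn's metrics. The labels below are invented to illustrate the computation (one error in eleven predictions happens to land near the reported 91%):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical test labels and predictions (1 = fake, 0 = real).
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]

acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # precision/recall/F1 per class
```

`classification_report` gives the per-class precision, recall, and F1 breakdown that a bare accuracy number hides.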

As with any machine learning model, there is room to improve the score metrics even further, for example by obtaining a larger dataset or developing more featurization approaches.