Words as Features for Learning - Natural Language Processing With Python and NLTK p.12

For our text classification, we have to find some way to "describe" bits of data, which are labeled as either positive or negative for machine learning training purposes.

These descriptions are called "features" in machine learning. For our project, we're simply going to treat each word within a positive or negative review as a "feature" of that review.

Then, as we go on, we can train a classifier by showing it all of the features of positive and negative reviews (all the words), letting it figure out the meaningful differences between a positive review and a negative review simply by looking for words that are common in negative reviews and words that are common in positive reviews.
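
For reference, the core of this part of the series looks roughly like the sketch below. This is not a verbatim transcript of the video's code; it assumes the movie_reviews corpus has been fetched with nltk.download('movie_reviews'):

import nltk
from nltk.corpus import movie_reviews

# One (word list, label) pair per review
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Frequency distribution over every word in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words.keys())[:3000]

def find_features(document):
    # Mark, for each of the 3000 feature words, whether it appears in this review
    words = set(document)
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(rev), category) for (rev, category) in documents]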

Comments

Disabled my ad blocker just for you Sentdex. Amazing work and thank you so much for clear and concise explanations! :) <3

goodjihad

For beginners: there is already a class present in scikit-learn to get the feature words.
Use this chunk of code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word')

It also performs the count vectorization of the sentences.

SurajKumar-bwoi
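
For anyone following that suggestion, a minimal self-contained sketch (the two example sentences are made up for illustration; get_feature_names_out needs scikit-learn 1.0+, older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was great", "this movie was terrible"]

vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix

print(vectorizer.get_feature_names_out())    # the extracted feature words
print(X.toarray())                           # per-document word counts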

Love these tutorials. They are fantastic!! But I went through this one 3 times (rewinding a lot each time) and I still don't understand about 1/4 of it. A diagram or flow-chart might help a lot, I think. I do plan to run it with lots of print statements, though - I haven't done that yet, my bad. I'm sure seeing the contents of the data structures that are created would help a lot.

davidsimmonds

About your "word_features" line: I don't think this gives you the top 3000 features, since list(all_words.keys()) has no frequency ordering, right?
Maybe we can use something like "word_features = [w[0] for w in sorted(all_words.items(), key=lambda kv: kv[1], reverse=True)[:3000]]" (a tuple-unpacking lambda like "lambda (k, v): v" only works in Python 2).

linsongchu

At 1:22, 'all_words.keys()' doesn't return the keys sorted from highest frequency to lowest, so we are not getting the top 3000 words but instead the first 3000 unique words from the dataset.
To get the list of top 3000 words we can use:
'word_features = [tupl[0] for tupl in all_words.most_common(3000)]'
since 'all_words.most_common()' returns a list of tuples (each tuple containing a word and its count), sorted from most to least frequent.

aravindsivalingam

I am not able to understand some portion of it.

while doing

all_words = nltk.FreqDist(all_words) 
word_features = all_words.keys()[0:3000]

in the 1st step we get a dictionary-like FreqDist of words, but the words are NOT arranged by their frequency count. So all_words.keys()[0:3000] may contain useless tokens like ',', '.', '-', etc.

To get a better feature set we can do something like this

from nltk.corpus import movie_reviews, stopwords

stpwrd = dict((sw, True) for sw in stopwords.words('english'))
all_words = [w.lower() for w in movie_reviews.words() if len(w) > 3 and not stpwrd.get(w.lower())]

In this way we remove not only any word of three characters or fewer but also the stopwords. Hopefully you will do something like this in later videos. Please comment if I am doing something wrong.

Amit-pfri

Why can't we remove the punctuation and the stopwords? That way we will get only the "important" words.

Catatafish
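
Removing both is straightforward; one sketch, assuming the stopwords corpus has been downloaded with nltk.download('stopwords'):

from nltk.corpus import movie_reviews, stopwords

stop_words = set(stopwords.words('english'))

# w.isalpha() drops punctuation and numbers; the set lookup drops stopwords
filtered_words = [w.lower() for w in movie_reviews.words()
                  if w.isalpha() and w.lower() not in stop_words]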

I'm not getting the actual logic behind this!
Can anyone explain it in detail?

ishanpatil

This line: featuresets = [(find_features(rev), category) for (rev, category) in documents]
can be rewritten as:

featuresets = []
for d in documents:
    dtuple = (find_features(d[0]), d[1])
    featuresets.append(dtuple)

featuresets = [] : creates an empty list
for d in documents: accesses every element of documents. Remember that documents is a list of tuples, where each tuple contains 2 elements: the first element is the list of words of a review and the second element is its category (pos/neg)
dtuple: a temporary tuple which will contain 2 elements, i.e. the feature dict returned by find_features and the review's category (pos/neg)
d[0]: the first element of each tuple in documents
d[1]: the second element of each tuple in documents
featuresets.append(dtuple): appends each temporary dtuple to featuresets

physicsgurukul

Harrison, thank you so much for your videos, I am learning a lot with them! :)

I am a beginner and I am trying to determine who won a lawsuit based on some sentences. So I am importing data from Excel, and I followed the steps from your tutorial, but I couldn't adapt this line:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

how can I access something equivalent to "rev" and "category" from Excel?

gabiayako
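
One way to adapt that line to spreadsheet data is sketched below, with hypothetical column names "sentence" and "winner" (reading .xlsx files needs pandas plus openpyxl; nltk.word_tokenize needs the punkt tokenizer models; find_features is the function from the video):

import nltk
import pandas as pd

df = pd.read_excel("lawsuits.xlsx")   # hypothetical filename

# Build (word list, label) pairs analogous to the tutorial's documents
documents = [(nltk.word_tokenize(str(sent)), label)
             for sent, label in zip(df["sentence"], df["winner"])]

featuresets = [(find_features(words), label) for words, label in documents]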

Thank you for all the hard work you are doing. It really helped me a lot. God bless.

aliakbarsiddiqui

Just a question, please: what should I do if I would like to build a strong text classifier that takes into account lemmatization instead of stemming, or both? Any suggestions? :)
Thanks, and please continue making useful tutorials like this. I like all of your tutorials; they are great!

Bests,
Mohammed

myWorldDiscover
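
Swapping lemmatization into the feature-building step might look like this sketch (assuming the wordnet corpus is downloaded; passing a part-of-speech tag to lemmatize() would improve results further):

import nltk
from nltk.corpus import movie_reviews
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatize before counting, so inflected forms ("movies" -> "movie")
# collapse into a single feature word
all_words = nltk.FreqDist(lemmatizer.lemmatize(w.lower())
                          for w in movie_reviews.words())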

Could you possibly explain the logic behind this line of code? I understand everything up until this point:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

rohinmahesh

If a term is used less frequently but is super important for the classification, will any of the algos in your next video take that into account?

andrewdennis

Correct me if I'm wrong.
all_words contains the words from all 2000 movie reviews along with their frequencies, and we built find_features on the basis of it. In the following videos we use the same method to build a classifier by training it on the first 1900 samples. But the words in the test samples were already used to build all_words, so does this mean that we have, in effect, trained the classifier on the test samples too?
I tried to run it on a Shawshank Redemption review and it gives it a <neg>. Something's really wrong.

sanyamgupta
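
The concern is reasonable: word_features is built from the whole corpus, so the test reviews influence which features exist. A sketch of one way to avoid that leakage, continuing from the documents list built earlier in the tutorial, is to split first and build the vocabulary only from the training split:

import random

import nltk

random.shuffle(documents)
train_docs, test_docs = documents[:1900], documents[1900:]

# Build the vocabulary from training reviews only
all_words = nltk.FreqDist(w.lower() for words, _ in train_docs for w in words)
word_features = [w for w, _ in all_words.most_common(3000)]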

A pythonic version of find_features():


def find_features(document):
    words = set(document)
    return {w: w in words for w in word_features}

Amit-pfri
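
A quick usage check for that version (the sample word list is made up; word_features comes from earlier in the tutorial):

sample_review = ["this", "movie", "was", "great", "."]
features = find_features(sample_review)

# features maps each of the 3000 feature words to True or False,
# e.g. features.get("great") is True whenever "great" is a feature word
print(sum(features.values()), "feature words present")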

sentdex Why does nltk.FreqDist() return the words in a different order every time I run it (with the same dataset)?
And thanks a lot for these tutorials btw!

RutgerdeKnijf
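
A likely explanation: on Python versions before 3.7, dict iteration order was an implementation detail, and string hash randomization could change it on every run, so FreqDist.keys() could come back in a different order each time; most_common() is always sorted by count:

import nltk

fd = nltk.FreqDist(["spam", "spam", "eggs", "spam", "ham"])

print(list(fd.keys()))     # iteration order is an implementation detail
print(fd.most_common(1))   # [('spam', 3)] - always sorted by count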

I can't work out which line of the code separates the positive and negative words?

joxa

Thanks a lot man for all the tutorials!

Why does word_features = list(all_words.keys())[:3000] return a different 3000 words every time? It is supposed to return the 3000 most frequent words, right? What is .keys() for? Thanks :D

ajlu

I'm required to create a model that can discriminate between English, Afrikaans, and Dutch phrases. A labelled dataset of phrases is provided in a CSV file. How can I best solve this problem?

codeerrors
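
One hedged starting point, reusing this part's feature idea with character trigrams instead of words (the CSV filename and the column names "phrase" and "language" are hypothetical; character n-grams tend to separate closely related languages such as Dutch and Afrikaans better than whole words do):

import nltk
import pandas as pd

df = pd.read_csv("phrases.csv")   # hypothetical labelled dataset

def char_features(phrase):
    # Overlapping character trigrams as boolean features
    padded = " " + phrase.lower() + " "
    return {padded[i:i + 3]: True for i in range(len(padded) - 2)}

labeled = [(char_features(str(p)), lang)
           for p, lang in zip(df["phrase"], df["language"])]

split = int(0.8 * len(labeled))
classifier = nltk.NaiveBayesClassifier.train(labeled[:split])
print(nltk.classify.accuracy(classifier, labeled[split:]))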