Words as Features for Learning - Natural Language Processing With Python and NLTK p.12

For our text classification, we have to find some way to "describe" bits of data, which are labeled as either positive or negative for machine learning training purposes.

These descriptions are called "features" in machine learning. For our project, we're simply going to treat each word within a positive or negative review as a "feature" of that review.

Then, as we go on, we can train a classifier by showing it all of the features of positive and negative reviews (all the words), letting it figure out the meaningful differences between a positive review and a negative review simply by looking for words that are common in negative reviews and words that are common in positive reviews.
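
For reference, the core of this part of the series looks roughly like the sketch below. This is not a verbatim transcript of the video's code; it assumes the movie_reviews corpus has been fetched with nltk.download('movie_reviews'):

import nltk
from nltk.corpus import movie_reviews

# One (word list, label) pair per review
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Frequency distribution over every word in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words.keys())[:3000]

def find_features(document):
    # Mark, for each of the 3000 feature words, whether it appears in this review
    words = set(document)
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(rev), category) for (rev, category) in documents]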

Comments

Disabled my ad blocker just for you Sentdex. Amazing work and thank you so much for clear and concise explanations! :) <3

goodjihad

For beginners: there is already a class present in scikit-learn to get the feature words.
Use this chunk of code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word')

It also performs the count vectorization of the sentences.

SurajKumar-bwoi
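
For anyone following that suggestion, a minimal self-contained sketch (the two example sentences are made up for illustration; get_feature_names_out needs scikit-learn 1.0+, older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was great", "this movie was terrible"]

vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix

print(vectorizer.get_feature_names_out())    # the extracted feature words
print(X.toarray())                           # per-document word counts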

Love these tutorials. They are fantastic!! But I went through this one 3 times (rewinding a lot each time) and I still don't understand about 1/4 of it. A diagram or flow-chart might help a lot, I think. I do plan to run it with lots of print statements, though - I haven't done that yet, my bad. I'm sure seeing the contents of the data structures that are created would help a lot.

davidsimmonds

About your "word_features" line: I don't think this gives you the top 3000 features, since list(all_words.keys()) has no frequency ordering, right?
Maybe we can use something like "word_features = [w[0] for w in sorted(all_words.items(), key=lambda kv: kv[1], reverse=True)[:3000]]" (a tuple-unpacking lambda like "lambda (k, v): v" only works in Python 2).

linsongchu

At 1:22, 'all_words.keys()' doesn't return the keys sorted from highest frequency to lowest, so we are not getting the top 3000 words but instead the first 3000 unique words from the dataset.
To get the list of top 3000 words we can use:
'word_features = [tupl[0] for tupl in all_words.most_common(3000)]'
since 'all_words.most_common()' returns a list of tuples (each tuple containing a word and its count), sorted from most to least frequent.

aravindsivalingam

I am not able to understand some portion of it.

while doing

all_words = nltk.FreqDist(all_words) 
word_features = all_words.keys()[0:3000]

in the 1st step we get a dictionary-like FreqDist of words, but the words are NOT arranged by their frequency count. So all_words.keys()[0:3000] may contain useless tokens like ',', '.', '-', etc.

To get a better feature set we can do something like this

from nltk.corpus import movie_reviews, stopwords

stpwrd = dict((sw, True) for sw in stopwords.words('english'))
all_words = [w.lower() for w in movie_reviews.words() if len(w) > 3 and not stpwrd.get(w.lower())]

In this way we remove not only any word of three characters or fewer but also the stopwords. Hopefully you will do something like this in later videos. Please comment if I am doing something wrong.

Amit-pfri

Why can't we remove the punctuation and the stopwords? That way we will get only the "important" words.

Catatafish
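
Removing both is straightforward; one sketch, assuming the stopwords corpus has been downloaded with nltk.download('stopwords'):

from nltk.corpus import movie_reviews, stopwords

stop_words = set(stopwords.words('english'))

# w.isalpha() drops punctuation and numbers; the set lookup drops stopwords
filtered_words = [w.lower() for w in movie_reviews.words()
                  if w.isalpha() and w.lower() not in stop_words]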

I'm not getting the actual logic behind this!
Can anyone explain it in detail?

ishanpatil

This line: featuresets = [(find_features(rev), category) for (rev, category) in documents]
can be rewritten as:

featuresets = []
for d in documents:
    dtuple = (find_features(d[0]), d[1])
    featuresets.append(dtuple)

featuresets = [] : creates an empty list
for d in documents: accesses every element of documents. Remember that documents is a list of tuples, where each tuple contains 2 elements: the first element is the list of words of a review and the second element is its category (pos/neg)
dtuple: a temporary tuple which will contain 2 elements, i.e. the feature dict returned by find_features and the review's category (pos/neg)
d[0]: the first element of each tuple in documents
d[1]: the second element of each tuple in documents
featuresets.append(dtuple): appends each temporary dtuple to featuresets

physicsgurukul

Harrison, thank you so much for your videos, I am learning a lot with them! :)

I am a beginner and I am trying to determine who won a lawsuit based on some sentences. So I am importing data from Excel, and I followed the steps from your tutorial, but I couldn't adapt this line:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

how can I access something equivalent to "rev" and "category" from Excel?

gabiayako
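
One way to adapt that line to spreadsheet data is sketched below, with hypothetical column names "sentence" and "winner" (reading .xlsx files needs pandas plus openpyxl; nltk.word_tokenize needs the punkt tokenizer models; find_features is the function from the video):

import nltk
import pandas as pd

df = pd.read_excel("lawsuits.xlsx")   # hypothetical filename

# Build (word list, label) pairs analogous to the tutorial's documents
documents = [(nltk.word_tokenize(str(sent)), label)
             for sent, label in zip(df["sentence"], df["winner"])]

featuresets = [(find_features(words), label) for words, label in documents]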

Thank you for all the hard work you are doing. It really helped me a lot. God bless.

aliakbarsiddiqui

Just a question, please: what should I do if I would like to build a strong text classifier that takes into account lemmatization instead of stemming, or both? Any suggestions? :)
Thanks, and please continue making useful tutorials like this. I like all of your tutorials; they are great!

Bests,
Mohammed

myWorldDiscover
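
Swapping lemmatization into the feature-building step might look like this sketch (assuming the wordnet corpus is downloaded; passing a part-of-speech tag to lemmatize() would improve results further):

import nltk
from nltk.corpus import movie_reviews
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatize before counting, so inflected forms ("movies" -> "movie")
# collapse into a single feature word
all_words = nltk.FreqDist(lemmatizer.lemmatize(w.lower())
                          for w in movie_reviews.words())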

Could you possibly explain the logic behind this line of code? I understand everything up until this point:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

rohinmahesh

If a term is used less frequently but is super important for the classification, will any of the algos in your next video take that into account?

andrewdennis

Correct me if I'm wrong.
all_words contains the words from all 2000 movie reviews along with their frequencies, and we built find_features on the basis of it. In the following videos we use the same method to build a classifier by training it on the first 1900 samples. But the words in the test samples were already used to build all_words, so does this mean that we have, in effect, trained the classifier on the test samples too?
I tried to run it on a Shawshank Redemption review and it gives it a <neg>. Something's really wrong.

sanyamgupta
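
The concern is reasonable: word_features is built from the whole corpus, so the test reviews influence which features exist. A sketch of one way to avoid that leakage, continuing from the documents list built earlier in the tutorial, is to split first and build the vocabulary only from the training split:

import random

import nltk

random.shuffle(documents)
train_docs, test_docs = documents[:1900], documents[1900:]

# Build the vocabulary from training reviews only
all_words = nltk.FreqDist(w.lower() for words, _ in train_docs for w in words)
word_features = [w for w, _ in all_words.most_common(3000)]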

A pythonic version of find_features():


def find_features(document):
    words = set(document)
    return {w: w in words for w in word_features}

Amit-pfri
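
A quick usage check for that version (the sample word list is made up; word_features comes from earlier in the tutorial):

sample_review = ["this", "movie", "was", "great", "."]
features = find_features(sample_review)

# features maps each of the 3000 feature words to True or False,
# e.g. features.get("great") is True whenever "great" is a feature word
print(sum(features.values()), "feature words present")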

sentdex Why does nltk.FreqDist() return the words in a different order every time I run it (with the same dataset)?
And thanks a lot for these tutorials btw!

RutgerdeKnijf
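
A likely explanation: on Python versions before 3.7, dict iteration order was an implementation detail, and string hash randomization could change it on every run, so FreqDist.keys() could come back in a different order each time; most_common() is always sorted by count:

import nltk

fd = nltk.FreqDist(["spam", "spam", "eggs", "spam", "ham"])

print(list(fd.keys()))     # iteration order is an implementation detail
print(fd.most_common(1))   # [('spam', 3)] - always sorted by count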

I can't work out which line of the code separates the positive and negative words?

joxa

Thanks a lot man for all the tutorials!

Why does word_features = list(all_words.keys())[:3000] return a different 3000 words every time? It is supposed to return the 3000 most frequent words, right? What is .keys() for? Thanks :D

ajlu

I'm required to create a model that can discriminate between English, Afrikaans, and Dutch phrases. A labelled dataset of phrases is provided in a CSV file. How can I best solve this problem?

codeerrors
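
One hedged starting point, reusing this part's feature idea with character trigrams instead of words (the CSV filename and the column names "phrase" and "language" are hypothetical; character n-grams tend to separate closely related languages such as Dutch and Afrikaans better than whole words do):

import nltk
import pandas as pd

df = pd.read_csv("phrases.csv")   # hypothetical labelled dataset

def char_features(phrase):
    # Overlapping character trigrams as boolean features
    padded = " " + phrase.lower() + " "
    return {padded[i:i + 3]: True for i in range(len(padded) - 2)}

labeled = [(char_features(str(p)), lang)
           for p, lang in zip(df["phrase"], df["language"])]

split = int(0.8 * len(labeled))
classifier = nltk.NaiveBayesClassifier.train(labeled[:split])
print(nltk.classify.accuracy(classifier, labeled[split:]))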