TF-IDF in Python with Scikit Learn (Topic Modeling for DH 02.03)

In this video, we look at how to do tf-idf in Python with Scikit Learn.

GitHub repo:

Scikit Learn docs:

Sources:

If you enjoy this video, please subscribe. I provide all my content at no cost. If you want to support my channel, please donate via

If there's a specific video you would like to see or a tutorial series, let me know in the comments and I will try and make it.

You can follow me at:
Comments

Just for future viewers: apparently the function get_feature_names() for the vectorizer is now deprecated; when version 1.2 of sklearn is released, the function will be removed entirely (thus breaking the code in this video). The new standard is to change the line to use get_feature_names_out() instead. In my (albeit limited) set of tests, I got identical results with either function.
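
For readers hitting this now, a minimal sketch of the change (the toy documents are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["a tiny example document", "another example text"])

# Old (deprecated in scikit-learn 1.0, removed in 1.2):
# terms = vectorizer.get_feature_names()

# New:
terms = vectorizer.get_feature_names_out()
print(terms)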

wolfofthelight

Really good tutorial! I tried applying this approach to user feedback data, and it worked. Thanks a lot!

kozobrod

OMG, I was so impatient: I went ahead and corrected the cluster underscore issue after an hour of debugging, and then there it was in the video 😂

rohith

Hi William, thank you for the great tutorials and all the effort you put into this resource! When playing around with the clusters, I wondered about the best way to go back from a selection of words in a cluster to the original texts that belong to that topic. I tried a simple for-loop with multiple conditions on the data – that worked for me but didn't seem very efficient … how do you move back and forth between clusters and texts? Thanks so much, Sebastian
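
One common pattern is to use the fitted model's labels_ array, which runs parallel to the input documents; a minimal sketch with toy data:

from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real documents.
documents = ["cats and dogs", "dogs and wolves",
             "stocks and bonds", "bonds and markets"]
vectors = TfidfVectorizer().fit_transform(documents)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

# km.labels_[i] is the cluster assigned to documents[i], so grouping the
# texts by label maps each cluster back to its original documents.
clusters = defaultdict(list)
for doc, label in zip(documents, km.labels_):
    clusters[label].append(doc)

for label, docs in sorted(clusters.items()):
    print(label, docs)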

sebastianpranz

You explain complex concepts clearly and succinctly. I need to analyse 200 PDFs converted to txt files. Each file is one academic paper of 5,000-7,000 words, so I am working with a large corpus. I want to identify common themes and how they may link together. Do I create one CSV file with all the documents collated? Any suggestions on how to prepare this data? I could use a summariser, but I might lose essential points. I am new to your channel and have not seen any of your other content; if the first video had started with how to prepare the documents for analysis, it would have been helpful. Applying your tutoring to my data saves a lot of time.
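
One note for readers with the same question: no CSV is required; a list of strings, one per file, is all the vectorizer needs. A minimal sketch, assuming the files sit in a folder called papers/ (the folder name is a placeholder):

import glob

# Read every .txt file in the folder into a list of strings,
# one string per paper.
documents = []
for path in sorted(glob.glob("papers/*.txt")):
    with open(path, "r", encoding="utf-8") as f:
        documents.append(f.read())

print(len(documents))  # should be 200 for this corpus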

johnnybloem

You imported a metric from sklearn but didn't cover it in this video. I've used clustering metrics before, and the metric shows how well defined each cluster is compared to the others. Is that what it would have been used for here?
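
That description sounds like silhouette analysis; if so, a minimal sketch of sklearn's silhouette_score with toy data:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents = ["cats and dogs", "dogs and wolves",
             "stocks and bonds", "bonds and markets"]
vectors = TfidfVectorizer().fit_transform(documents)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

# silhouette_score averages, over all samples, how close each point is to
# its own cluster versus the nearest other cluster (range -1 to 1; higher
# means better-separated clusters).
print(silhouette_score(vectors, km.labels_))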

sarasharick

Hi! I'm having a problem with the code:

with open("results.txt", "w", encoding='utf-8') as f:
    for i in range(true_K):
        f.write(f"Cluster [i]")
        f.write("\n")
        for ind in order_centroids[i, :10]:
            f.write(' &s' % terms[ind], )  ### error here
        f.write("\n")

TypeError: not all arguments converted during string formatting

What should I do? I can't find an answer online. Thanks!
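
For anyone else hitting this: the format specifier is %s, not &s (and the f-string needs braces, not brackets, to interpolate i). A corrected sketch, assuming true_K, order_centroids, and terms are defined as in the tutorial:

with open("results.txt", "w", encoding="utf-8") as f:
    for i in range(true_K):
        f.write(f"Cluster {i}")   # braces, not brackets, interpolate i
        f.write("\n")
        for ind in order_centroids[i, :10]:
            f.write(" %s" % terms[ind])   # %s, not &s
        f.write("\n")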

juancruzbayonas

Could you please share the text file that's been used in this video?

geetharagiphani

Why did you find the appropriate number of clusters to be 20? Please elaborate on this. Is there a way to find it? Also, great video!

sameerpatel

Is there a way to find an optimal number of clusters?
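
There is no single right answer, but a common heuristic is the elbow method: fit KMeans for several values of k and look for where inertia stops dropping sharply. A minimal sketch with toy documents:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["cats and dogs", "dogs and wolves", "cats chase mice",
             "stocks and bonds", "bonds and markets", "markets and trading"]
vectors = TfidfVectorizer().fit_transform(documents)

# Inertia (within-cluster sum of squares) always decreases as k grows;
# the "elbow" where the drop levels off is a common choice for k.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(vectors)
    print(k, km.inertia_)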

freedman

What are the documents here? Is one description one document?

xhitijchoudhary

So for some reason my code would hang while doing the cleanup, and I figured out that the line while "  " in final is where the code was hanging. I wonder if there's another way to remove the double spaces. I know this is an old video but was just wondering lol, thanks for the content!
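
A hang there usually means the replacement inside the loop never removes the substring the condition tests for, so the loop never exits; a single-pass rewrite avoids the loop entirely. A sketch:

import re

text = "a  string   with    runs  of spaces"

# Collapse any run of whitespace to one space in a single pass,
# instead of repeatedly scanning with a while loop.
cleaned = re.sub(r"\s+", " ", text).strip()

# An equivalent without re:
also_cleaned = " ".join(text.split())

print(cleaned)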

joann

Any chance you could show an application that pulls data from a CSV? I'm trying to follow your tutorial with my own data, but I rarely have data in JSON, and I'm failing to get my data into a form that allows me to follow your work. (I'm too indoctrinated in pandas and the tidyverse.)
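
Until such a video exists, swapping the JSON loader for pandas is a short change; a sketch, where the file name and column name are placeholders for your own data:

import pandas as pd

df = pd.read_csv("my_data.csv")

# TfidfVectorizer expects an iterable of strings, one per document,
# so pull the text column out as a plain list.
documents = df["description"].astype(str).tolist()
print(documents[0])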

michaeldavies

Really enjoyed this video! I was wondering how I would be able to calculate TF-IDF for multiple documents? I believe in your video you only did TF-IDF on one document?
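
For readers with the same question: the vectorizer is built for multiple documents; pass a list of strings and each row of the output is one document. A minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the first document", "the second document is longer",
             "and a third one"]

# fit_transform over a list of strings scores every document at once:
# one row per document, one column per term in the shared vocabulary.
matrix = TfidfVectorizer().fit_transform(documents)
print(matrix.shape)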

Vickysong

I am very excited about this series of tutorials, as your focus is almost exactly what I have been looking for. However, I am hitting a couple of snags.

1. I try to install the numpy wheel you cite. I get the following error:
(sklearn-env) C:\>pip install
ERROR: is not a supported wheel on this platform.
I tried other win32 files as well.

2. On video 6, TF-IDF (topic modeling), I follow along in Jupyter Notebook until ~4'53", where I try to run the following code:
descriptions = load_data("data/trc-dn.json")
print(descriptions[0])


I get the following error

FileNotFoundError                         Traceback (most recent call last)
in <module>
----> 1 descriptions = load_data("data/trc-dn.json")
      2 print(descriptions[0])

in load_data(file)
      1 def load_data(file):
----> 2     with open(file, "r", encoding="utf-8") as f:
      3         data = json.load(f)
      4         return (data)
      5

FileNotFoundError: [Errno 2] No such file or directory: 'data/trc-dn.json'

I know this is because the file is not located where the notebook expects it to be. I have put the file in several different places, but not where the system is looking for it.
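
For anyone debugging the same error: the relative path resolves against the notebook's working directory, which can be checked directly. A sketch:

import os

# A relative path such as "data/trc-dn.json" is resolved against the
# notebook's current working directory; this shows where that is and
# whether the file is visible from there.
print(os.getcwd())
print(os.path.exists("data/trc-dn.json"))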

boblucas

Hi, thanks for this, such a great video. I do have a question though. When trying to make the tf-idf vectorizer using my own cleaned corpus, I'm getting the error 'AttributeError: 'list' object has no attribute 'lower''. I know this is because I am feeding it a list of lists. I thought it was important to feed the model a list of lists (where each sub-list is a document), given that tf-idf takes individual documents in a whole corpus into account. Of course I could solve this by changing the input, but as I said, I thought it was important that the corpus contains individual documents (i.e., lists within the list). Any idea on how to go about this? Thanks!
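
One way through this, for readers with the same error: the vectorizer wants one string per document, and joining each sub-list back into a string preserves the document boundaries that tf-idf needs. A sketch with toy data:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for a cleaned corpus: one sub-list of tokens per document.
corpus = [["first", "cleaned", "document"],
          ["second", "cleaned", "document"]]

# Join each token list back into one string; each string is still one
# document, so tf-idf is still computed per document across the corpus.
documents = [" ".join(tokens) for tokens in corpus]

matrix = TfidfVectorizer().fit_transform(documents)
print(matrix.shape)  # (number of documents, vocabulary size)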

alexcrowley

Thanks for a really nice tutorial! I was wondering if this TF-IDF technique would be useful for a corpus of tweets? I can see in the video that a single document in your corpus is fairly long, and it makes sense when you extract key terms from it. How would it work with a tweet as a single document, which is 280 characters max? Thank you!

to_ra

Can LLMs more easily do the job of TF-IDF?

zj

These videos are really great. I have been looking for a good resource for text classification with Python and am very happy I came across this, so thanks!

I had a question regarding the clusters: since this seems to tokenize and evaluate each word in a given document, is there a good way to go about identifying key phrases/entities?

For example, if I am looking at medical journals and come across something like "non small cell lung cancer", this whole phrase has a very specific meaning/importance. Is there a way to look for and classify phrases like this, rather than breaking it up into ['non', 'small', 'cell', 'lung', 'cancer'], which could result in each word ending up in a different cluster and won't return much logical value?
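
One standard answer is the vectorizer's ngram_range parameter, which keeps multi-word phrases together as features; a minimal sketch with made-up documents:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["patient with non small cell lung cancer",
             "non small cell lung cancer treatment options"]

# ngram_range=(1, 5) adds 2- to 5-word phrases to the vocabulary alongside
# single words, so domain phrases can survive as single features.
vectorizer = TfidfVectorizer(ngram_range=(1, 5))
vectorizer.fit_transform(documents)

for term in vectorizer.get_feature_names_out():
    if "lung" in term:
        print(term)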

alexwinquist

I'm not a big fan of using stopwords. I say that so that my biases are clear without going into an extended rant as to why. Why are you including months in your stopwords list? Do you have reason, in advance, to assume there are no seasonal trends of interest in your data?

malikrumi