TF-IDF in Python with Scikit Learn (Topic Modeling for DH 02.03)

In this video, we look at how to do tf-idf in Python with Scikit Learn.

GitHub repo:

Scikit Learn docs:

Sources:

If you enjoy this video, please subscribe. I provide all my content at no cost. If you want to support my channel, please donate via

If there's a specific video you would like to see or a tutorial series, let me know in the comments and I will try and make it.

You can follow me at:
Comments

Just for future viewers: apparently the function get_feature_names() for the vectorizer is now deprecated; when version 1.2 of sklearn is released, the function will be removed entirely (thus breaking the code in this video). The new standard is to change the line to use get_feature_names_out() instead. In my (albeit limited) set of tests, I got identical results with either function.
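
For readers hitting this now, a minimal sketch of the change (the toy documents are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["a tiny example document", "another example text"])

# Old (deprecated in scikit-learn 1.0, removed in 1.2):
# terms = vectorizer.get_feature_names()

# New:
terms = vectorizer.get_feature_names_out()
print(terms)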

wolfofthelight

Really good tutorial! I tried applying this approach to user feedback data, and it worked. Thanks a lot!

kozobrod

OMG, I was so impatient: I went ahead and corrected the cluster underscore issue after an hour of debugging, and then there it was in the video 😂

rohith

Hi William, thank you for the great tutorials and all the effort you put into this resource! When playing around with the clusters, I wondered about the best way to go back from a selection of words in a cluster to the original texts that belong to that topic. I tried a simple for-loop with multiple conditions on the data – that worked for me but didn't seem very efficient … how do you move back and forth between clusters and texts? Thanks so much, Sebastian
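
One common pattern is to use the fitted model's labels_ array, which runs parallel to the input documents; a minimal sketch with toy data:

from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real documents.
documents = ["cats and dogs", "dogs and wolves",
             "stocks and bonds", "bonds and markets"]
vectors = TfidfVectorizer().fit_transform(documents)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

# km.labels_[i] is the cluster assigned to documents[i], so grouping the
# texts by label maps each cluster back to its original documents.
clusters = defaultdict(list)
for doc, label in zip(documents, km.labels_):
    clusters[label].append(doc)

for label, docs in sorted(clusters.items()):
    print(label, docs)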

sebastianpranz

You explain complex concepts clearly and succinctly. I need to analyse 200 PDFs converted to txt files. Each file is one academic paper of 5,000-7,000 words, so I am working with a large corpus. I want to identify common themes and how they may link together. Do I create one CSV file with all the documents collated? Any suggestions on how to prepare this data? I could use a summariser, but I might lose essential points. I am new to your channel and have not seen any of your other content; if the first video had started with how to prepare the documents for analysis, it would have been helpful. Applying your tutoring to my data saves a lot of time.
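
One note for readers with the same question: no CSV is required; a list of strings, one per file, is all the vectorizer needs. A minimal sketch, assuming the files sit in a folder called papers/ (the folder name is a placeholder):

import glob

# Read every .txt file in the folder into a list of strings,
# one string per paper.
documents = []
for path in sorted(glob.glob("papers/*.txt")):
    with open(path, "r", encoding="utf-8") as f:
        documents.append(f.read())

print(len(documents))  # should be 200 for this corpus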

johnnybloem

You imported a metric from sklearn but didn't cover it in this video. I've used clustering metrics before, and the metric shows how well defined each cluster is compared to the others. Is that what it would have been used for here?
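
That description sounds like silhouette analysis; if so, a minimal sketch of sklearn's silhouette_score with toy data:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents = ["cats and dogs", "dogs and wolves",
             "stocks and bonds", "bonds and markets"]
vectors = TfidfVectorizer().fit_transform(documents)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

# silhouette_score averages, over all samples, how close each point is to
# its own cluster versus the nearest other cluster (range -1 to 1; higher
# means better-separated clusters).
print(silhouette_score(vectors, km.labels_))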

sarasharick

Hi! I'm having a problem with the code:

with open("results.txt", "w", encoding='utf-8') as f:
    for i in range(true_K):
        f.write(f"Cluster [i]")
        f.write("\n")
        for ind in order_centroids[i, :10]:
            f.write(' &s' % terms[ind], )  ### error here
        f.write("\n")

TypeError: not all arguments converted during string formatting

What should I do? I can't find an answer online. Thanks!
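
For anyone else hitting this: the format specifier is %s, not &s (and the f-string needs braces, not brackets, to interpolate i). A corrected sketch, assuming true_K, order_centroids, and terms are defined as in the tutorial:

with open("results.txt", "w", encoding="utf-8") as f:
    for i in range(true_K):
        f.write(f"Cluster {i}")   # braces, not brackets, interpolate i
        f.write("\n")
        for ind in order_centroids[i, :10]:
            f.write(" %s" % terms[ind])   # %s, not &s
        f.write("\n")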

juancruzbayonas

Could you please share the text file that's been used in this video?

geetharagiphani

Why did you find the appropriate number of clusters to be 20? Please elaborate on this. Is there a way to find it? Also, great video!

sameerpatel

Is there a way to find an optimal number of clusters?
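
There is no single right answer, but a common heuristic is the elbow method: fit KMeans for several values of k and look for where inertia stops dropping sharply. A minimal sketch with toy documents:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["cats and dogs", "dogs and wolves", "cats chase mice",
             "stocks and bonds", "bonds and markets", "markets and trading"]
vectors = TfidfVectorizer().fit_transform(documents)

# Inertia (within-cluster sum of squares) always decreases as k grows;
# the "elbow" where the drop levels off is a common choice for k.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(vectors)
    print(k, km.inertia_)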

freedman

What are the documents here? Is one description one document?

xhitijchoudhary

So for some reason my code would hang while doing the cleanup, and I figured out that the line while "  " in final is where the code was hanging. I wonder if there's another way to remove the double spaces. I know this is an old video but was just wondering lol, thanks for the content!
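
A hang there usually means the replacement inside the loop never removes the substring the condition tests for, so the loop never exits; a single-pass rewrite avoids the loop entirely. A sketch:

import re

text = "a  string   with    runs  of spaces"

# Collapse any run of whitespace to one space in a single pass,
# instead of repeatedly scanning with a while loop.
cleaned = re.sub(r"\s+", " ", text).strip()

# An equivalent without re:
also_cleaned = " ".join(text.split())

print(cleaned)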

joann

Any chance you could show an application that pulls data from a CSV? I'm trying to follow your tutorial with my own data, but I rarely have data in JSON, and I'm failing to get my data into a form that allows me to follow your work. (I'm too indoctrinated in pandas and the tidyverse.)
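
Until such a video exists, swapping the JSON loader for pandas is a short change; a sketch, where the file name and column name are placeholders for your own data:

import pandas as pd

df = pd.read_csv("my_data.csv")

# TfidfVectorizer expects an iterable of strings, one per document,
# so pull the text column out as a plain list.
documents = df["description"].astype(str).tolist()
print(documents[0])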

michaeldavies

Really enjoyed this video! I was wondering how I would be able to calculate TF-IDF for multiple documents? I believe in your video you only did TF-IDF on one document?
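
For readers with the same question: the vectorizer is built for multiple documents; pass a list of strings and each row of the output is one document. A minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the first document", "the second document is longer",
             "and a third one"]

# fit_transform over a list of strings scores every document at once:
# one row per document, one column per term in the shared vocabulary.
matrix = TfidfVectorizer().fit_transform(documents)
print(matrix.shape)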

Vickysong

I am very excited about this series of tutorials, as your focus is almost exactly what I have been looking for. However, I am hitting a couple of snags.

1. I try to install the numpy wheel you cite. I get the following error:
(sklearn-env) C:\>pip install
ERROR: is not a supported wheel on this platform.
I tried other win32 files as well.

2. On video 6, TF-IDF (topic modeling), I follow along in Jupyter Notebook until ~4'53", where I try to run the following code:
descriptions = load_data("data/trc-dn.json")
print(descriptions[0])


I get the following error

FileNotFoundError                         Traceback (most recent call last)
in <module>
----> 1 descriptions = load_data("data/trc-dn.json")
      2 print(descriptions[0])

in load_data(file)
      1 def load_data(file):
----> 2     with open(file, "r", encoding="utf-8") as f:
      3         data = json.load(f)
      4         return (data)
      5

FileNotFoundError: [Errno 2] No such file or directory: 'data/trc-dn.json'

I know this is because the file is not located where the notebook expects it to be. I have put the file in several different places, but not where the system is looking for it.
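
For anyone debugging the same error: the relative path resolves against the notebook's working directory, which can be checked directly. A sketch:

import os

# A relative path such as "data/trc-dn.json" is resolved against the
# notebook's current working directory; this shows where that is and
# whether the file is visible from there.
print(os.getcwd())
print(os.path.exists("data/trc-dn.json"))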

boblucas

Hi, thanks for this, such a great video. I do have a question though. When trying to make the tf-idf vectorizer using my own cleaned corpus, I'm getting the error 'AttributeError: 'list' object has no attribute 'lower''. I know this is because I am feeding it a list of lists. I thought it was important to feed the model a list of lists (where each sub-list is a document), given that tf-idf takes individual documents in a whole corpus into account. Of course I could solve this by changing the input, but as I said, I thought it was important that the corpus contains individual documents (i.e., lists within the list). Any idea on how to go about this? Thanks!
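
One way through this, for readers with the same error: the vectorizer wants one string per document, and joining each sub-list back into a string preserves the document boundaries that tf-idf needs. A sketch with toy data:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for a cleaned corpus: one sub-list of tokens per document.
corpus = [["first", "cleaned", "document"],
          ["second", "cleaned", "document"]]

# Join each token list back into one string; each string is still one
# document, so tf-idf is still computed per document across the corpus.
documents = [" ".join(tokens) for tokens in corpus]

matrix = TfidfVectorizer().fit_transform(documents)
print(matrix.shape)  # (number of documents, vocabulary size)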

alexcrowley

Thanks for a really nice tutorial! I was wondering if this TF-IDF technique would be useful for a corpus of tweets? I can see in the video that a single document in your corpus is fairly long, and it makes sense when you extract key terms from it. How would it work with a tweet as a single document, which is 280 characters max? Thank you!

to_ra

Can LLMs more easily do the job of TF-IDF?

zj

These videos are really great. I have been looking for a good resource for text classification with Python and am very happy I came across this, so thanks!

I had a question regarding the clusters: since this seems to tokenize and evaluate each word in a given document, is there a good way to go about identifying key phrases/entities?

For example, if I am looking at medical journals and come across something like "non small cell lung cancer", this whole phrase has a very specific meaning/importance. Is there a way to look for and classify phrases like this, rather than breaking it up into ['non', 'small', 'cell', 'lung', 'cancer'], which could result in each word ending up in a different cluster and won't return much logical value?
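
One standard answer is the vectorizer's ngram_range parameter, which keeps multi-word phrases together as features; a minimal sketch with made-up documents:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["patient with non small cell lung cancer",
             "non small cell lung cancer treatment options"]

# ngram_range=(1, 5) adds 2- to 5-word phrases to the vocabulary alongside
# single words, so domain phrases can survive as single features.
vectorizer = TfidfVectorizer(ngram_range=(1, 5))
vectorizer.fit_transform(documents)

for term in vectorizer.get_feature_names_out():
    if "lung" in term:
        print(term)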

alexwinquist

I'm not a big fan of using stopwords. I say that so that my biases are clear without going into an extended rant as to why. Why are you including months in your stopwords list? Do you have reason, in advance, to assume there are no seasonal trends of interest in your data?

malikrumi