End-to-End Topic Modeling Using scikit-learn

#datascience #nlp #topicmodels

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body.

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

In this video we will build a topic model from scratch on a consumer complaints dataset.
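
A minimal sketch of such a pipeline with scikit-learn, assuming a hypothetical complaints.csv file with a complaint_text column (placeholders, not the video's actual files):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical file and column names -- substitute your own dataset.
df = pd.read_csv("complaints.csv")
docs = df["complaint_text"].dropna()

# Bag-of-words features; LDA expects raw counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X = vectorizer.fit_transform(docs)

# Fit LDA with an assumed 10 topics; each row of doc_topics is a topic mixture.
lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(X)
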
Comments

Excellent concept put up in simple words and code.
I really like the coding style in these two lines.


Induraj

Thank you for the playlist. I've gone through the first video; it's quite elaborate. I will try to implement the same and enhance my learning.

thirumal

Thank you so much for a nice project for a fresher like me.

modikai

Thank you, sir, for explaining step by step a use case in topic modelling. Do we need to use n-grams in the analysis for the complaints?

saimanohar
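
For what it's worth, adding n-grams is a one-parameter change in scikit-learn; whether bigrams help the complaints analysis is something to test empirically. A minimal sketch, assuming docs is an iterable of complaint strings:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and adds bigrams such as "credit card".
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)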

I was working on news category classification, where I have to find the category of a news item based on its text. The problem I am facing is that LDA seems to shuffle rows, and I didn't find any parameter like 'shuffle=False' to avoid the shuffling. How can we compare the assigned topics to the original rows of the dataset if the rows get shuffled after applying LDA?

ZEA_TATA
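
For reference, scikit-learn's LDA does not reorder documents: transform() returns one row per input document, in input order, so assignments can be written straight back onto the original DataFrame. A sketch, assuming df, X, and lda from the earlier pipeline:

# transform() preserves row order: row i of doc_topics corresponds to document i.
doc_topics = lda.transform(X)
df["topic"] = doc_topics.argmax(axis=1)  # dominant topic for each original row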

Which should I use, Amazon AWS or Google GCP, to deploy my API as a service so that many people can use it?

chturbhujisingh

Hi Sir, could you suggest a way of selecting the optimal number of topics? Is grid search a good way?

ashwinpalnitkar
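
Grid search is one reasonable approach: GridSearchCV will fall back on LDA's score() method (approximate log-likelihood, higher is better) when no scoring is given. A sketch, assuming X from the earlier pipeline:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Cross-validated search over candidate topic counts using LDA's log-likelihood.
params = {"n_components": [5, 10, 15, 20]}
search = GridSearchCV(LatentDirichletAllocation(random_state=42), params, cv=3)
search.fit(X)
print(search.best_params_)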

Are there ways to perform topic modelling other than LDA?

icudednow
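
Yes: common alternatives include NMF and LSA (TruncatedSVD) in scikit-learn, and libraries such as gensim or BERTopic outside it. A minimal NMF sketch, assuming docs as before:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Unlike LDA, NMF pairs naturally with TF-IDF features.
tfidf = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)
X_tfidf = tfidf.fit_transform(docs)
nmf = NMF(n_components=10, random_state=42)
doc_topics = nmf.fit_transform(X_tfidf)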

In train_test_split, you said a 60:40 split, but I think you set test_size to 0.6. Just an observation: that makes it a 40:60 split.

adifull
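
The observation is right: test_size is the fraction placed in the test set, so 0.6 yields 40:60. For a true 60:40 split:

from sklearn.model_selection import train_test_split

# test_size=0.4 keeps 60% for training and 40% for testing.
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)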

Thank you for the video. Can you please tell me how to choose between CountVectorizer and TfidfVectorizer? CountVectorizer just counts the words in a document, but TF-IDF gives a score based on frequency within the document relative to the other documents in the corpus. So can I use TF-IDF every time?

ashokkumarreddy
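
One practical rule of thumb (not from the video): scikit-learn's LDA models word counts, so CountVectorizer is the natural choice for it, while TF-IDF suits NMF and similarity tasks; TF-IDF is not automatically better every time. The two differ only in the vectorizer:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_X = CountVectorizer(stop_words="english").fit_transform(docs)  # raw counts, for LDA
tfidf_X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # weighted scores, for NMF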

Is there a need for a train and test split, given that there is no learning involved during the training process?

shrikanthsingh

Hi Sir, scikit-learn seems to give great results with a good estimate of the topic number. Could you suggest a topic-number optimisation process? From what I have read and seen so far, perplexity and log-likelihood are not the best measures for computing the optimal number of topics. Please suggest a better metric.

avisankhadutta
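
One common heuristic, with the caveat the commenter raises: compare held-out perplexity across candidate topic counts (lower is better), and consider topic coherence (e.g. gensim's CoherenceModel) as a metric that often tracks human judgement more closely. A sketch, assuming X_train and X_test document-term matrices:

from sklearn.decomposition import LatentDirichletAllocation

# Held-out perplexity for several topic counts; lower is better, though it
# can disagree with human judgements of topic quality.
for k in [5, 10, 15, 20, 25]:
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(X_train)
    print(k, lda.perplexity(X_test))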

How can we label the topics? What do we map the names topic0, topic1, etc. to?

tanvigupta
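
Topic names are not produced by the model; the usual practice is to inspect the highest-weighted words per topic and assign labels manually. A sketch, assuming vectorizer and lda from the earlier pipeline:

import numpy as np

# Show the ten highest-weighted words for each topic as a basis for naming it.
feature_names = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = np.argsort(weights)[::-1][:10]
    print(f"topic{i}:", ", ".join(feature_names[j] for j in top))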

Sir, I am getting a 404 error while downloading the dataset. I think the dataset is not public on your GitHub.

Cricketpracticevideoarchive

Sir, can you give a real-world use case of named entity recognition?

I am unable to imagine how this technique is used in the real world.

shaikrasool
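
Typical real-world NER uses include resume parsing, redacting personal data in complaints, and pulling companies, places, and amounts out of news text. A minimal sketch with spaCy (an assumption, not the video's code):

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Bangalore in January for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple/ORG, Bangalore/GPE, January/DATE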

Can anyone advise me?
Example: I have 20k customer feedback texts (ratings are not available), and I need to classify each review.
Can I use this model to create labels (positive or negative) for each review?
Then we could build a classification model using those labels and reviews.

maYYidtS
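
A caution: topic models find themes, not sentiment, so topic labels are a poor stand-in for positive/negative. A more direct way to bootstrap labels is a lexicon-based scorer such as NLTK's VADER (an assumption, not from the video), then train a classifier on those weak labels:

from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

# Score each review's sentiment and turn the compound score into a weak label.
sia = SentimentIntensityAnalyzer()
labels = ["positive" if sia.polarity_scores(r)["compound"] >= 0 else "negative"
          for r in reviews]  # reviews: the 20k feedback strings
# These weak labels can then supervise a conventional text classifier.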

Great tutorial, Sir.
I had a doubt about .components_.
Here is the explanation of what lda.components_ does, from the documentation:
'''
components_ :
Variational parameters for topic word distribution.
Since the complete conditional for topic word distribution is a Dirichlet,
components_[i, j] can be viewed as pseudocount that represents the number of
times word j was assigned to topic i. It can also be viewed as distribution
over the words for each topic after normalization:
model.components_ / model.components_.sum(axis=1)[:, np.newaxis].
'''
This is my understanding:
'''H (components_) basically gives how many times each word was assigned to each topic.'''


What should our intuition be for understanding that .components_ part?

sushantpenshanwar
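
The intuition matches the docstring: each row of components_ is an unnormalised word distribution for one topic (pseudocounts of word-to-topic assignments), and dividing each row by its sum turns it into a proper probability distribution. A sketch, assuming lda from the earlier pipeline:

import numpy as np

# Row-normalise the pseudocounts into per-topic word distributions.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
assert np.allclose(topic_word.sum(axis=1), 1.0)  # each topic now sums to 1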