BERTopic Explained

90% of the world's data is unstructured. It is built by humans, for humans. That's great for human consumption, but it is *very* hard to organize once we begin dealing with the massive volumes of data produced in today's information age.

Organization is complicated because unstructured text data is not intended to be understood by machines, and having humans process this abundance of data is wildly expensive and *very slow*.

Fortunately, there is light at the end of the tunnel. More and more of this unstructured text is becoming accessible and understood by machines. We can now search text based on *meaning*, identify the sentiment of text, extract entities, and much more.

Transformers are behind much of this. These transformers are (unfortunately) not Michael Bay's Autobots and Decepticons and (fortunately) not buzzing electrical boxes. Our NLP transformers lie somewhere in the middle: they're not sentient Autobots (yet), but they can understand language in a way that existed only in sci-fi until just a few years ago.

Machines with a human-like comprehension of language are pretty helpful for organizing masses of unstructured text data. In machine learning, we refer to this task as *topic modeling*, the automatic clustering of data into particular topics.

BERTopic takes advantage of the superior language capabilities of these (not yet sentient) transformer models and adds some other ML magic like UMAP and HDBSCAN (more on these later) to produce one of the most advanced topic modeling techniques available today.
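For a concrete sense of what that looks like, here is a minimal sketch of typical BERTopic usage (the documents below are hypothetical placeholders; real runs need far more data):

from bertopic import BERTopic

# hypothetical toy documents; in practice you would pass thousands
docs = [
    "PyTorch vs TensorFlow for deep learning",
    "How to fine-tune a transformer for text classification",
    "Best hiking trails in the Alps",
]

# fit_transform embeds the docs with a transformer, reduces dimensionality
# with UMAP, clusters with HDBSCAN, and extracts keywords with c-TF-IDF
model = BERTopic()
topics, probs = model.fit_transform(docs)

print(model.get_topic_info())  # one row per discovered topic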

🌲 Pinecone article:

🔗 Code notebooks:

🤖 70% Discount on the NLP With Transformers in Python course:

🎉 Subscribe for Article and Video Updates!

👾 Discord:

00:00 Intro
01:40 In this video
02:58 BERTopic Getting Started
08:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts
Comments

I know this video is a year old, but on the dimensionality reduction part: you don't reduce the dimensions just because there is too much information and you want to compress it. It's mainly due to the "curse of dimensionality", where increasing the number of dimensions makes the distances between data points less and less meaningful. So trying to cluster in this high-dimensional space will result in arbitrary clusters, because the distances carry almost no information.
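A quick numerical sketch of that effect (uniform random points, illustrative only):

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))  # 500 uniform random points
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    # the gap between nearest and farthest neighbour shrinks relative
    # to the distances themselves as dimensionality grows
    print(dim, (dists.max() - dists.min()) / dists.min())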

egericke

Awesome video James! Great idea using a world map to illustrate dimensionality reduction techniques!

tomwalczak

I found you yesterday while looking for an NSWG video, and now this. Really nice how relevant your videos are.

renatorao

Amazing. It was a lot of fun to go through this!

shaheerzaman

Well-prepared demo and well-crafted video, thanks mate!

WouterSuren

Thank you for the video! The notebook seems to have been removed from the repository. Is it still available?

janspoerer

Thanks a million James. So clearly explained!

kantafcb

Hi James, when doing topic modeling with BERTopic, how do we choose UMAP's n_neighbors and n_components if we don't already have predefined topic labels, like the Reddit data's Sub field for the Selfnotes field?
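For reference, this is where those parameters get set (the values are just illustrative, not recommendations):

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=15,   # larger values preserve more global structure
    n_components=5,   # output dimensionality passed on to clustering
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)
model = BERTopic(umap_model=umap_model)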

junchoi

Wow, your videos always save my schoolwork. If one day I become a millionaire, I will give you part of my company.

ernestosantiesteban

Thank you so much. You are amazing at explaining, lecturer. It's very understandable. I will recommend this video to all my friends.

waleedkhalid

Hi there, the link to the Colab notebook shows "repository not found". Could you please update the link?

shameekm

Hi James, I had an issue running the UMAP code; it didn't work for me as-is. I fixed it with the following steps:

pip uninstall umap
pip install umap-learn

and then imported it as:

import umap.umap_ as umap

Can you confirm whether umap worked for you without these steps?

averma

BERTopic is not working on Windows. Could you please create a video just on installing it? I know it sounds ridiculous, but I have tried everything.

dr.kingschultz

Thank you very much for the tutorial, sir. How are the parameters tuned in BERTopic? What is the advantage of BERTopic over standard topic models such as LDA and NMF? What is the difference in the number of parameters between standard topic models and BERTopic? Please help, sir.

seemarani

Hi James, thanks for the excellent set of videos. Do you know of any pretrained SentenceTransformer models that can work on longer documents?

RajeshGupta-gxyz

Can BERTopic be used for small datasets (fewer than 1,000 rows) or with short sentences per row? Will it still be reliable for topic modeling, then running UMAP and clustering?

aizasyamimi

Does BERTopic understand the context in which words are used? In your example, PyTorch can be the most-used word in a particular topic, but if the word PyTorch is used differently in different contexts within that topic, does the model pick up on that, since it uses transformers?

brianferrell

Thanks! I have one question: if I get a cluster 2 with the words year, month, and time, and I want to manually remove the word time from cluster 2 and put it into another cluster, is that possible?

wasgeht

James! Thanks for the tutorial. Very helpful. The only thing I do not know how to do is implement this with my own custom dataset. My data is already split into topics; it looks like this:
topic 1 = ["sentence 1", "sentence 2", ...],
topic 2 = ["sentence 3", "sentence 4", ...]
topic 3 = ...

How do I turn it into the correct format, like the data['title'][i] you are using with the pretrained model?


If you could explain how to organize my data to make this work, or direct me to a resource, that would be amazing.
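In case it clarifies the question, here is a sketch of the flattening I have in mind (hypothetical names, assuming fit_transform just needs a flat list of strings):

data = {
    "topic 1": ["sentence 1", "sentence 2"],
    "topic 2": ["sentence 3", "sentence 4"],
}

docs = [s for sentences in data.values() for s in sentences]      # flat list for fit_transform
labels = [t for t, sentences in data.items() for _ in sentences]  # original labels kept aside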

Subscribed!

BlackLightningGames

Nice explanation! Can BERTopic work well in other languages (Hindi, Urdu) for topic modeling?
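For what it's worth, BERTopic does ship a multilingual mode; a minimal sketch (whether it works *well* for Hindi or Urdu would need testing on real data):

from bertopic import BERTopic

# language="multilingual" selects a multilingual sentence-transformer embedding model
model = BERTopic(language="multilingual")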

sanazulfiqar