BERTopic Explained

90% of the world's data is unstructured. It is built by humans, for humans. That's great for human consumption, but it is *very* hard to organize once we begin dealing with the massive volumes of data produced in today's information age.

Organization is complicated because unstructured text data is not intended to be understood by machines, and having humans process this abundance of data is wildly expensive and *very slow*.

Fortunately, there is light at the end of the tunnel. More and more of this unstructured text is becoming accessible and understood by machines. We can now search text based on *meaning*, identify the sentiment of text, extract entities, and much more.

Transformers are behind much of this. These transformers are (unfortunately) not Michael Bay's Autobots and Decepticons and (fortunately) not buzzing electrical boxes. Our NLP transformers lie somewhere in the middle: they're not sentient Autobots (yet), but they can understand language in a way that existed only in sci-fi until just a few years ago.

Machines with a human-like comprehension of language are pretty helpful for organizing masses of unstructured text data. In machine learning, we refer to this task as *topic modeling*, the automatic clustering of data into particular topics.

BERTopic takes advantage of the superior language capabilities of these (not yet sentient) transformer models and adds some other ML magic like UMAP and HDBSCAN (more on these later) to produce one of the most advanced topic modeling techniques available today.
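For a concrete sense of what that looks like, here is a minimal sketch of typical BERTopic usage (the documents below are hypothetical placeholders; real runs need far more data):

from bertopic import BERTopic

# hypothetical toy documents; in practice you would pass thousands
docs = [
    "PyTorch vs TensorFlow for deep learning",
    "How to fine-tune a transformer for text classification",
    "Best hiking trails in the Alps",
]

# fit_transform embeds the docs with a transformer, reduces dimensionality
# with UMAP, clusters with HDBSCAN, and extracts keywords with c-TF-IDF
model = BERTopic()
topics, probs = model.fit_transform(docs)

print(model.get_topic_info())  # one row per discovered topic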

🌲 Pinecone article:

🔗 Code notebooks:

🤖 70% Discount on the NLP With Transformers in Python course:

🎉 Subscribe for Article and Video Updates!

👾 Discord:

00:00 Intro
01:40 In this video
02:58 BERTopic Getting Started
08:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts
Comments

I know this video is a year old, but on the dimensionality reduction part: you don't reduce the dimensions just because there is too much information and you want to compress it. It's mainly due to the "curse of dimensionality", where increasing the number of dimensions makes the distances between data points less and less meaningful. So trying to cluster in this high-dimensional space will result in arbitrary clusters, because the distances carry almost no information.
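A quick numerical sketch of that effect (uniform random points, illustrative only):

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))  # 500 uniform random points
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    # the gap between nearest and farthest neighbour shrinks relative
    # to the distances themselves as dimensionality grows
    print(dim, (dists.max() - dists.min()) / dists.min())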

egericke

Awesome video James! Great idea using a world map to illustrate dimensionality reduction techniques!

tomwalczak

I found you yesterday while looking for an NSWG video, and now this. Really nice how relevant your videos are.

renatorao

Amazing. It was a lot of fun to go through this!

shaheerzaman

Well-prepared demo and well-crafted video, thanks mate!

WouterSuren

Thank you for the video! The notebook seems to have been removed from the repository. Is it still available?

janspoerer

Thanks a million James. So clearly explained!

kantafcb

Hi James, when doing topic modeling with BERTopic, how do we choose UMAP's n_neighbors and n_components if we don't already have predefined topic labels, like the Reddit data's Sub field for the Selfnotes field?
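For reference, this is where those parameters get set (the values are just illustrative, not recommendations):

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=15,   # larger values preserve more global structure
    n_components=5,   # output dimensionality passed on to clustering
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)
model = BERTopic(umap_model=umap_model)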

junchoi

Wow, your videos always save my schoolwork. If one day I become a millionaire, I will give you part of my company.

ernestosantiesteban

Thank you so much. You are amazing at explaining, lecturer. It's very understandable. I will recommend this video to all my friends.

waleedkhalid

Hi there, the link to the Colab notebook shows "repository not found". Could you please update the link?

shameekm

Hi James, I had an issue running the UMAP code; it didn't work for me as-is. I fixed it with the following steps:

pip uninstall umap
pip install umap-learn

and then imported it as:

import umap.umap_ as umap

Can you confirm whether umap worked for you without these steps?

averma

BERTopic is not working on Windows. Could you please create a video just on installing it? I know it sounds ridiculous, but I have tried everything.

dr.kingschultz

Thank you very much for the tutorial, sir. How are the parameters tuned in BERTopic? What is the advantage of BERTopic over standard topic models such as LDA and NMF? What is the difference in the number of parameters between standard topic models and BERTopic? Please help, sir.

seemarani

Hi James, thanks for the excellent set of videos. Do you know of any pretrained SentenceTransformer models that can work on longer documents?

RajeshGupta-gxyz

Can BERTopic be used for small datasets (fewer than 1,000 rows) or with short sentences per row? Will it still be reliable for topic modeling, then running UMAP and clustering?

aizasyamimi

Does BERTopic understand the context in which words are used? In your example, PyTorch can be the most-used word in a particular topic, but if the word PyTorch is used differently in different contexts within that topic, does the model pick up on that, since it uses transformers?

brianferrell

Thanks! I have one question: if I get a cluster 2 with the words year, month, and time, and I want to manually remove the word time from cluster 2 and put it into another cluster, is that possible?

wasgeht

James! Thanks for the tutorial. Very helpful. The only thing I do not know how to do is implement this with my own custom dataset. My data is already split into topics; it looks like this:
topic 1 = ["sentence 1", "sentence 2", ...],
topic 2 = ["sentence 3", "sentence 4", ...]
topic 3 = ...

How do I turn it into the correct format, like the data['title'][i] you are using with the pretrained model?


If you could explain how to organize my data to make this work, or direct me to a resource, that would be amazing.
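In case it clarifies the question, here is a sketch of the flattening I have in mind (hypothetical names, assuming fit_transform just needs a flat list of strings):

data = {
    "topic 1": ["sentence 1", "sentence 2"],
    "topic 2": ["sentence 3", "sentence 4"],
}

docs = [s for sentences in data.values() for s in sentences]      # flat list for fit_transform
labels = [t for t, sentences in data.items() for _ in sentences]  # original labels kept aside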

Subscribed!

BlackLightningGames

Nice explanation! Can BERTopic work well in other languages (Hindi, Urdu) for topic modeling?
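For what it's worth, BERTopic does ship a multilingual mode; a minimal sketch (whether it works *well* for Hindi or Urdu would need testing on real data):

from bertopic import BERTopic

# language="multilingual" selects a multilingual sentence-transformer embedding model
model = BERTopic(language="multilingual")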

sanazulfiqar