BERTopic Explained
90% of the world's data is unstructured. It is built by humans, for humans. That's great for human consumption, but it is *very* hard to organize at the massive scale of today's information age.
Organization is complicated because unstructured text data is not intended to be understood by machines, and having humans process this abundance of data is wildly expensive and *very slow*.
Fortunately, there is light at the end of the tunnel. More and more of this unstructured text is becoming accessible to, and understood by, machines. We can now search text based on *meaning*, identify the sentiment of text, extract entities, and much more.
Transformers are behind much of this. These transformers are (unfortunately) not Michael Bay's Autobots and Decepticons and (fortunately) not buzzing electrical boxes. Our NLP transformers lie somewhere in the middle: they're not sentient Autobots (yet), but they can understand language in a way that existed only in sci-fi until a few short years ago.
Machines with a human-like comprehension of language are pretty helpful for organizing masses of unstructured text data. In machine learning, we refer to this task as *topic modeling*, the automatic clustering of data into particular topics.
BERTopic takes advantage of the superior language capabilities of these (not yet sentient) transformer models and adds some other ML magic like UMAP and HDBSCAN (more on these later) to produce one of the most advanced topic modeling techniques available today.
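To make that pipeline concrete, here is a minimal sketch of BERTopic's default workflow, plus the component-swapping shown in the "Custom BERTopic" chapter. The 20 Newsgroups corpus and the UMAP/HDBSCAN parameter values below are illustrative stand-ins, not necessarily what the video's notebooks use:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from hdbscan import HDBSCAN

# Stand-in corpus; BERTopic needs a reasonably large set of documents.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"]

# Default pipeline: transformer embeddings -> UMAP -> HDBSCAN -> c-TF-IDF.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())

# The same pipeline with the UMAP and HDBSCAN steps swapped in explicitly,
# as covered in the later chapters (parameter values are illustrative).
custom_model = BERTopic(
    umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=20, metric="euclidean"),
)
```

Those two constructor arguments map directly onto the chapters below: the embeddings are reduced by UMAP, clustered by HDBSCAN, and each cluster is described by its highest-scoring c-TF-IDF words.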
🌲 Pinecone article:
🔗 Code notebooks:
🤖 70% Discount on the NLP With Transformers in Python course:
🎉 Subscribe for Article and Video Updates!
👾 Discord:
00:00 Intro
01:40 In this video
02:58 BERTopic Getting Started
08:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts