Python Code for BERT Paragraph Vector Embedding w/ Transformers (PyTorch, Colab)

BERT "paragraph vector embeddings": texts are mapped into a high-dimensional vector space where semantically similar sentences, paragraphs, and documents lie close together!

Part 1 of this video is called:
How to code BERT Word + Sentence Vectors (Embedding) w/ Transformers?

Plus a simple PCA visualization using a BERT-base model, where three semantic clusters become visible in 2D PCA. However, for a vector space with more than 1000 dimensions, I would recommend UMAP for dimensionality reduction and HDBSCAN for clustering.
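The paragraph vectors described above are typically obtained by averaging BERT's token embeddings. A minimal sketch of that mean pooling, assuming the Hugging Face `transformers` library and a `bert-base-uncased` checkpoint (the pooling helper itself needs only NumPy; the model names and variable names are illustrative):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors into one paragraph vector, skipping padding.

    token_embeddings: (seq_len, hidden_dim) array of BERT outputs
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding)
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# With transformers + PyTorch it could be wired up like this (not executed here):
# from transformers import AutoTokenizer, AutoModel
# import torch
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# model = AutoModel.from_pretrained("bert-base-uncased")
# enc = tok("An example paragraph about economics.", return_tensors="pt")
# with torch.no_grad():
#     out = model(**enc)
# vec = mean_pool(out.last_hidden_state[0].numpy(),
#                 enc["attention_mask"][0].numpy())  # one fixed-size vector
```

Mean pooling is only one common choice; the `[CLS]` token vector or max pooling are alternatives, and which works best depends on the downstream task.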

Great informative sources and Colab notebooks (referenced in the video):

Principal component analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

Limitation of PCA:
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.
PCA can capture linear correlations between features but fails when that linearity assumption is violated. It is also not optimized for class separability.
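The PCA recipe described above (center each feature, take the SVD, project onto the leading components) can be sketched in a few lines of NumPy; the function name is illustrative:

```python
import numpy as np

def pca_project(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows of X onto the top principal components.

    The data is centered per feature but not scaled, then decomposed
    with SVD; the right singular vectors are the principal directions.
    """
    Xc = X - X.mean(axis=0)                        # center, do not scale
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                # scores in the top-k subspace
```

Because singular values come back sorted in descending order, the first output column always carries at least as much variance as the second, which is exactly the "maximize variance of the projected data" property stated above.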

#datascience
#machinelearningwithpython
#embedding
#vectorspace
Comments

In physics there is the concept of cosmic inflation, so I understand that correlation with the economic sentences.

BryantAvila

Really!!!! This video is a powerful help to me, thank you!! Have a nice day, HAHA

가상민-xn

Have you thought about making a video on fine-tuning BERT for domain adaptation to generate sentence embeddings?

wilfredomartel

I have accident report data where each report contains multiple sentences. How can I compute a paragraph embedding for each report composed of multiple sentences? Any ideas?

abiddanish