Cosine Similarity ← Natural Language Processing ← Socratica

๐™„๐™ฃ๐™ฉ๐™ง๐™ค๐™™๐™ช๐™˜๐™ž๐™ฃ๐™œ ๐™Ž๐™ค๐™˜๐™ง๐™–๐™ฉ๐™ž๐™˜๐™– ๐˜พ๐™Š๐™๐™๐™Ž๐™€๐™Ž

Cosine similarity is a way to compare two pieces of text (documents) to see how stylistically similar they are. It is a useful technique from Natural Language Processing, a growing subfield of AI and Machine Learning. In this lesson, we review how to use the bag-of-words technique to turn a piece of text into a vector, then show how the cosine similarity measure gives a useful way to compare two documents. As a concrete application, we compare ten classic novels by different authors from different time periods to see how well the cosine similarity measure performs.
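Below is a minimal Python sketch of the two steps described above (word counts as a feature vector, then the normalized dot product), for readers who want to try the idea outside the video. The lesson itself is done in the Wolfram Language / Mathematica; the function names and toy sentences here are only illustrative, not the code from the lesson.

import math
from collections import Counter

def bag_of_words(text):
    # Count how often each lowercase word appears in the text.
    words = [w.strip(".,;:!?'\"()") for w in text.lower().split()]
    return Counter(w for w in words if w)

def cosine_similarity(counts_a, counts_b):
    # Cosine of the angle between two word-count vectors:
    # dot(a, b) / (|a| * |b|), so 1.0 means the same direction.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "It was the best of times, it was the worst of times."
doc2 = "It was a bright cold day in April."
print(cosine_similarity(bag_of_words(doc1), bag_of_words(doc2)))

In the video, the same two steps are applied to the full text of each of the ten novels.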

๐™”๐™ค๐™ช ๐™˜๐™–๐™ฃ ๐™Ÿ๐™ช๐™ข๐™ฅ ๐™ฉ๐™ค ๐™จ๐™š๐™˜๐™ฉ๐™ž๐™ค๐™ฃ๐™จ ๐™ค๐™› ๐™ฉ๐™๐™š ๐™ซ๐™ž๐™™๐™š๐™ค ๐™๐™š๐™ง๐™š:
0:00 Intro
0:48 Prerequisites
1:43 The Big Idea
3:39 Cosine Similarity
4:42 Example setup
5:47 The Books
6:51 Building a Feature Vector
8:56 Writing the Functions
10:08 Computing Cosine Similarities
11:30 No Stop Words
12:50 Analysis
14:00 No Nouns

๐™’๐˜ผ๐™๐˜พ๐™ƒ ๐™‰๐™€๐™“๐™:
Bag of Words

Use Mathematica for Free

BTW, Socratica offers a pro course, 'Mathematica Essentials,' which covers the key concepts for mastering Wolfram products:

Thank you to our VIP Patreon Members who helped make this video possible!
José Juan Francisco Castillo Rivera
KW
M Andrews
Jim Woodworth
Marcos Silveira
Christopher Kemsley
Eric Eccleston
Jeremy Shimanek
Michael Shebanow
Alvin Khaled
Kevin B
John Krawiec
Umar Khan
Tracy Karin Prell
Thank you, kind friends! 💜🦉

✷✷✷
We recommend the following (affiliate links):
The Wolfram Language

The Mythical Man Month - Essays on Software Engineering & Project Management

Innumeracy: Mathematical Illiteracy and Its Consequences

Mindset by Carol Dweck

How to Be a Great Student (our first book!)

✷✷✷
If you find our work at Socratica valuable, please consider becoming our Patron on Patreon!

If you would prefer to make a one-time donation, you can also use
Socratica Paypal

✷✷✷
Written & Produced by Michael Harrison & Kimberly Hatch Harrison
Edited by Megi Shuke

About our Instructors:

Michael earned his BS in Math from Caltech, and did his graduate work in Math at UC Berkeley and University of Washington, specializing in Number Theory. A self-taught programmer, Michael taught both Math and Computer Programming at the college level. He applied this knowledge as a financial analyst (quant) and as a programmer at Google.

Kimberly earned her BS in Biology and another BS in English at Caltech. She did her graduate work in Molecular Biology at Princeton, specializing in Immunology and Neurobiology. Kimberly spent 16+ years as a research scientist and a dozen years as a biology and chemistry instructor.

Michael and Kimberly Harrison co-founded Socratica.
Their mission? To create the education of the future.

✷✷✷

PLAYLISTS

#cosinesimilarity #AI #naturallanguageprocessing
Comments
ะะฒั‚ะพั€

Fantastic video, great to see another Socratica video in my feed

MakeDataUseful
ะะฒั‚ะพั€

So cool, being able to find similarities in books from neighboring time periods was fascinating.

juanmacias
ะะฒั‚ะพั€

Keep 'em coming; the courses are looking good too.

jagadishgospat
ะะฒั‚ะพั€

This is phenomenal! Here I was, thinking we were just going to talk about the cos a = a approximation in trig. Bonus!

Insightfill
ะะฒั‚ะพั€

Very useful info, and the approach was excellent, very fun too

OPlutarch
ะะฒั‚ะพั€

I don't like removing "stop words" from the statistics, because their frequency is still meaningful. Even though everybody uses the word "the" frequently, some authors use it much more than others, and that is a characteristic that should not be ignored.
Instead, I would suggest performing some kind of "normalization", like dividing each word count by the average occurrence rate of that particular word in natural language.
Rather than raw word counts, the vector coordinates would then be the relative use rate of each word in the book compared to its average use rate in general language.
That would make for a much more precise comparison, because it isn't just stop words that are very common; some words are inherently much more common than others.

Although I haven't run the experiment, I suspect that this way everything would have a much lower cosine similarity.

ahmedouerfelli
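For what it's worth, here is a minimal Python sketch of the normalization suggested in the comment above: divide each raw word count by that word's expected occurrence rate before computing cosine similarity. The baseline rates and sample texts are made-up illustrative values, not real corpus statistics or anything from the video.

import math
from collections import Counter

# Hypothetical per-million baseline rates for a few words (illustrative only).
BASELINE_RATE = {"the": 50000, "of": 30000, "whale": 20, "ship": 60}
DEFAULT_RATE = 10  # assumed rate for words missing from the table

def normalized_vector(counts):
    # Scale each count by 1 / expected rate, so over-used words stand out.
    return {w: c / BASELINE_RATE.get(w, DEFAULT_RATE) for w, c in counts.items()}

def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

counts_1 = Counter("the whale the whale the ship of the sea".split())
counts_2 = Counter("the ship the ship of the harbor".split())
print(cosine_similarity(normalized_vector(counts_1), normalized_vector(counts_2)))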
ะะฒั‚ะพั€

That's a very intuitive and helpful explanation, thank you. But pray tell, prithee even, isn't some relationship between words in individual sentences what we would prefer (smaller angles)? It seems odd to me that when creating embeddings we're focused on these huge arcs rather than the smaller arcs that build understanding on a more basic level. The threshold for AI in GPT-3 seems to have been a huge amount of text, but isn't there some way to make that smaller? For most of us, that's the only way we can even contribute, as we just don't have the computer hardware.

AndrewMilesMurphy
ะะฒั‚ะพั€

Pretty good, but the visualization of the results could have been something other than a table. That way, you wouldn't have to explain why the diagonal is 1 and why every number appears twice (mirrored across the diagonal). You'd end up with just 45 data points rather than 100, and could then compare the "top 10" across the different measurements. This would be much easier to follow.

danielschmider
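A small Python sketch of the presentation suggested in the comment above: score only the C(10, 2) = 45 unique book pairs and list the closest ones, instead of printing a full 10x10 table. The book titles and similarity function here are placeholders, not results from the video.

import random
from itertools import combinations

random.seed(0)

def placeholder_similarity(a, b):
    # Stand-in for a real cosine similarity between two books.
    return random.uniform(0.5, 1.0)

titles = [f"Book {i}" for i in range(1, 11)]  # 10 books -> C(10, 2) = 45 pairs
pairs = [(a, b, placeholder_similarity(a, b)) for a, b in combinations(titles, 2)]
for a, b, score in sorted(pairs, key=lambda p: p[2], reverse=True)[:10]:
    print(f"{a} vs {b}: {score:.3f}")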