Cosine Similarity ← Natural Language Processing ← Socratica

๐™„๐™ฃ๐™ฉ๐™ง๐™ค๐™™๐™ช๐™˜๐™ž๐™ฃ๐™œ ๐™Ž๐™ค๐™˜๐™ง๐™–๐™ฉ๐™ž๐™˜๐™– ๐˜พ๐™Š๐™๐™๐™Ž๐™€๐™Ž

Cosine similarity is a way to compare two pieces of text (documents) to see how stylistically similar they are. It is a useful technique from Natural Language Processing, a growing subfield of AI and Machine Learning. In this lesson, we review how to use the bag-of-words technique to turn a piece of text into a vector, then show how the cosine similarity measure gives a useful way to compare two documents. As a concrete application, we compare ten classic novels by different authors from different time periods to see how well the cosine similarity measure performs.
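Below is a minimal Python sketch of the two steps described above (word counts as a feature vector, then the normalized dot product), for readers who want to try the idea outside the video. The lesson itself is done in the Wolfram Language / Mathematica; the function names and toy sentences here are only illustrative, not the code from the lesson.

import math
from collections import Counter

def bag_of_words(text):
    # Count how often each lowercase word appears in the text.
    words = [w.strip(".,;:!?'\"()") for w in text.lower().split()]
    return Counter(w for w in words if w)

def cosine_similarity(counts_a, counts_b):
    # Cosine of the angle between two word-count vectors:
    # dot(a, b) / (|a| * |b|), so 1.0 means the same direction.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "It was the best of times, it was the worst of times."
doc2 = "It was a bright cold day in April."
print(cosine_similarity(bag_of_words(doc1), bag_of_words(doc2)))

In the video, the same two steps are applied to the full text of each of the ten novels.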

๐™”๐™ค๐™ช ๐™˜๐™–๐™ฃ ๐™Ÿ๐™ช๐™ข๐™ฅ ๐™ฉ๐™ค ๐™จ๐™š๐™˜๐™ฉ๐™ž๐™ค๐™ฃ๐™จ ๐™ค๐™› ๐™ฉ๐™๐™š ๐™ซ๐™ž๐™™๐™š๐™ค ๐™๐™š๐™ง๐™š:
0:00 Intro
0:48 Prerequisites
1:43 The Big Idea
3:39 Cosine Similarity
4:42 Example setup
5:47 The Books
6:51 Building a Feature Vector
8:56 Writing the Functions
10:08 Computing Cosine Similarities
11:30 No Stop Words
12:50 Analysis
14:00 No Nouns

๐™’๐˜ผ๐™๐˜พ๐™ƒ ๐™‰๐™€๐™“๐™:
Bag of Words

Use Mathematica for Free

BTW, Socratica offers a pro course, 'Mathematica Essentials,' which covers the key concepts for mastering Wolfram products:

Thank you to our VIP Patreon Members who helped make this video possible!
José Juan Francisco Castillo Rivera
KW
M Andrews
Jim Woodworth
Marcos Silveira
Christopher Kemsley
Eric Eccleston
Jeremy Shimanek
Michael Shebanow
Alvin Khaled
Kevin B
John Krawiec
Umar Khan
Tracy Karin Prell
Thank you, kind friends! 💜🦉

✷✷✷
We recommend the following (affiliate links):
The Wolfram Language

The Mythical Man Month - Essays on Software Engineering & Project Management

Innumeracy: Mathematical Illiteracy and Its Consequences

Mindset by Carol Dweck

How to Be a Great Student (our first book!)

✷✷✷
If you find our work at Socratica valuable, please consider becoming our Patron on Patreon!

If you would prefer to make a one-time donation, you can also use
Socratica Paypal

✷✷✷
Written & Produced by Michael Harrison & Kimberly Hatch Harrison
Edited by Megi Shuke

About our Instructors:

Michael earned his BS in Math from Caltech, and did his graduate work in Math at UC Berkeley and University of Washington, specializing in Number Theory. A self-taught programmer, Michael taught both Math and Computer Programming at the college level. He applied this knowledge as a financial analyst (quant) and as a programmer at Google.

Kimberly earned her BS in Biology and another BS in English at Caltech. She did her graduate work in Molecular Biology at Princeton, specializing in Immunology and Neurobiology. Kimberly spent 16+ years as a research scientist and a dozen years as a biology and chemistry instructor.

Michael and Kimberly Harrison co-founded Socratica.
Their mission? To create the education of the future.

✷✷✷

PLAYLISTS

#cosinesimilarity #AI #naturallanguageprocessing
Comments
ะะฒั‚ะพั€

Fantastic video, great to see another Socratica video in my feed

MakeDataUseful
ะะฒั‚ะพั€

So cool, being able to find similarities in books from neighboring time periods was fascinating.

juanmacias
ะะฒั‚ะพั€

Keep 'em coming; the courses are looking good too.

jagadishgospat
ะะฒั‚ะพั€

This is phenomenal! Here I was, thinking we were just going to talk about the cos a = a approximation in trig. Bonus!

Insightfill
ะะฒั‚ะพั€

Very useful info, and the approach was excellent, very fun too

OPlutarch
ะะฒั‚ะพั€

I don't like removing "stop words" from the statistics, because their frequency is still meaningful. Even though everybody uses the word "the" frequently, some authors use it much more than others, and that is a characteristic that should not be ignored.
Instead, I would suggest performing some kind of "normalization", like dividing each word count by the average occurrence rate of that particular word in natural language.
Rather than raw word counts, the vector coordinates would then be the relative use rate of each word in the book compared to its average use rate in general language.
That would make for a much more precise comparison, because it isn't just stop words that are very common; some words are inherently much more common than others.

Although I haven't run the experiment, I suspect that this way everything would have a much lower cosine similarity.

ahmedouerfelli
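For what it's worth, here is a minimal Python sketch of the normalization suggested in the comment above: divide each raw word count by that word's expected occurrence rate before computing cosine similarity. The baseline rates and sample texts are made-up illustrative values, not real corpus statistics or anything from the video.

import math
from collections import Counter

# Hypothetical per-million baseline rates for a few words (illustrative only).
BASELINE_RATE = {"the": 50000, "of": 30000, "whale": 20, "ship": 60}
DEFAULT_RATE = 10  # assumed rate for words missing from the table

def normalized_vector(counts):
    # Scale each count by 1 / expected rate, so over-used words stand out.
    return {w: c / BASELINE_RATE.get(w, DEFAULT_RATE) for w, c in counts.items()}

def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

counts_1 = Counter("the whale the whale the ship of the sea".split())
counts_2 = Counter("the ship the ship of the harbor".split())
print(cosine_similarity(normalized_vector(counts_1), normalized_vector(counts_2)))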
ะะฒั‚ะพั€

That's a very intuitive and helpful explanation, thank you. But pray tell, prithee even, isn't some relationship between words in individual sentences what we would prefer (smaller angles)? It seems odd to me that when creating embeddings we're focused on these huge arcs rather than the smaller arcs that build understanding on a more basic level. The threshold for AI in GPT-3 seems to have been a huge amount of text, but isn't there some way to make that smaller? For most of us, that's the only way we can even contribute, as we just don't have the computer hardware.

AndrewMilesMurphy
ะะฒั‚ะพั€

Pretty good, but the visualization of the results could have been something other than a table. That way, you wouldn't have to explain why the diagonal is 1 and why every number appears twice (mirrored across the diagonal). You'd end up with just 45 data points rather than 100, and could then compare the "top 10" across the different measurements. This would be much easier to follow.

danielschmider
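A small Python sketch of the presentation suggested in the comment above: score only the C(10, 2) = 45 unique book pairs and list the closest ones, instead of printing a full 10x10 table. The book titles and similarity function here are placeholders, not results from the video.

import random
from itertools import combinations

random.seed(0)

def placeholder_similarity(a, b):
    # Stand-in for a real cosine similarity between two books.
    return random.uniform(0.5, 1.0)

titles = [f"Book {i}" for i in range(1, 11)]  # 10 books -> C(10, 2) = 45 pairs
pairs = [(a, b, placeholder_similarity(a, b)) for a, b in combinations(titles, 2)]
for a, b, score in sorted(pairs, key=lambda p: p[2], reverse=True)[:10]:
    print(f"{a} vs {b}: {score:.3f}")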