StatQuest: PCA - Practical Tips


In this video, I give practical advice about the need to scale your data, the need to center your data, and how many principal components you should expect to get.

0:00 Awesome song and introduction
0:47 Make sure the data are on the same scale
2:53 Make sure the data are centered
3:30 How to determine the number of principal components

#statquest #PCA #ML
Comments

This is a gold mine for data scientists, data engineers, and ML/DL engineers. I can hardly think of anyone else who can teach the same concepts more clearly.

buihung

Dear Josh. I had so many issues with stats, as I am from a totally different background. Watching your videos helped me overcome my insecurities. Thank you so much.

geethanjalikannan

Josh's videos are so cool that I usually like them before watching.

caperucito

All the professors in the world need to learn how to teach from you! Thanks!

Jason-xett

The intro with you singing is so cute, made me smile...

iloveno

Your intro music always makes me smile 😂😂

shwetankagrawal

Thank you for all the amazing videos. I would be having a really hard time without them.

bendiknyheim

Thank you so much for, basically, all your videos on PCA

jesusfranciscoquevedoosegu

This PCA series (step-by-step, practical tips, then R) is brilliant! I found them very helpful. Thank you for these great videos!

Would you consider doing a series on factor analysis?

kby--b

Hey Josh, thanks so much for your videos. Three quick questions:

1. 7:54 says "if there are fewer samples than variables, the number of samples puts an upper bound on the number of PCs with eigenvalues greater than 0", but in the example there, the number of samples is equal to the number of variables, not less. Should the statement be "if # of samples <= variables (...)" then the upper bound applies?

2. In the same section above, there are only 2 PCs for 3 samples. From the initial prediction it seems like there could only be 2 PCs since 3 points make a plane. Could there be 3 PCs for 3 samples, or is the sample upper bound always # of samples - 1?

3. To clarify, 7:05 says "since we only have 2 points we only have 1 PC", so I can have a single PC with a slope in a much higher dimension? Since this PC would be in R3, that's okay, correct?

paulotarso
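
The questions above about the upper bound can be checked numerically. What follows is a minimal sketch, not taken from the video: it assumes NumPy, uses made-up scores, and the helper name `count_nonzero_pcs` is hypothetical. Centering removes one degree of freedom, so n samples can give at most n - 1 principal components with eigenvalues greater than 0, however many variables there are.

```python
import numpy as np

def count_nonzero_pcs(data, tol=1e-10):
    """Count principal components whose eigenvalue exceeds tol.

    data: rows are samples, columns are variables (an assumed layout).
    """
    centered = data - data.mean(axis=0)  # center each variable on its mean
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Eigenvalues of the sample covariance matrix, from the singular values.
    eigenvalues = singular_values**2 / (data.shape[0] - 1)
    return int(np.sum(eigenvalues > tol))

# Made-up data: 3 samples (rows) measured on 3 variables (columns).
three_samples = np.array([[10.0, 6.0, 12.0],
                          [11.0, 4.0, 9.0],
                          [8.0, 5.0, 10.0]])
two_samples = three_samples[:2]

print(count_nonzero_pcs(three_samples))  # 2: three points lie on a plane
print(count_nonzero_pcs(two_samples))    # 1: two points lie on a line
```

With these numbers, 3 samples in 3 variables give 2 PCs and 2 samples give 1 PC, which matches the "number of samples minus 1" reading of the bound, including the case where the number of samples equals the number of variables.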

Hi Josh, I am a little confused: at 2:37 you mention using the standard deviation. If the math scores (0-100) have a standard deviation of 5 and, at the same time, the reading scores (0-10) also have an sd of 5, then after dividing by the sd, math and reading are still NOT on the same scale.

Patrick
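
One way to read "same scale" in the question above is "same spread" rather than "same range". The sketch below illustrates that reading under stated assumptions: it uses NumPy, the scores are made up, and the helper `standardize` is hypothetical (it also centers the data, which the timestamp in question does not cover). After dividing by the standard deviation, each variable has a standard deviation of 1, whatever its original range was.

```python
import numpy as np

# Made-up scores for six students (hypothetical values).
math = np.array([90.0, 75.0, 60.0, 45.0, 30.0, 15.0])  # recorded on a 0-100 scale
reading = np.array([9.0, 7.0, 8.0, 3.0, 4.0, 2.0])     # recorded on a 0-10 scale

def standardize(x):
    """Center x on its mean, then divide by its sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

math_scaled = standardize(math)
reading_scaled = standardize(reading)

# Before scaling, the spreads are very different.
print(math.std(ddof=1), reading.std(ddof=1))
# After scaling, both spreads are 1 (up to floating-point rounding).
print(math_scaled.std(ddof=1), reading_scaled.std(ddof=1))
```

If two variables already share the same standard deviation, dividing by it changes both by the same factor, so neither dominates the principal components because of its units.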

Thanks for the video, but I think there is a simple mistake at 2:08, when you said to mix 0.77 Math with 0.77 Reading. I thought that both must add up to 1, or did I get something wrong?

mostafael-tager

Great Video Josh!

I am wondering about 7:32, "Find the line perpendicular to PC1 that fits best". What does this mean?
I mean, you can have either a perpendicular line or a best-fit line.

sane

Thank you so much for switching to Math and Reading, because the genes and cells examples were giving me headaches. Nevertheless, thank you so much for your efforts ♥♥

boultifnidhal

Very nice videos. Have you considered a segment on kernel PCA?

johnfinn

I have a trivial question at 1:39. If the recipe to make PC1 uses approximately 10 parts Math and only 1 part Reading, why does that mean that Math is 10 times more important than Reading for explaining the variation in the data? I understand that it will be more important, but is that specific number (10) correct?

doubletoned

Hi Josh. Thanks for your videos, especially when you are diving into details and tips.
In tip #2, concerning centering, you show 2 sets of 3 points and you present centering on the mean. Let's imagine an experiment with 3 patients on drug A and 3 patients on drugs A and B. Let's say the lower/left set is the reference, drug A, and the upper/right set is the test, drug A+B. What about centering on A (set A will be at the origin)? This centering should show the total effect of adding drug B to drug A, whereas mean centering shows half the effect. In the same vein, the variables plot should show the variables that change from the drug A set to the drug A+B set, instead of showing variables that change from the experiment mean, i.e. ((drug A + drug A+B)/2). What's your view?

samggfr

Hello Josh. At 7:57, you explained that if there are fewer samples than variables, then the number of samples puts an upper bound on the number of PCs. In the last example, there are 3 samples and 3 variables (so the number of samples isn't fewer than the number of variables), and the number of PCs should be 3 (not 2). Could you explain why you decided that the number of PCs should be 2? (BTW, I watched all of your videos about PCA, but I don't understand this specific example.)

basharabdulrazeq

At 6:19, even though the two points are on a line, does the line necessarily go through (0, 0)? If not, there can still be two PCs. Can you help clarify? Thanks.

mrweisu

Josh, thank you very much for helping us out with stats. When I get a job, I will be sure to contribute towards your efforts.
I am struggling to understand things at 3:10.
Why should it be a problem if we do NOT centre the data?
Can you please explain with respect to your "PCA - Clearly Explained" video? My prof wouldn't answer it, so I'm asking a Cool-Stat-Guru about it :)
If it requires too much elaboration, please point me to other resources. Thanks again.
Best wishes from India :)

kushaltm
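
The centering question above can be explored with a small experiment. This is a sketch under assumptions, not the video's own method: it uses NumPy, made-up data, and a hypothetical helper `first_direction`, and it treats PCA as fitting the best line through the origin, computed with an SVD. When the data are not centered, that line is pulled toward the mean of the data instead of following the direction in which the data vary.

```python
import numpy as np

def first_direction(matrix):
    """Return the first right singular vector (unit length) of matrix."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[0]

# Made-up data: the first variable is constant, so all of the
# variation is in the second variable.
data = np.array([[10.0, 1.0],
                 [10.0, 2.0],
                 [10.0, 3.0]])

uncentered_pc1 = first_direction(data)
centered_pc1 = first_direction(data - data.mean(axis=0))

# The sign of a singular vector is arbitrary, so compare absolute values.
# Uncentered: about 0.98 on the first variable, which does not vary at all.
print(np.round(np.abs(uncentered_pc1), 2))
# Centered: all of the weight is on the second variable.
print(np.round(np.abs(centered_pc1), 2))
```

In this example the uncentered fit mostly points from the origin toward the mean at (10, 2), while the centered fit lines up with the only variable that actually changes.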