Design Matrices For Linear Models, Clearly Explained!!!


For a complete index of all the StatQuest videos, check out:

If you'd like to support StatQuest, please consider...

...or...

...buy my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...

...or just donating to StatQuest!

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:

#statquest #glm #statistics
Comments

Statistics never appealed to me since it always seemed boring ... until I started watching your videos a few days ago. Now I'm hooked. Thanks for making statistics so fun and intuitive to learn!

macroxela

SQ is so addictive. A simple concept-clarification YouTube search led me down hours and hours of SQ content. Thank you, thank you, thank you!

fanzhang

WOWW. I have watched this video at least 5 times and I always learn something new!
I was confused by people saying "regress out the batch effect", but it's really that simple!!!
Thanks Josh.

taotaotan

Thank you! I can't believe how clearly you are explaining this, seriously thank you!

leylayim

I'm so happy when I look for a topic and see that you've covered it.

MsDontBlink

"turning something on by letting it be" - some proper life advice there

paulpaschert

I never get tired of watching your videos, I have learned a lot. This is my favorite channel :)
Would you consider making a video on assessing the significance of mixed models? Please! This topic is complicated.

laurag.

6:06, having flashbacks to weeks one and two of Andrew Ng's ML Coursera course, but now it feels more intuitive!

GregSteg

In the last part where you combine the linear regression and the t-test, you have a regression line for each category, but the slopes of the lines are identical. Isn't this rare? How would the equation change if you had two lines with different slopes?

dbarkan
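(A rough sketch of the usual way to handle different slopes, with made-up data: add an interaction term, so the model becomes weight = B0 + B1*size + B2*type + B3*(size x type). One type then has slope B1 and the other has slope B1 + B3; if the estimate of B3 isn't significantly different from 0, the shared-slope model is reasonable. The variable names below are just illustrative, not from the video.)

# made-up data where the mutant line has a steeper slope than the control line
set.seed(1)
size <- runif(20, 1, 10)
type <- factor(rep(c("control", "mutant"), each = 10))
weight <- 2 + 0.5 * size + 1.0 * (type == "mutant") +
          0.3 * size * (type == "mutant") + rnorm(20, sd = 0.2)

# size * type expands to size + type + size:type (the interaction term)
fit <- lm(weight ~ size * type)
summary(fit)
# size            -> slope of the control line
# size:typemutant -> how much the mutant slope differs from the control slope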

All the topics on StatQuest are well explained. Thank you, sir, for this nice statistics channel. Good wishes and a happy journey for this successful StatQuest YouTube channel.

sudinroy

Wow, so helpful. This cleared up my confusion about combining and interpreting categorical and continuous predictors. Thanks a ton :)

amoghbharadwaj

BAM!!! crystal clear explanation! Thanks!

jaychan

7:46
Compare mean model vs type-only model: p > 0.05

12:17
Compare size-weight-type model vs type-only model: p = 0.0025

karannchew
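(For anyone who wants to try this kind of nested-model comparison in R, anova() on two fitted models gives an F-test and a p-value; a small sketch with made-up data and variable names:)

# made-up data: weight depends on size plus an offset for type
set.seed(2)
size <- runif(30, 1, 10)
type <- factor(rep(c("control", "mutant"), length.out = 30))
weight <- 1 + 0.4 * size + 0.8 * (type == "mutant") + rnorm(30, sd = 0.3)

type_only <- lm(weight ~ type)          # the simpler model
with_size <- lm(weight ~ size + type)   # the fancier model with size added

# F-test for whether the extra term explains enough additional variation
anova(type_only, with_size)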

Great video (as usual)! You're definitely one of my favorite "thing-explainers" I've come across :D

I was left with a question near the end though, with respect to "correcting for batch effects". After a quick online search, I see this is a common issue and that there are many packages that attempt to correct it.

I could imagine two scenarios that lead to different explanations for the difference:
i) "We ran the exact same protocol in two different labs. However, the sensors were differently calibrated, so there is a bias in readouts." -> This suggests the batch-effect correction.
ii) "We ran the exact same protocol in two different labs. We ensured the sensors were equally calibrated, but there's *still* a bias in readouts." -> This could just be due to inherent variability in the sample, right? (It is probably not *too* likely for the data to be the same, just shifted down a bit. But it's possible!

Questions:
1) This correction *assumes* that the difference in batches is *not* due to inherent variability in the features we're measuring (but is instead due to e.g. technician error), right? There would be no way to *prove* it one way or the other, would there?

2) If it's (ii), wouldn't "correcting for batch effects" throw out useful information about the response variable's distribution?

3) Ideally, hopefully both labs calibrated their sensors via e.g. blanks, so (i) shouldn't immediately be the reason. How would you suggest teasing out sensor bias (i) vs sample variability (ii)? Would we have to assume a model for the data and compare whether Lab A's two groups' parameters significantly differ from Lab B's? (Or maybe the "ideal" situation happens infrequently enough that going for (i) is usually not unreasonable?)


Thanks again! Will continue to Quest On :D

dainegai

So clear!!! Thank you for resolving my confusion in such a simple way!

summerxia

I love you, it's out now, but I need a longer intro. :)

gren

Do a StatQuest on Wald's test, the chi-squared test, and Fisher's exact test please!!

parthbhardwaj

2:40 It might be because the standard needs only one bit to represent both values, since the only change is the 2nd bit. We can just ignore the 1st bit while storing, thus reducing the size.
Just a speculation.

sharan

First of all props for your excellent series of videos. First rate introduction to some really hard stuff.

An answer to your question at 2:46 is that the problem in general implies two distinct linear equations, but algorithms for linear models (e.g. the lm() function in R) only allow for one general linear equation. So yes, you can solve it by hand using two separate linear equations, but the algorithm won't let you enter the problem in that format.

So how do you get around this problem of having only one linear equation to work with when you have two linear equations in reality?

Ans: Break up the linear model by using dummy variables in a thoughtful manner
or let the algorithm do it for you but check that it's not messing with you.

Either way, you've got to know what's going on, and here's one explanation....
 
For the mutant/control example we have the following linear model:
(Note: i should be read as a subscript for the ith term, and e is the error term. So ei is NOT some madness in the complex plane but simply the ith error term. If ei throws you, just treat it as a symbol for the irreducible error (the noise that is always around). That's all it is.)

yi = B0 + B1 xi + ei (eq 1) (pretty much y = b + mx with e as some reality thrown in)

where B1 is the slope of the line, xi is its associated input value (e.g. the labels mutant and control, recoded as numbers in this example), and B0 is the y-intercept. As you can see, there is no input variable attached to B0, so we cannot directly associate input values with B0; this explains why the first column of the design matrix is fixed to all ones. It is essentially saying that B0 applies to every observation i, and it's up to the gods of regression to determine what B0 becomes.

All is not lost though. Nothing says we can't monkey with the linear model (eq 1) through its variable xi in a creative way that ends up associating B0 with a label. And that's what we're going to do. But first we need to deal with the issue that our labels are not numbers and this creates an opening for some linear equation monkey business without defying the gods.

Since our equation won't work on labels, we need to assign numerical values (dummy variables), and by selecting the appropriate dummy variables for our labels, we can separate the general equation into two separate equations, each of which corresponds uniquely to one label. A word of caution though: how we select our dummy variables determines how our labels get assigned to the separate equations, so it's not an arbitrary choice.

So let's try the following:

xi = 1 if i is a mutant
xi = 0 if i is a control

(0 and 1 are the dummy variables and here we are assigning actual numerical values to xi. These are the values assigned in the second column of the design matrix)

in which case yi = B0 + B1 xi + ei (eq 1) becomes

yi = B0 + B1 + ei if i is a mutant (eq 2)
yi = B0 + ei if i is a control (eq 3)

(Notice that there is no longer any separate xi term in eqs 2 & 3 since xi has been assigned dummy variable values)

and this allows us to interpret our controls relative to B0 whereas our mutants correspond to B0 + B1. Pretty slick and no lightning bolts from above.

In this case B0 is the mean for the controls (the intercept in the summary report), whereas B0 + B1 is the mean for the mutants. It's important to note that B1 (what is returned second in the summary report) is the mean difference between mutants and controls (ie mean of mutants - mean of controls). If the p-value for B1 is significant, that means adding the difference of mutants - controls to our model is significant with respect to the controls alone (ie the mutants are different relative to the controls, which is what we're interested in).

Now if we switched our dummy variables, xi=1 for controls and xi=0 for mutants, then B0 would be the mean for mutants; B0 + B1 is the mean for the controls; and B1 is the mean of controls-mean of mutants (ie got reversed). If the p-value for B1 is significant here that means adding the difference of controls-mutants to our model is significant with respect to MUTANTS alone (ie controls are different relative to mutants so equivalent to what we want but is kinda upside down and weird). To get totally weird we could assign xi=1 for mutants and xi=-1 for controls then B0 would be the overall average for the combination of mutants and controls.

Bottom line: how we set our dummy variables determines how we can interpret B0 (as well as B0 + B1 and B1), and it's a slick trick that allows us to separate our single linear model into two linear equations that uniquely correspond to our labels.
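Here is a tiny R sketch of the 0/1 coding described above, with made-up numbers (everything here is just illustrative):

# made-up measurements: 4 controls and 4 mutants
y    <- c(2.1, 1.9, 2.3, 2.0, 3.4, 3.6, 3.1, 3.5)
type <- factor(rep(c("control", "mutant"), each = 4))

# the design matrix lm() builds: a column of ones for B0 and a 0/1 dummy column for B1
model.matrix(~ type)

fit <- lm(y ~ type)
coef(fit)
# (Intercept) = mean of the controls           (B0)
# typemutant  = mean(mutants) - mean(controls) (B1)
mean(y[type == "control"])                                # matches the intercept
mean(y[type == "mutant"]) - mean(y[type == "control"])    # matches typemutant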

An Introduction to Statistical Learning by Gareth James et al. gives some nice examples of this on pg 84 at this level of math. It's available for free online and also provides details on how to assess the quality of your model, which is critical.

And if you've gotten this far... Hey Josh, how about some banjo??? Some Ola Belle Reed would fit nicely: I've endured, I've endured, how long can one endure!!!!

briankirk

13:32 About the term for difference(mutant - control), is that an average of Lab A's difference(mutant - control) and Lab B's difference(mutant - control)?

Russet_Mantle