The most common mistake in biostatistics


*Technical side note: zero-inflated simply means you have a distribution with an excess of zeroes. To model zero-inflated data, it's common to use models that combine logistic regression (to predict 0 vs. 1+) with a count model such as Poisson regression (to predict the 1+ values). So it's technically incorrect to say that a distribution with this characteristic (combining yes/no with degree) is "zero-inflated." In other words, zero-inflated describes the distribution, not how the distribution is modeled. Make sense?
Comments

I'm not a Statistician or a Biostatistician, and I'm not even good at Math, but your explanation was so crystal clear even I can understand it. Sweet!
And I've had Senior Level Management folk - VPs, SVPs - from major Big Pharma companies ask me to keep hacking away at data that is as plain as daylight, like the Continuous Variable Distribution you showed in this video, and I keep asking myself: "Am I so stupid? Am I missing something obvious?"
After all, the data is being summarized and showing whatever it's showing, but somehow the big folks want it to show something else. And I'm always like "what else do you want it to show? It is what it is!" Of course, I swallow my pride and hide my impatience because maybe, just maybe, I'm really stupid.
But after months of slicing and dicing data into invisible chunks, it always comes back to where I started. Scary!
Thanks again for making advanced topics palatable for myself and others like me. It gives us hope.

trini-rtxn

Medical doctors are indoctrinated to think of the world in terms of decision points and cutoffs. This is why the doctor demands discretization of a continuous response. He wants to have a decision point, where test results above that point indicate treatment. Continuous distributions are much more challenging to deal with. If you have a guy whose test score is in the middle of the pack, what do you do, give him half the treatment? What if the treatment is a surgical procedure? This is why doctors demand cutoffs. They are not morons, they just have a different set of priorities and constraints.

PATRICKCONNOLLY-ubvb

Yeah, an "optimal cutoff" requires a well-defined optimization problem. It requires an objective function to be either minimized or maximized. Vaguely pointing at a continuous empirical distribution does not constitute such clarity.
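To make the point about objective functions concrete, here is a minimal sketch in Python. The scores, labels, and cost values are invented for illustration, and `best_cutoff` is a hypothetical helper, not a method from the video. It shows that the "optimal" cutoff on the same data moves when the objective (here, a weighted misclassification cost) changes.

```python
def misclassification_cost(scores, labels, cutoff, cost_fp, cost_fn):
    """Total cost of the rule 'treat if score >= cutoff'."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    return cost_fp * false_pos + cost_fn * false_neg


def best_cutoff(scores, labels, cost_fp, cost_fn):
    """Return the observed score that minimizes the stated cost."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda c: misclassification_cost(scores, labels, c, cost_fp, cost_fn),
    )


# Invented toy data: higher scores are more often true cases (label 1).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# Objective A: a missed case is five times as costly as a false alarm.
cutoff_when_misses_costly = best_cutoff(scores, labels, cost_fp=1, cost_fn=5)

# Objective B: a false alarm is five times as costly as a missed case.
cutoff_when_alarms_costly = best_cutoff(scores, labels, cost_fp=5, cost_fn=1)

print(cutoff_when_misses_costly, cutoff_when_alarms_costly)
```

The two objectives pick different cutoffs from identical data, which is why a cutoff cannot be called optimal until the costs are written down.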

galenseilis

I used the book by Imbens and Rubin (2016) to measure treatment effects in my MSc thesis. It uses sub-classification based on the propensity score (PS), a continuous variable. The sample is split on the median such that the average PS of the treated is equal to the average PS of the controls in each stratum. The results are somewhat sensitive to how the sample is stratified, but the stratification is done using a very specific algorithm. I would be interested to hear your thoughts on that book.

Note that the PS in my sample was analytically derived, not estimated.

yiannisspanos

This is a Nobel Prize in Languages right here.

hamidjess

The inflection idea to make language continuous is something we already do. Daft and Lengel talk about media richness. Papers are less rich than talking because tone and inflection don’t come through in a paper. That’s a huge simplification of their premise, but that’s the gist.

zimmejoc

It is interesting to consider how to model a random variable X that quantifies an extent when we also have an indicator variable Y that tells us whether there was any extent at all. If we have labels, then we can model X * Y. Without labels, I think a mixture could be reasonable.
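As a rough illustration of the labeled X * Y case, here is a minimal two-part sketch in Python. The exponential distribution for the positive part, the parameter values, and the function names are all assumptions chosen for the example. It estimates P(Y = 1) and the mean of X among the nonzero cases separately, then multiplies them to recover the overall mean.

```python
import random
import statistics


def simulate_two_part(n, p_nonzero, mean_positive, seed=0):
    """Draw n values of X * Y, with Y ~ Bernoulli and X ~ exponential."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        y = 1 if rng.random() < p_nonzero else 0      # any extent at all?
        x = rng.expovariate(1.0 / mean_positive)      # how much extent
        observed.append(x * y)
    return observed


def fit_two_part(observed):
    """Estimate the yes/no part and the degree part separately."""
    positives = [v for v in observed if v > 0]
    p_hat = len(positives) / len(observed)            # estimate of P(Y = 1)
    mean_pos_hat = statistics.fmean(positives)        # estimate of E[X | Y = 1]
    return p_hat, mean_pos_hat, p_hat * mean_pos_hat  # last term: E[X * Y]


observed = simulate_two_part(n=5000, p_nonzero=0.3, mean_positive=4.0)
p_hat, mean_pos_hat, overall_mean = fit_two_part(observed)
print(p_hat, mean_pos_hat, overall_mean)
```

Keeping the two parts separate lets each one have its own predictors in a fuller model, which is the appeal of this decomposition over forcing everything into a single category.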

If you have labels for some but not all of the data, then that sounds like a missing data problem. There you should consider whether the missingness is MCAR, MAR, or MNAR. If it is one of the former two, then model-driven imputation may be possible. All the better if you impute a probability distribution over the missing values rather than filling in just one value.

galenseilis

As always, the vlog is excellent. It brings to mind a quote from Frank Harrell on categorisation: "Employ it when the intention is to mislead the reader" ;-)

antoniobarros

I am really interested in using the continuous findings along with the categorical ones when we have to make a decision, as you said. I would really appreciate it if I could find a paper that demonstrates this approach or uses it.

anasbit

I think what actually stops me from ever using median splits is that the decisions I help people make with statistics don't involve the median. It just doesn't have any relevance to the problems I work on.

galenseilis

I've found that many people create categorical variables to use as predictors in logistic regression models so that the coefficient on the logit scale can be easily interpreted as an odds ratio. What they don't realize is that the variable can be recoded to keep its continuous distribution but transformed so that a value of 0 corresponds to the 25th percentile and a value of 1 corresponds to the 75th percentile. You can then still interpret the coefficient much as you would for a binary variable (it compares the 75th percentile with the 25th), but at least you do not lose statistical power by discarding the natural variability of an informative covariate.
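The rescaling described above can be sketched in a few lines of Python. The ages and the per-unit coefficient below are invented for illustration, and the helper names are hypothetical. After the transformation, the 25th percentile maps to 0 and the 75th percentile maps to 1, so a logistic coefficient on the rescaled variable is the log odds ratio for an interquartile-range change.

```python
import math
import statistics


def iqr_rescale(values):
    """Shift and scale so the 25th percentile is 0 and the 75th is 1."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    scaled = [(v - q1) / (q3 - q1) for v in values]
    return scaled, q1, q3


def iqr_odds_ratio(beta_per_unit, q1, q3):
    """Odds ratio comparing the 75th with the 25th percentile.

    beta_per_unit is a logistic coefficient per original unit; multiplying
    by the interquartile range gives the coefficient on the rescaled variable.
    """
    return math.exp(beta_per_unit * (q3 - q1))


# Invented example data: ages in years.
ages = [23, 31, 35, 38, 42, 47, 51, 56, 60, 68, 74]
scaled_ages, q1, q3 = iqr_rescale(ages)

# Hypothetical coefficient of 0.04 log-odds per year of age.
or_iqr = iqr_odds_ratio(0.04, q1, q3)
print(q1, q3, or_iqr)
```

Every observation keeps its own value on the rescaled axis, so nothing is collapsed into bins; only the unit of the coefficient changes.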

McDreamyn_mdphd

“But previous literature did” hm ok yeah let’s shy away from that excuse

royals

I always liked Fuzzy Logic as a framework for interpreting linguistic categorical concepts continuously and vice versa. 'Fuzzification' and 'Defuzzification' are permanent fixtures in my toolbox for when I need to explain the ideas that you talk about here.

NicholasBerryman-zsip

Please also discuss the problems when categorical data are analysed as continuous data. Thank you for your videos.❤

nl

I could see someone trying to partition the data if they saw a bimodal distribution and no apparent labels to explain that bimodality, but I would still prefer a mixture model. A mixture model allows the assignment of probabilities to the apparent subpopulations.
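As a sketch of the mixture idea, the following Python fits a two-component Gaussian mixture with a plain EM loop. The simulated data, starting values, and function names are assumptions made for the example, and a real analysis would more likely use an established library. The point is that each observation gets a membership probability instead of a hard label.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(data, iterations=200):
    """Fit a two-component Gaussian mixture by expectation-maximization."""
    lo, hi = min(data), max(data)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # crude start
    sds = [(hi - lo) / 4.0, (hi - lo) / 4.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update parameters using those probabilities as weights.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # guard against collapse
    return weights, means, sds


def membership_probability(x, weights, means, sds):
    """Probability that x belongs to the second (higher-mean) component."""
    dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
    return dens[1] / sum(dens)


# Simulated bimodal data with no labels: two subpopulations of 300 each.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(5.0, 1.0) for _ in range(300)]

weights, means, sds = fit_two_gaussians(data)
print(means, membership_probability(2.5, weights, means, sds))
```

A value near the valley between the two modes gets an intermediate probability, which a hard partition of the data would hide.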

galenseilis

Totally agree in theory, but docs love ORs and the Titanic turns slowly. How can I better communicate interpretability of betas if I keep the outcome continuous? “For each year older the kiddo is, we see delay to initial imaging increase by 1.6 days.” The blank stares haunt my dreams.

swinginkeke

I categorized my gene expression into low, medium, and high because we have no idea how to analyze something without resorting to p-value comparisons (is there a difference in the means, plug in whatever model you have depending on normality).

planetary-rendez-vous

If all you know is ANOVA, what would you do instead?

DistortedV

Diagnostic criteria require an optimal cutoff. Those cutoffs are not arbitrary or determined by one dataset (the focus of researchers). Clinicians often conceptualize the data continuously (e.g., pre-diabetic, higher risk for cardiovascular disease, pre-clinical risk for stress-mediated chronic disease development), but patients want to know if they have a condition or not (category). Clinical scientists don't categorize everything because we only know how to use ANOVAs, but what a condescending standpoint. Eliminating categorical cutoffs eliminates diagnoses. I'm good with that, but really, as a patient, are you?

fruithillfarm

Myanmese (or Burman, depending on who you ask).

tulipped