My top 50 scikit-learn tips

If you already know the basics of scikit-learn, but you want to be more efficient and get up-to-date with the latest features, then THIS is the video for you.

My name is Kevin Markham, and I've been teaching Machine Learning in Python with scikit-learn for more than 8 years. Over the next 3 hours, I'm going to share with you my top 50 scikit-learn tips.

Each tip is 2 to 8 minutes long, and you can use the timestamp links below to skip ahead if you're already familiar with a particular tip.

50 TIPS:
0:00 - Introduction
1:03 - 1. Transform data with ColumnTransformer
4:19 - 2. Seven ways to select columns
8:18 - 3. "fit" vs "transform"
10:53 - 4. Don't use "fit" on new data!
15:05 - 5. Don't use pandas for preprocessing!
19:00 - 6. Encode categorical features
24:07 - 7. Handle new categories in testing data
27:16 - 8. Chain steps with Pipeline
30:19 - 9. Encode "missingness" as a feature
33:12 - 10. Why set a random state?
35:40 - 11. Better ways to impute missing values
41:22 - 12. Pipeline vs make_pipeline
44:08 - 13. Inspect a Pipeline
47:03 - 14. Handle missing values automatically
49:47 - 15. Don't drop the first categorical level
54:15 - 16. Tune a Pipeline
1:01:09 - 17. Randomized search vs grid search
1:05:42 - 18. Examine grid search results
1:08:10 - 19. Logistic regression tuning parameters
1:12:41 - 20. Plot a confusion matrix
1:15:37 - 21. Plot multiple ROC curves
1:17:21 - 22. Use the correct Pipeline methods
1:18:59 - 23. Access model coefficients
1:20:11 - 24. Visualize a decision tree
1:23:57 - 25. Improve a decision tree by pruning it
1:25:23 - 26. Use stratified sampling when splitting data
1:29:40 - 27. Impute missing values for categoricals
1:32:10 - 28. Save a model or Pipeline
1:33:47 - 29. Add multiple text columns to a model
1:35:35 - 30. More ways to inspect a Pipeline
1:37:28 - 31. Know when shuffling is required
1:42:32 - 32. Use AUC with multiclass problems
1:46:04 - 33. Create custom features with scikit-learn
1:50:03 - 34. Automate feature selection
1:52:24 - 35. Use pandas objects with scikit-learn
1:53:37 - 36. Pass parameters as keyword arguments
1:55:23 - 37. Create an interactive Pipeline diagram
1:57:22 - 38. Get the names of transformed features
1:59:32 - 39. Load a toy dataset into pandas
2:01:33 - 40. View all model parameters
2:03:00 - 41. Encode binary features
2:06:59 - 42. Column selection tricks
2:10:02 - 43. Save time when encoding categoricals
2:16:53 - 44. Speed up a grid search
2:19:01 - 45. Create feature interactions
2:23:00 - 46. Ensemble multiple models
2:27:23 - 47. Tune an ensemble
2:31:22 - 48. Run part of a Pipeline
2:34:52 - 49. Tune multiple models at once
2:39:50 - 50. Solve many ML problems with one solution
Comments

Amazing as always. I've been following you since 2019, and every time there's something new.

KartikeyRiyal

Great video, very informative. Thank you so much for sharing.

akbarboghani

Awesome resource for Machine Learning. Thanks!

maziarjamshidi

24:08 handle_unknown='ignore'. A most useful tip! If only I'd read the docs. But I don't understand when you say to go back and include the previously unknown categories. How can you train on unknown data? Even if you include the unknown "labels" in your encoder, they will all be zero during training, because, obviously, they weren't in your training data. I think it's best to just leave it alone. If it wasn't in your training data, then it's probably a rare occurrence and you can just ignore it. Zeros in all known categories simplify what happens downstream. If you want to train on unknown data, you would need to use "dummy data" and set min_frequency or max_categories to give downstream modules something to work with.

philwebb
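
A minimal sketch of the behavior under discussion, with made-up data and assuming scikit-learn >= 1.2 (where the dense-output flag is named sparse_output): handle_unknown='ignore' encodes unseen categories as all zeros, while min_frequency combined with handle_unknown='infrequent_if_exist' gives unknowns a real bucket to land in.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"]})
X_val = pd.DataFrame({"color": ["red", "green"]})  # "green" was never seen

# handle_unknown="ignore": unseen categories become all-zero rows
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe.fit(X_train)
print(ohe.transform(X_val))
# [[0. 1.]   "red"   -> its known-category column is 1
#  [0. 0.]]  "green" -> all known-category columns are 0

# Alternative: bucket rare training categories as "infrequent", and send
# unknown categories to that same bucket at transform time
ohe2 = OneHotEncoder(min_frequency=2, handle_unknown="infrequent_if_exist",
                     sparse_output=False)
ohe2.fit(X_train)
print(ohe2.transform(X_val))
# "red" -> [1, 0]; "green" -> [0, 1] (the infrequent bucket)
```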

This is a Masters level info on Data science.

uncledez

I'm a new subscriber. I'm so glad I found you. Amazing explanation!

Ahmed_Eid

Hello Kevin!
Thank you for your great work and tips.
Could you please add to the repository the notebooks for the tips that are missing? I suppose those are the ones that don't contain code.
Still, it would be great to have them included in some way so nothing is missing when someone wants to do a quick review.
Again, thank you so much for sharing!

tassoskat

2:10:00 Yeah, if you have the time and the determination, you could run DecisionTreeClassifier, then plot_tree, and look through it for conditions like name != value. Then you could use the order in which the decision tree "discovers" categories as the ordinal values for that feature, with 0 being first. You just need to write a custom transformer to preprocess your validation data and assign -1 to all unknowns. Another trick I've had success with is ordering by frequency, with 0 being the most frequent. In that case, your custom transformer should assign 0 to all unknowns. Easy-peasy.

philwebb
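
For the assign--1-to-unknowns part, a fully custom transformer may not even be needed: a sketch with made-up data, assuming scikit-learn >= 0.24, where OrdinalEncoder gained handle_unknown='use_encoded_value'. One caveat: unknown_value must differ from every fitted code, so the other trick (sending unknowns to 0, the most frequent category) really does need the custom transformer described above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})
X_val = pd.DataFrame({"city": ["LA", "Tokyo"]})  # "Tokyo" is unseen

# Order categories by training frequency: most frequent gets code 0
freq_order = X_train["city"].value_counts().index.tolist()  # ['NY', 'LA', 'SF']

oe = OrdinalEncoder(categories=[freq_order],
                    handle_unknown="use_encoded_value", unknown_value=-1)
oe.fit(X_train)
print(oe.transform(X_val))
# [[ 1.]   "LA" is the second most frequent -> code 1
#  [-1.]]  unseen "Tokyo" -> -1
```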

2:09:40 Hopefully you'll never have 200 columns to pass through, but I think specifying which columns to pass through makes your intent clearer. The default is remainder='drop', so it seems the scikit-learn authors thought so as well.

philwebb
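
For reference, a sketch of the two styles with hypothetical column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Explicit passthrough: intent is visible, and a typo'd column name fails loudly
ct_explicit = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["color"]),
    ("scale", StandardScaler(), ["price"]),
    ("keep", "passthrough", ["quantity"]),
])  # remainder="drop" by default: any other column is silently dropped

# Implicit passthrough: every unlisted column is kept untouched
ct_implicit = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["color"]),
    ("scale", StandardScaler(), ["price"]),
], remainder="passthrough")
```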

30:20 Missingness. So, what happens when a feature is fully populated in your training data, but has missing values in your validation data? Just bringing that up in case you don't get to it.

philwebb
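
The scikit-learn docs answer this one: with SimpleImputer(add_indicator=True), a feature with no missing values at fit time gets no indicator column even if values go missing at transform time, though the value is still imputed with the fitted statistic. A sketch with made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])      # column 0 is fully populated
X_val = np.array([[np.nan, 4.0]])    # column 0 now has a missing value

imp = SimpleImputer(strategy="mean", add_indicator=True)
imp.fit(X_train)
print(imp.transform(X_val))
# [[3. 4. 0.]]
# Column 0's NaN is imputed with the training mean (3.0), but no indicator
# column is created for it; only column 1 (missing at fit time) gets one.
```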

2:03:00 drop='if_binary' makes sense; otherwise you have two columns which are perfectly redundant, not just implied. At least it's a happy compromise. My only hesitation, without playing with it, is that the category order is probably alphabetical. If it assigned 0 to the most frequent category, then handle_unknown='ignore' would make sense. Otherwise, you're lumping unknowns in with the "least" alphabetical category. That's kinda silly.

philwebb
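
A quick sketch of both points, with made-up data and assuming scikit-learn >= 1.2: categories_ confirms that the order is alphabetical, so drop='if_binary' drops the alphabetically first level of each binary feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"subscribed": ["yes", "no", "yes"],
                  "color": ["red", "blue", "green"]})

ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe.fit(X)
print(ohe.categories_)
# [array(['no', 'yes'], ...), array(['blue', 'green', 'red'], ...)]
# Order is alphabetical: for binary "subscribed", 'no' is dropped and the
# single remaining column is 1 for 'yes'; "color" keeps all three columns.
print(ohe.transform(X))
```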

Please, I'd like a video on PSO with RF in Jupyter.

saharrichi

Hi, can you help me find an Android spyware dataset?

manalabughazaleh

Hey Kevin, why aren't you posting new videos anymore? :(

FabioRBelotto