My top 50 scikit-learn tips

If you already know the basics of scikit-learn, but you want to be more efficient and get up-to-date with the latest features, then THIS is the video for you.

My name is Kevin Markham, and I've been teaching Machine Learning in Python with scikit-learn for more than 8 years. Over the next 3 hours, I'm going to share with you my top 50 scikit-learn tips.

Each tip is 2 to 8 minutes long, and you can use the timestamp links below to skip ahead if you're already familiar with a particular tip.

50 TIPS:
0:00 - Introduction
1:03 - 1. Transform data with ColumnTransformer
4:19 - 2. Seven ways to select columns
8:18 - 3. "fit" vs "transform"
10:53 - 4. Don't use "fit" on new data!
15:05 - 5. Don't use pandas for preprocessing!
19:00 - 6. Encode categorical features
24:07 - 7. Handle new categories in testing data
27:16 - 8. Chain steps with Pipeline
30:19 - 9. Encode "missingness" as a feature
33:12 - 10. Why set a random state?
35:40 - 11. Better ways to impute missing values
41:22 - 12. Pipeline vs make_pipeline
44:08 - 13. Inspect a Pipeline
47:03 - 14. Handle missing values automatically
49:47 - 15. Don't drop the first categorical level
54:15 - 16. Tune a Pipeline
1:01:09 - 17. Randomized search vs grid search
1:05:42 - 18. Examine grid search results
1:08:10 - 19. Logistic regression tuning parameters
1:12:41 - 20. Plot a confusion matrix
1:15:37 - 21. Plot multiple ROC curves
1:17:21 - 22. Use the correct Pipeline methods
1:18:59 - 23. Access model coefficients
1:20:11 - 24. Visualize a decision tree
1:23:57 - 25. Improve a decision tree by pruning it
1:25:23 - 26. Use stratified sampling when splitting data
1:29:40 - 27. Impute missing values for categoricals
1:32:10 - 28. Save a model or Pipeline
1:33:47 - 29. Add multiple text columns to a model
1:35:35 - 30. More ways to inspect a Pipeline
1:37:28 - 31. Know when shuffling is required
1:42:32 - 32. Use AUC with multiclass problems
1:46:04 - 33. Create custom features with scikit-learn
1:50:03 - 34. Automate feature selection
1:52:24 - 35. Use pandas objects with scikit-learn
1:53:37 - 36. Pass parameters as keyword arguments
1:55:23 - 37. Create an interactive Pipeline diagram
1:57:22 - 38. Get the names of transformed features
1:59:32 - 39. Load a toy dataset into pandas
2:01:33 - 40. View all model parameters
2:03:00 - 41. Encode binary features
2:06:59 - 42. Column selection tricks
2:10:02 - 43. Save time when encoding categoricals
2:16:53 - 44. Speed up a grid search
2:19:01 - 45. Create feature interactions
2:23:00 - 46. Ensemble multiple models
2:27:23 - 47. Tune an ensemble
2:31:22 - 48. Run part of a Pipeline
2:34:52 - 49. Tune multiple models at once
2:39:50 - 50. Solve many ML problems with one solution
Comments

Amazing as always. I've been following you since 2019, and every time there's something new.

KartikeyRiyal

Great video, very informative. Thank you so much for sharing.

akbarboghani

Awesome resource for Machine Learning. Thanks!

maziarjamshidi

24:08 handle_unknown='ignore'. A most useful tip! If only I'd read the docs. But I don't understand when you say to go back and include the previously unknown categories. How can you train on unknown data? Even if you include the unknown "labels" in your encoder, they will all be zero during training, because, obviously, they weren't in your training data. I think it's best to just leave it alone. If it wasn't in your training data, then it's probably a rare occurrence and you can just ignore it. Zeros in all known categories simplify what happens downstream. If you want to train on unknown data, you would need to use "dummy data" and set min_frequency or max_categories to give downstream modules something to work with.

philwebb
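
A minimal sketch of the behavior under discussion, with made-up data and assuming scikit-learn >= 1.2 (where the dense-output flag is named sparse_output): handle_unknown='ignore' encodes unseen categories as all zeros, while min_frequency combined with handle_unknown='infrequent_if_exist' gives unknowns a real bucket to land in.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"]})
X_val = pd.DataFrame({"color": ["red", "green"]})  # "green" was never seen

# handle_unknown="ignore": unseen categories become all-zero rows
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe.fit(X_train)
print(ohe.transform(X_val))
# [[0. 1.]   "red"   -> its known-category column is 1
#  [0. 0.]]  "green" -> all known-category columns are 0

# Alternative: bucket rare training categories as "infrequent", and send
# unknown categories to that same bucket at transform time
ohe2 = OneHotEncoder(min_frequency=2, handle_unknown="infrequent_if_exist",
                     sparse_output=False)
ohe2.fit(X_train)
print(ohe2.transform(X_val))
# "red" -> [1, 0]; "green" -> [0, 1] (the infrequent bucket)
```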

This is a Masters level info on Data science.

uncledez

I'm a new subscriber. I'm so glad I found you. Amazing explanation!

Ahmed_Eid

Hello Kevin!
Thank you for your great work and tips.
Could you please add to the repository the notebooks for the tips that are missing? I suppose those are the ones that don't contain code.
Still, it would be great to have them included in some way so nothing is missing when someone wants to do a quick review.
Again, thank you so much for sharing!

tassoskat

2:10:00 Yeah, if you have the time and the determination, you could run DecisionTreeClassifier, then plot_tree, and look through it for conditions like name != value. Then you could use the order in which the decision tree "discovers" categories as the ordinal values for that feature, with 0 being first. You just need to write a custom transformer to preprocess your validation data and assign -1 to all unknowns. Another trick I've had success with is ordering by frequency, with 0 being the most frequent. In that case, your custom transformer should assign 0 to all unknowns. Easy-peasy.

philwebb
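
For the assign--1-to-unknowns part, a fully custom transformer may not even be needed: a sketch with made-up data, assuming scikit-learn >= 0.24, where OrdinalEncoder gained handle_unknown='use_encoded_value'. One caveat: unknown_value must differ from every fitted code, so the other trick (sending unknowns to 0, the most frequent category) really does need the custom transformer described above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})
X_val = pd.DataFrame({"city": ["LA", "Tokyo"]})  # "Tokyo" is unseen

# Order categories by training frequency: most frequent gets code 0
freq_order = X_train["city"].value_counts().index.tolist()  # ['NY', 'LA', 'SF']

oe = OrdinalEncoder(categories=[freq_order],
                    handle_unknown="use_encoded_value", unknown_value=-1)
oe.fit(X_train)
print(oe.transform(X_val))
# [[ 1.]   "LA" is the second most frequent -> code 1
#  [-1.]]  unseen "Tokyo" -> -1
```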

2:09:40 Hopefully you'll never have 200 columns to pass through, but I think specifying which columns to pass through makes your intent clearer. The default is remainder='drop', so it seems the scikit-learn authors thought so as well.

philwebb
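
For reference, a sketch of the two styles with hypothetical column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Explicit passthrough: intent is visible, and a typo'd column name fails loudly
ct_explicit = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["color"]),
    ("scale", StandardScaler(), ["price"]),
    ("keep", "passthrough", ["quantity"]),
])  # remainder="drop" by default: any other column is silently dropped

# Implicit passthrough: every unlisted column is kept untouched
ct_implicit = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["color"]),
    ("scale", StandardScaler(), ["price"]),
], remainder="passthrough")
```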

30:20 Missingness. So, what happens when a feature is fully populated in your training data, but has missing values in your validation data? Just bringing that up in case you don't get to it.

philwebb
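
The scikit-learn docs answer this one: with SimpleImputer(add_indicator=True), a feature with no missing values at fit time gets no indicator column even if values go missing at transform time, though the value is still imputed with the fitted statistic. A sketch with made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])      # column 0 is fully populated
X_val = np.array([[np.nan, 4.0]])    # column 0 now has a missing value

imp = SimpleImputer(strategy="mean", add_indicator=True)
imp.fit(X_train)
print(imp.transform(X_val))
# [[3. 4. 0.]]
# Column 0's NaN is imputed with the training mean (3.0), but no indicator
# column is created for it; only column 1 (missing at fit time) gets one.
```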

2:03:00 drop='if_binary' makes sense; otherwise you have two columns which are perfectly redundant, not just implied. At least it's a happy compromise. My only hesitation, without playing with it, is that the category order is probably alphabetical. If it assigned 0 to the most frequent category, then handle_unknown='ignore' would make sense. Otherwise, you're lumping unknowns in with the "least" alphabetical category. That's kinda silly.

philwebb
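
A quick sketch of both points, with made-up data and assuming scikit-learn >= 1.2: categories_ confirms that the order is alphabetical, so drop='if_binary' drops the alphabetically first level of each binary feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"subscribed": ["yes", "no", "yes"],
                  "color": ["red", "blue", "green"]})

ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe.fit(X)
print(ohe.categories_)
# [array(['no', 'yes'], ...), array(['blue', 'green', 'red'], ...)]
# Order is alphabetical: for binary "subscribed", 'no' is dropped and the
# single remaining column is 1 for 'yes'; "color" keeps all three columns.
print(ohe.transform(X))
```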

Please, I'd like a video on PSO with RF in Jupyter.

saharrichi

Hi, can you help me find an Android spyware dataset?

manalabughazaleh

Hey Kevin, why aren't you posting new videos anymore? :(

FabioRBelotto