How do I encode categorical features using scikit-learn?

In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?

In this video, you'll learn how to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step. You'll also learn how to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously. Finally, you'll learn why you should use scikit-learn (rather than pandas) for preprocessing your dataset.
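
As a rough sketch of the workflow described above (not the code from the video), the snippet below one-hot encodes two categorical columns with OneHotEncoder inside a ColumnTransformer, chains that with a model in a Pipeline, cross-validates the whole Pipeline, and then predicts on new rows. The toy data, the column names, and the choice of LogisticRegression are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns, one numeric column, one target.
df = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S", "C", "S", "Q", "C"],
    "Sex": ["male", "female", "female", "male", "male", "female",
            "female", "male", "female", "male", "female", "male"],
    "Parch": [0, 1, 0, 2, 0, 1, 0, 0, 2, 1, 0, 0],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
})
X = df[["Embarked", "Sex", "Parch"]]
y = df["Survived"]

# One-hot encode the categorical columns; pass the numeric column through unchanged.
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Embarked", "Sex"]),
    remainder="passthrough",
)

# Two-step Pipeline: preprocessing first, then the model.
pipe = make_pipeline(column_trans, LogisticRegression())

# Cross-validate preprocessing and model together.
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean())

# Fit on all the data, then predict on new rows that have the same columns.
pipe.fit(X, y)
X_new = pd.DataFrame({
    "Embarked": ["C", "S"],
    "Sex": ["female", "male"],
    "Parch": [0, 1],
})
print(pipe.predict(X_new))
```

Because the encoder lives inside the Pipeline, the new rows are encoded with the categories learned during fitting, so no separate preprocessing call is needed at prediction time.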

AGENDA:
0:00 Introduction
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
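
One commonly cited reason for the last agenda item can be sketched as follows; this is an illustration under assumptions, not necessarily the argument made in the video. pandas `get_dummies` encodes each DataFrame independently, whereas OneHotEncoder learns the categories once and reuses them on new data. The column name and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; "X" never appears in training.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
new = pd.DataFrame({"Embarked": ["C", "X"]})

# pandas encodes each frame on its own, so the resulting columns differ.
train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# OneHotEncoder remembers the training categories and applies them to new data.
# With handle_unknown="ignore", an unseen category becomes a row of zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)
train_encoded = ohe.transform(train).toarray()
new_encoded = ohe.transform(new).toarray()
print(train_encoded.shape, new_encoded.shape)
```

The encoded new data keeps the same number of columns as the training data, which is what a fitted model expects.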

Comments

The Legendary Data Science guy is back!

terryhenyo

For beginners:
When I first tried to complete an ML project, say a simple model based on logistic or linear regression, it used to take me about a month. As a beginner in Python, pandas, SQL, and the rest of it, I thought it would take me a long time to master, and that maybe I was a latecomer to this.
But a year on, thanks to Data School, Sentdex, Krish Naik, StatQuest, Thinkful Webinar, and more, I am surprised that all I need is a day or less to complete these projects.
Because of the meticulous analysis on Data School, that's where my GPS leads me when I need a deeper understanding.
Thank you Data School.

GoredGored

Your guide doesn't just cover basic code; it also covers very practical and useful functions. I want to sincerely thank you for your effort!

altunbikubra

There is something about your explanations that I just get instantly. You deserve an award.

liquid_absabs

OMG!!! I've just started ML on Kaggle in the past few weeks. There's a lot of information to absorb, but you teach us in the most understandable way and also cover an up-to-date question: why we should use scikit-learn instead of dummies. This video is extremely helpful and informative. Thank you a lot!!! Guess I'm gonna spend the rest of the day watching all of your videos.

hieungotrung

Thanks, this helps a lot. I was scratching my head over Pipeline and ColumnTransformer before this video.
Also, you've got a very soothing voice, and it helps me relax and really enjoy the learning.

nyk

00:58
1) It allows you to properly cross-validate a process rather than just a model. In other words, when you are doing cross-validation like cross_val_score, normally you just pass a model to it. Well, there are cases when that is not going to give you accurate results because you're doing the preprocessing outside of the cross-validation.
So a pipeline, generally speaking, is useful because you can cross-validate a process that includes
(a) *preprocessing* as well as
(b) *model building*.

fet
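
The point in the comment above, that a Pipeline lets you cross-validate a process rather than just a model, can be sketched by writing the folds out by hand. In this illustration the preprocessing is refit on each training fold only, which is also what `cross_val_score` does when it is given the Pipeline. The toy data, column names, and model choice are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and one numeric column.
X = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S", "C", "S", "Q", "C"],
    "Parch": [0, 1, 0, 2, 0, 1, 0, 0, 2, 1, 0, 0],
})
y = pd.Series([0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0])

pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Embarked"]),
        remainder="passthrough",
    ),
    LogisticRegression(),
)

# No shuffling, so the manual loop and cross_val_score see identical folds.
cv = KFold(n_splits=3)

# Manual version: a fresh copy of the whole process is fit on each training fold,
# so the encoder never sees the rows it is later scored on.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_pipe = clone(pipe)
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])
    manual_scores.append(fold_pipe.score(X.iloc[test_idx], y.iloc[test_idx]))

# Passing the Pipeline to cross_val_score does the same thing in one call.
auto_scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(manual_scores, auto_scores)
```

If the encoding were instead done once on the full dataset before cross-validation, each fold's preprocessing would have been learned partly from its own test rows, which is the situation the comment warns about.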

Why did I not stumble upon your videos earlier?

rommeltito

Preprocessing with a Pipeline was a complex topic for me to understand before watching this video. Thanks a lot for the video.

harshalkulkarni

I was looking for a clear explanation of Pipeline for a long time. You nailed it. A crystal clear explanation that I understood after watching just once. Thank you.

Rationalist-Forever

Thank you for speaking slowly. It’s nice to listen to a non-English speaking person

Putinka

Man I love you. I just love you. I love your videos. I love the way you explain things. I love the pace of your videos. I love everything. Thank you.

harshitarawat

You are the best tutor I have ever met, keep up the good work. Thank you.

chr

Just want to say thank you. I am a beginner and you teach much better than my professor.

Steven-sejd

THANK YOU for this tutorial! I was wandering around the web trying to solve unexpected errors that came from following, apparently, outdated tutorials. If I had landed on this tutorial first, it would have saved me around 4 hours of useless surfing. Thanks again.

amitsharma

My god I love your detailed solution. Even my 5yo sibling can understand it. Wonderful. Definitely worth a subscribe.

quocanhhbui

Thx Kevin, one of the best & simplest explanations of Pipeline.

Tothefutureand

You are a high quality TEACHER, thank you very much.

christianiheanacho

Thank you for this tutorial. I was working with logistic regression this week and was trying to figure out how to one hot encode for a categorical variable with hundreds of categories. I was getting 100% accuracy and precision so something wasn’t right. I’m going to try the steps that you outlined in this tutorial. Thanks.

jkore

Sir, just 5 minutes ago I visited your channel to ask you this same question, because it was difficult for me to encode multiple variables in Kaggle's house price prediction (advanced regression) dataset. Fortunately and surprisingly, you had posted exactly this. Thank you so much.

JainmiahSk