Python Tutorial: Transforming categorical variables

---
Now that we know which variables in our dataset are categorical, we can start transforming them into numerical ones.

To transform a categorical variable into a numeric one, we first have to understand its type. There are two types of categorical variables: ordinal and nominal. Ordinal variables have two or more categories that can be ranked or ordered. In our case that is the **salary** column, where the values clearly have a logical order.
The second type is nominal, where the categories do not have any intrinsic or logical order. An example of this kind of variable in our dataset is the **department** column, as its values clearly do not have any order or rank: the sales department is not higher than hr or vice versa, and so on.
The method for transforming a categorical variable depends on which of these two types it is.

For ordinal variables, we can encode the categories by converting each of them into a respective numeric value. There are 3 steps to accomplish that task in Python.
- First, we have to tell Python that the column salary is actually categorical. This is done using a method called **astype()**, which sets the type of the variable.
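
A minimal sketch of the ordinal case might look like the following. Only the first step (`astype()`) is named in the text. The other two steps shown here, fixing the category order with `cat.reorder_categories()` and reading the integer codes from `cat.codes`, are one common pandas way to finish the encoding and are an assumption, not necessarily the tutorial's own steps. The toy data and the levels `low`, `medium`, `high` are hypothetical as well.

```python
import pandas as pd

# Hypothetical toy data; the real dataset is not shown in the text.
data = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Step 1 (from the text): tell pandas that the column is categorical.
data["salary"] = data["salary"].astype("category")

# Step 2 (assumed): state the logical order of the categories.
data["salary"] = data["salary"].cat.reorder_categories(["low", "medium", "high"])

# Step 3 (assumed): replace each category with its integer code,
# which follows the order given in step 2.
data["salary_code"] = data["salary"].cat.codes

print(data)
```

With this ordering, `low` maps to 0, `medium` to 1 and `high` to 2, so the numeric codes preserve the rank of the original categories.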

The next categorical variable, department, is nominal, as there is no order or rank between departments. This means that the encoding approach used for ordinal variables is not useful anymore. In this case, the transformation should be accomplished through so-called dummy variables.

Dummy variables are variables that take only two values, 0 or 1. Let's say an employee is from the technical department. If we have a separate column for each department, then this employee will have a value of 1 in the column for technical and 0 in the columns of all other departments.
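
To make the idea concrete, a single dummy column can be built by hand with a comparison. This is only an illustrative sketch: the three employees and the department names are hypothetical.

```python
import pandas as pd

# Hypothetical employees; the department names are illustrative.
employees = pd.DataFrame({"department": ["technical", "sales", "hr"]})

# One dummy column built by hand:
# 1 if the employee is in the technical department, 0 otherwise.
employees["technical"] = (employees["department"] == "technical").astype(int)

print(employees)
```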

This means we will have to create a new dataframe where each department is a separate column and each row is a separate employee, with a 1 in the column of their own department and 0 in all other columns. While the task may seem confusing, it is very easy from a technical perspective thanks to a very nice pandas function called **get_dummies()**.
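
A short sketch of `get_dummies()` on a hypothetical department column is shown here. The data and the variable names `data` and `departments` are assumptions for illustration. Passing `dtype=int` asks for 0/1 values; without it, recent pandas versions return booleans.

```python
import pandas as pd

# Hypothetical toy data with one row per employee.
data = pd.DataFrame({"department": ["technical", "sales", "hr", "sales"]})

# One column per department; each row has a single 1
# in the column of that employee's department.
departments = pd.get_dummies(data["department"], dtype=int)

print(departments)
```

The resulting columns are the distinct department names in sorted order, and every row sums to 1 because each employee belongs to exactly one department.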

When dealing with dummy variables, one should be cautious of a phenomenon known as the dummy trap: a situation in which different dummy variables convey the same information. In this example, the sample employee is from the technical department, so that is the only column with a value of 1 in the first table. In the second table, the last column is dropped, but we can still tell that the employee is from the technical department because all the other departments have a value of 0. For that reason, whenever dummies are created in situations like this, one of them can be dropped, as its information is already contained in the others.
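
The redundancy can be checked directly. The sketch below, on hypothetical data, drops the technical column and then recovers it from the remaining ones. Which dummy to drop is a free choice; technical is used here only to mirror the example in the text.

```python
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({"department": ["technical", "sales", "hr"]})
departments = pd.get_dummies(data["department"], dtype=int)

# Drop one dummy column to avoid the dummy trap.
reduced = departments.drop("technical", axis=1)

# The dropped column is still recoverable:
# a row with 0 in every remaining column must belong to "technical".
recovered = 1 - reduced.sum(axis=1)

print(reduced)
print(recovered.tolist())
```

As an alternative, `pd.get_dummies(..., drop_first=True)` drops the first category automatically, which in this toy data would be hr, not technical.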

Ok, time to put this into practice.

#DataCamp #PythonTutorial #Human #Resources #Analytics #Predicting #Employee #Churn #Python