Python Tutorial: Handling missing values
---
In the previous lesson, you were introduced to the two null value types that you encounter in Python. In this lesson, you will assign null values to the missing values in the dataset!
Missing values in a dataset usually aren't left blank. Instead, they are filled with dummy values such as 'NA', '-' or '.'.
In this lesson, you will learn to detect such missing values as well as replace them with 'NaN'.
The first step in analyzing the dataset is to read and print a snippet of the dataset. We'll print the head of the 'college' DataFrame.
At first glance, all the columns appear to hold float values.
If you look closely, though, you can see that a few data points are filled with a period! This suggests that missing values might be represented by a period.
However, we can confirm this only through further analysis. We'll use the info() method to get a gist of the dataset.
Hey, something's odd here! All the columns except 'private' are of 'object' type, although they are supposed to be float.
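As a rough illustration of this check, here is a minimal sketch. The tiny inline CSV and the column names 'gradrat' and 'csat' are made up for the example and only stand in for the real 'college' file, which isn't shown in this lesson.

```python
import io
import pandas as pd

# Hypothetical stand-in for the college data (not the real file).
csv_text = """gradrat,csat,private
59.0,1052,1
.,990,0
71.5,.,1
"""

college = pd.read_csv(io.StringIO(csv_text))

# Print a snippet of the data, then a summary of column types.
print(college.head())
college.info()

# Columns that contain a '.' are read as text rather than numbers.
text_columns = [
    col for col in college.columns
    if not pd.api.types.is_numeric_dtype(college[col])
]
print(text_columns)
```

In this sample, only 'private' is read as numeric, because it is the only column without a period in it.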
We can further explore and confirm by finding the unique values in one of the columns. This way we can find any non-numerical values!
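One way to look for non-numerical values might be the sketch below. The 'csat' column and its values are invented for illustration. Using `pd.to_numeric` with `errors="coerce"` is just one possible way to isolate the entries that can't be parsed as numbers.

```python
import pandas as pd

# Hypothetical column read as text because of a stray '.'.
college = pd.DataFrame({"csat": ["1052", "990", ".", "1120"]})

# List every distinct value in the column.
unique_vals = college["csat"].unique()
print(unique_vals)

# Keep only the values that cannot be converted to a number.
is_bad = pd.to_numeric(college["csat"], errors="coerce").isna()
non_numeric = college.loc[is_bad, "csat"].unique()
print(non_numeric)
```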
Once the periods have been replaced with 'NaN', check the 'info()' of 'college' again. You'll find that all the columns are now of 'float64' type. This is great!
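The replacement could be done in several ways. The sketch below assumes the period really does mark missing data and declares it through the `na_values` argument of `pd.read_csv` while loading. The inline CSV and the column names are again hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for the college data (not the real file).
csv_text = """gradrat,csat,private
59.0,1052,1
.,990,0
71.5,.,1
"""

# Treat '.' as a missing-value marker while reading the file.
college = pd.read_csv(io.StringIO(csv_text), na_values=".")

college.info()
print(college.isna().sum())
```

Because the periods are now read as NaN, the affected columns can be parsed as floats instead of text.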
Now, let's consider another dataset to detect hidden missing values.
We will use the Pima Indians Diabetes dataset, which contains various clinical diagnostic information about patients from the Pima community. After loading the dataset, you can observe 'NaN' values for missing data when you print the head of the DataFrame.
As before, let's print the 'info()' of the 'diabetes' DataFrame. The columns are all of 'float' or 'int' type, as expected.
Further, we can analyze using the 'describe()' method on the 'diabetes' DataFrame.
Observe closely. Something very odd here is that the 'BMI' column has a minimum value of 0. But we are aware that BMI cannot be 0. Hence, the 0's must rather be missing values in disguise!
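A small sketch of this check follows. The four rows are invented, and only the 'BMI' column name comes from the lesson. The 'Glucose' column is an assumption added for contrast.

```python
import pandas as pd

# Hypothetical rows standing in for the diabetes data.
diabetes = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 0.0, 26.6, 0.0],
})

# Summary statistics; the 'min' row exposes implausible values.
summary = diabetes.describe()
print(summary.loc["min"])
```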
To confirm this, we can filter all the rows where 'BMI' is 0. There are 11 rows that have a BMI of 0. They must be missing values.
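The filter can be sketched with a boolean mask, as below. The sample rows are made up, so the count here won't match the 11 rows found in the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the diabetes data.
diabetes = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 0.0, 26.6, 0.0],
})

# Select only the rows whose BMI is exactly 0.
zero_bmi = diabetes[diabetes["BMI"] == 0]
print(zero_bmi)
print(len(zero_bmi))
```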
These types of missing values can be tricky as they require some level of domain knowledge.
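Assuming domain knowledge confirms that a BMI of 0 is impossible, the zeros could be replaced with NaN along these lines. This is a sketch on invented rows, and `.loc` with a boolean mask is just one of several ways to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the diabetes data.
diabetes = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 0.0, 26.6, 0.0],
})

# Overwrite the disguised missing values with NaN.
diabetes.loc[diabetes["BMI"] == 0, "BMI"] = np.nan

print(diabetes["BMI"].isna().sum())
```

Only the 'BMI' column is touched here, since a 0 may be a perfectly valid value in other columns.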
Great! Now that we have successfully removed the hidden missing values and replaced them with 'NaN's, let's summarize what we learned in this lesson!
We learned to detect missing value characters like '.', to detect hidden missing values within the data like 0, and to replace both with NaNs.
In the next lesson, you'll dig deeper into analyzing the missing values. But it's now time to practice!
#PythonTutorial #DataCamp #Dealing #Missing #Data #Python