Python Tutorial: Exploratory data analysis

---
In this video, I will show you how we can use exploratory data analysis to help identify data that need further investigation.

The most basic analysis we can do is count the unique values in our data.

We can use the info method to get the data type of each column. Here, I will show you the frequency counts for the non-numeric continent, country and fertility columns, and the numeric population column.
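As a rough sketch of this first step, the snippet below builds a small made-up DataFrame (the rows and the name df are assumptions for illustration, not the data from the video) and inspects the column types.

```python
import pandas as pd

# Made-up rows for illustration only; not the dataset from the video.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Africa"],
    "country": ["Sweden", "Sweden", "Japan", "Kenya"],
    "fertility": ["1.9", "missing", "1.4", "4.4"],  # numbers stored as text
    "population": [9.5e6, 9.5e6, 1.27e8, 4.3e7],
})

# info prints each column's name, non-null count, and dtype.
df.info()

# The dtypes attribute holds the same type information as a Series.
column_types = df.dtypes
print(column_types)
```

In this toy frame, fertility shows up as a text column because one of its entries is not a number.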

To perform a frequency count, we first select the column we want to count.

If the column name contains no spaces or special characters and is not the name of a DataFrame method or attribute, we can select the column directly by its name using dot notation.

It works the same way as subsetting using bracket notation.

Once we have the column selected, we can use the value_counts method on the selected column.

I like to pass dropna=False, since value_counts will then also count the missing values if there are any.

The continent column has no missing values, so none will be reported.

value_counts returns the count for each unique value of a column, sorted in descending order of count.

Note that even though we counted a column of the object dtype, the results of value_counts will be of dtype int.
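A minimal sketch of the frequency count described above, assuming a made-up continent column rather than the real data:

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "continent": ["Europe", "Asia", "Europe", "Africa", "Europe", "Asia"],
})

# Select the column with dot notation, then count each unique value.
# dropna=False keeps missing values in the tally (there are none here).
continent_counts = df.continent.value_counts(dropna=False)
print(continent_counts)

# The counts are integers even though the column holds text.
print(continent_counts.dtype)
```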

Another way we can select columns is using the bracket notation. Here is the same code and output as before, this time using the bracket notation to select a column.
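The sketch below compares the two ways of selecting a column on a made-up DataFrame. The "life expectancy" column is a hypothetical example of a name that only bracket notation can reach.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "continent": ["Europe", "Asia", "Europe"],
    "life expectancy": [81.7, 83.1, 80.9],  # hypothetical column with a space
})

# Both selections return the same column, so the counts match.
by_dot = df.continent.value_counts(dropna=False)
by_bracket = df["continent"].value_counts(dropna=False)
print(by_bracket)
same_result = by_dot.equals(by_bracket)

# Bracket notation also handles names that dot notation cannot express.
life = df["life expectancy"]
```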

Now we will count the number of observations for each country in our data. Since there are too many countries to show at once, I am using the head method to only return the top 5 counts.

In this example I am chaining methods together: I select the column and get the value counts just like before, then call head on the result.

We expect each country to have only 1 observation, but Sweden has 2. This will require us to investigate this data point further.
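Here is a small sketch of that chained call, assuming a made-up country column in which one country appears twice. The extra filter at the end is one possible way to list only the repeated countries.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "country": ["Sweden", "Japan", "Kenya", "Sweden", "Chile", "Peru", "Fiji"],
})

# Chain the steps: select the column, count values, keep the top 5.
top_counts = df["country"].value_counts(dropna=False).head()
print(top_counts)

# List only the countries that appear more than once.
country_counts = df["country"].value_counts(dropna=False)
repeated = country_counts[country_counts > 1]
print(repeated)
```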

The fertility column is one we expected to be numeric, but it is stored as a string.

This is because the column contains the string value "missing", which is why the fertility column has the wrong data type. It also alerts us that we need to recode the "missing" string.

If your column has missing values, they will also be counted, provided you pass dropna=False. Here you see that we have 42 missing values in the column.
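The following sketch uses a made-up fertility column that contains both the string "missing" and real missing values. Recoding with pd.to_numeric and errors="coerce" is one possible approach and an assumption here; the video does not say which recoding method it uses.

```python
import numpy as np
import pandas as pd

# Made-up values: numbers stored as text, a "missing" string, and real NaNs.
fertility = pd.Series(
    ["1.9", "missing", "4.4", np.nan, "1.9", np.nan], name="fertility"
)

# dropna=False makes the NaN entries appear in the counts as well.
fertility_counts = fertility.value_counts(dropna=False)
print(fertility_counts)

# One possible recode: turn anything that is not a number into NaN.
fertility_numeric = pd.to_numeric(fertility, errors="coerce")
print(fertility_numeric.dtype)
```

After the conversion, the "missing" string is treated the same way as the other missing values and the column is numeric.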

Another type of EDA we can do is calculate summary statistics on numeric columns.

This can help us spot outliers in our data.

There are many working definitions for outliers. One definition is a value that is considerably higher (or lower) than the rest of the data. You can consult the DataCamp statistics course for more detailed definitions of outliers.

Outliers are observations of interest we want to investigate further for data cleaning.

We can quickly calculate summary statistics on our data by using the describe method. Only the columns that have a numerical type will be returned.

describe returns the number of non-missing values, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles of our data, where the 50th percentile is our median, and finally, the maximum value of our data.
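A minimal sketch of describe on a made-up DataFrame with one text column and one numeric column (all values are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; one population is missing.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "population": [1e6, 2e6, 3e6, 4e6, np.nan],
})

# Only the numeric column appears in the summary.
summary = df.describe()
print(summary)

# The 50% row is the median of the non-missing values.
median_population = summary.loc["50%", "population"]
```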

A quick scan down the population results shows that the maximum population value is 2.3 billion people. Our data comes from 2012, and no country had this population then.
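To follow up on a suspicious maximum like this, one option is to pull out the row that holds it. The sketch below assumes a made-up DataFrame with placeholder country names; only the 2.3 billion figure echoes the value mentioned above.

```python
import pandas as pd

# Made-up rows; country names are placeholders.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [9.5e6, 1.27e8, 2.3e9, 4.3e7],
})

# The largest value, as describe would report it in the max row.
max_population = df["population"].max()

# idxmax gives the index label of that value, so loc returns the full row.
suspect_row = df.loc[df["population"].idxmax()]
print(suspect_row)
```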

Now it's your turn to calculate descriptive statistics for exploratory data analysis to see what needs cleaning in your data.
