Python Tutorial: Visual exploratory data analysis


---
Visualizing our data is a great way to spot outliers and obvious errors. Data visualization is more than just looking at patterns and reporting data; it can also help us plan our data cleaning pipeline.

Here's the tabular form of summary statistics for a clean European Environment Agency dataset.

You can imagine how overwhelming the results can be on a dataset with many more columns.

This is why it is important to visualize our data: abnormal data points, such as the max population, immediately stand out in a plot.
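As a rough sketch of what that tabular summary looks like in pandas, we can call describe on a dataframe. The dataframe below is a made-up miniature stand-in, not the real dataset: the column names and the rounded values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature dataset; names and rounded values are illustrative only.
df = pd.DataFrame({
    "country": ["China", "India", "Australia", "France", "Kenya"],
    "continent": ["Asia", "Asia", "Oceania", "Europe", "Africa"],
    "population": [1.35e9, 1.25e9, 2.3e9, 6.5e7, 4.3e7],
})

# describe() reports count, mean, std, min, quartiles, and max
# for every numeric column.
summary = df.describe()
print(summary)
```

Even in this tiny table, the suspicious value only shows up as one number in the max row, which is easy to miss once there are many columns.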

We use bar plots to count discrete data, and histograms to count continuous data.

These plots will give us the ability to look at frequencies of our data, which can be used to look for potential errors.
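For discrete data, one way to get such a frequency plot is to count the values first and then draw the counts as bars. This is only a sketch: the continent column and its values are assumed for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # renders off-screen so the sketch runs anywhere; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical discrete column; values are illustrative only.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Oceania", "Europe", "Europe", "Europe", "Africa"],
})

# Count how often each category occurs, then draw one bar per category.
counts = df["continent"].value_counts()
ax = counts.plot(kind="bar")
ax.set_ylabel("number of rows")
plt.show()
```

A category with an unexpected count, or a misspelled category that shows up as its own bar, is a hint that something needs cleaning.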

Here's an example of a histogram. We take the dataframe and column of interest, and call the plot method. We pass 'hist' into the method to have pandas create a histogram.

We have to make sure matplotlib is loaded before we can show the plot.
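Put together, the histogram steps described above might look like the sketch below. The dataframe is a made-up stand-in for the real dataset, so the column name and the rounded values are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # renders off-screen so the sketch runs anywhere; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature dataset; values are rounded and illustrative only.
df = pd.DataFrame({
    "country": ["China", "India", "Australia", "France", "Kenya"],
    "population": [1.35e9, 1.25e9, 2.3e9, 6.5e7, 4.3e7],
})

# Select the column of interest and ask pandas for a histogram.
ax = df["population"].plot(kind="hist")
ax.set_xlabel("population")

# Nothing is displayed until we ask matplotlib to show the figure.
plt.show()
```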

Here is our histogram.

The x-axis shows the range of values that are counted, and the y-axis is how many observations in our data are in a particular range of values.

Our histogram shows 2 observations between 1 and 1.5 billion, and 1 observation above 2 billion.

This dataset comes from 2012. At that time, no country had more than 2 billion people.

We can slice our data to look for all data points where the population is greater than 1 billion people. We use bracket notation to slice our data, and inside we specify the condition we are interested in: countries with populations greater than 1 billion people. Our dataset reports Australia having 2.3 billion people.

This is a data error.
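The bracket-notation filter could be sketched like this. Again, the dataframe is a hypothetical miniature of the dataset; only the Australia figure of 2.3 billion comes from the example above, and the other values are rounded for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset; values are rounded and illustrative only.
df = pd.DataFrame({
    "country": ["China", "India", "Australia", "France", "Kenya"],
    "population": [1.35e9, 1.25e9, 2.3e9, 6.5e7, 4.3e7],
})

# The condition inside the brackets is a boolean mask:
# only the rows where it is True are kept.
over_one_billion = df[df["population"] > 1_000_000_000]
print(over_one_billion)
```

The filter only tells us which rows are extreme. Deciding which of them are errors still takes outside knowledge about the data.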

China and India are also outliers in our data, but their values are correct: these countries are actual outliers.

Not all outliers are bad data points. Some are errors, but others are valid values.

Let's see what this dataset looks like in other visualizations.

Box plots are a good way to visualize all the basic summary statistics in a single figure that we can use to quickly compare across multiple categories.

You can see the DataCamp statistics courses for more information.

We can spot outliers, the min/max of our data, and the 25th, 50th, and 75th percentiles of our data.

We create box plots by calling the boxplot method on the dataframe.

To plot just the population column, we pass it into the column parameter.

We pass the by parameter the name of the column whose categories we want to compare box plots across. Here I want a separate box plot for each continent in our data.

Remember to show the plot.
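The box plot steps above might be written as follows. This is a sketch on a made-up miniature dataset, so the column names and values are assumptions rather than the real data.

```python
import matplotlib
matplotlib.use("Agg")  # renders off-screen so the sketch runs anywhere; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature dataset; values are rounded and illustrative only.
df = pd.DataFrame({
    "country": ["China", "India", "Japan", "Australia", "France", "Germany", "Kenya", "Egypt"],
    "continent": ["Asia", "Asia", "Asia", "Oceania", "Europe", "Europe", "Africa", "Africa"],
    "population": [1.35e9, 1.25e9, 1.27e8, 2.3e9, 6.5e7, 8.0e7, 4.3e7, 8.5e7],
})

# column picks what to plot; by draws one box per continent.
df.boxplot(column="population", by="continent")

plt.show()
```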

Here is the box plot of our population column.

Most of our data is represented by the box. The lines that extend from the box are called whiskers. The ends of the whiskers show the maximum and minimum of our data excluding outliers, and outliers are the values shown beyond the whiskers.
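To make the whisker rule concrete: with matplotlib's default settings, the whiskers reach the furthest data points that lie within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an outlier. Here is a small sketch of that calculation on made-up population values (the numbers are illustrative only).

```python
import pandas as pd

# Hypothetical population values; nine ordinary countries and three very large ones.
population = pd.Series([
    2e7, 3e7, 4e7, 5e7, 6e7, 7e7, 8e7, 9e7, 1e8,
    1.25e9, 1.35e9, 2.3e9,
])

# The box spans the 25th to the 75th percentile.
q1 = population.quantile(0.25)
q3 = population.quantile(0.75)
iqr = q3 - q1

# Default fences: 1.5 * IQR beyond either edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points drawn beyond the whiskers.
outliers = population[(population < lower_fence) | (population > upper_fence)]
print(outliers)
```

Note that this rule flags all three large values in the same way. It cannot tell a real outlier from a data error.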

You can see the 3 massive outliers in our data. One of them we know is bad data; the other 2 are actual outliers.

Scatter plots are used to look at the relationship between 2 numeric columns. In the exercises, you will use such plots to flag potentially bad data, and to identify candidates for bad data that you would not see in histograms or box plots, such as countries with low literacy and low fertility.
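As a preview, a scatter plot of two numeric columns could be sketched like this. The column names female_literacy and fertility, the values, and the cut-offs of 50 and 2 are all hypothetical, chosen only to illustrate the idea.

```python
import matplotlib
matplotlib.use("Agg")  # renders off-screen so the sketch runs anywhere; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "female_literacy": [90, 85, 40, 30, 35],
    "fertility": [1.8, 2.1, 5.5, 6.0, 1.2],
})

# Plot one numeric column against the other.
ax = df.plot(kind="scatter", x="female_literacy", y="fertility")
plt.show()

# Rows that sit in an unusual corner of the plot can also be pulled out
# with a filter; the thresholds here are arbitrary assumptions.
suspects = df[(df["female_literacy"] < 50) & (df["fertility"] < 2)]
print(suspects)
```

Country E looks unremarkable in each column on its own. It only stands out when both columns are viewed together.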
