Python Tutorial: Highlighting data


---
Welcome! In this course, you will learn how to use Python to craft compelling and efficient data visualizations.

First, a tiny bit about me. I have had the pleasure of working as a data scientist and visualizer at places like the Johns Hopkins Data Science Lab, The New York Times, and Vanderbilt University.

My work focuses on building visualizations for large, complex datasets and model results.

Why is data visualization a crucial component of data science? If you've already decided to take this course, you probably don't need me to explain the virtues of data visualization. But for the sake of completeness: data visualization takes the raw data and results coming out of your data science workflow and turns them into tangible, intuitive representations.

The improvements brought by data visualization can be purely cosmetic, such as converting a simple table of results into a nicer-looking chart.

Or they can be completely necessary for understanding the data, such as visualizing geographic patterns in thousands of geo-located data points.

To make the visualizations in this course, we will use a combination of matplotlib and Seaborn. To get the most out of this course, you should be familiar with these packages along with basic data manipulation with pandas. The prerequisite courses for this course are fantastic resources for gaining these skills if you don't already have them.

Just because this course is taught with a few specific packages doesn't mean the lessons aren't valuable to users of other tools. Apart from the specific methods used to perform some actions, the techniques and best practices taught in this course can be applied in your chosen data science platform, such as R, spreadsheets, or any of the myriad other tools available.

Throughout the first three chapters of this course, we will use a fascinating dataset from the EPA on pollution levels across the United States.

Each row of the data contains the maximum observed air-pollution values for four different pollutants: carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and sulfur dioxide (SO2).

The data covers eleven cities across the United States, with values for every day of the year from 2014 through 2016.

Now on to visualization techniques. The first technique we're going to talk about is simple and is often extremely effective: highlighting specific data points to draw attention to them.

Often when you display a lot of data to an audience, you get caught in a conundrum. You want to show all of your data, but you also want people to focus on a specific interesting point or set of points.

The easiest way to show all the data is to do just that: plot all of the data points. But how do you keep a specific point of interest from getting lost, as in this scatter plot of NO2 against SO2 values? The simplest solution to this problem is to highlight it.

Here we see how to use a Python list comprehension to improve the plot from the previous slide by adding a highlight to our point of interest: in this case, the point corresponding to the 38th day of the year.

To highlight a point in Seaborn or matplotlib, we supply our plotting function with a vector of colors, one for each row of our data. The vector uses a uniform color for every point except the ones we want to highlight.
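
As a rough sketch of that idea (not the exact code from the slide), the list comprehension below builds one color per row and hands the result to matplotlib's scatter function. The readings are made up for illustration, and the variable names, the color names, and the `draw_highlighted_scatter` helper are all assumptions rather than part of the course materials.

```python
# Made-up readings standing in for the EPA data; one entry per day.
day = [36, 37, 38, 39, 40]
no2 = [21.0, 18.5, 35.2, 19.8, 22.1]
so2 = [3.1, 2.7, 9.4, 2.9, 3.3]

# One color per row: a muted default everywhere, an accent on day 38.
colors = ["orangered" if d == 38 else "lightgray" for d in day]


def draw_highlighted_scatter(x, y, point_colors):
    """Scatter x against y, coloring each point from point_colors."""
    # Imported here so the color logic above runs even without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # `c` accepts a sequence of colors with one entry per plotted point.
    ax.scatter(x, y, c=point_colors)
    ax.set_xlabel("NO2")
    ax.set_ylabel("SO2")
    return fig, ax
```

Calling `draw_highlighted_scatter(no2, so2, colors)` and then saving or showing the returned figure should draw every point in gray except the one for day 38.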

We can see our specific point clearly now.

This point could be a particularly interesting outlier or, more frequently, it may represent something important outside of the data visualization, such as the point corresponding to your company in the midst of its competitors.

Those familiar with matplotlib may ask why we aren't just drawing a second scatter plot of just our highlighted points. By using an array-based method, we can easily add more highlights to the plot or generate the highlights programmatically.
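
To make that concrete, here is a small sketch of how a color vector can grow to cover several highlights, chosen either by hand or by a rule. The `highlight_colors` helper, its parameters, the sample values, and the NO2 cutoff of 30 are all hypothetical and chosen only for illustration.

```python
# Made-up readings standing in for the EPA data; one entry per day.
day = [36, 37, 38, 39, 40]
no2 = [21.0, 18.5, 35.2, 19.8, 31.0]


def highlight_colors(days, values, days_of_interest=(), threshold=None,
                     accent="orangered", base="lightgray"):
    """Return one color per row, accenting chosen days and high values."""
    colors = []
    for d, v in zip(days, values):
        picked_by_hand = d in days_of_interest
        picked_by_rule = threshold is not None and v > threshold
        colors.append(accent if picked_by_hand or picked_by_rule else base)
    return colors


# Adding a second highlight is just one more entry in the set.
by_day = highlight_colors(day, no2, days_of_interest={36, 38})

# Highlights can also come from a rule, here a hypothetical NO2 cutoff.
by_rule = highlight_colors(day, no2, threshold=30.0)
```

Either list can be passed as the `c` argument of a single matplotlib scatter call, so the plotting code stays the same no matter how many points end up highlighted.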

Okay! Let's get to making some highlighted plots with our pollution data.
